System Name | Apollo |
---|---|
Processor | Intel Core i9 9880H |
Motherboard | Some proprietary Apple thing. |
Memory | 64GB DDR4-2667 |
Video Card(s) | AMD Radeon Pro 5600M, 8GB HBM2 |
Storage | 1TB Apple NVMe, 4TB External |
Display(s) | Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays |
Case | MacBook Pro (16", 2019) |
Audio Device(s) | AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers |
Power Supply | 96W Power Adapter |
Mouse | Logitech MX Master 3 |
Keyboard | Logitech G915, GL Clicky |
Software | macOS 12.1 |
I think you're underestimating the benefits of improving the cache hit ratio. Most of the time in the environment I work in, caching performance is what determines the bulk of overall performance, since latency is otherwise dominated by reaching out to do I/O. Granted, this is caching at a different level of the memory hierarchy, but the idea is the same. Every time you improve the hit ratio, you're improving performance because you're essentially taking a fraction of the time to do the same thing, not to mention that the pipeline won't get nearly as close to stalling.
That is indeed an assumption. Irrespective of that, I still don't think their core architecture is that much better; virtually all of SECS's tests are known to rely heavily on memory ops, favoring either huge caches or fast system memory.
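For what it's worth, the standard back-of-the-envelope model for this is average memory access time (AMAT) = hit time + miss ratio × miss penalty. Here's a tiny sketch of that arithmetic, with made-up latency numbers rather than figures for any real CPU:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only - not measurements of any real CPU. */
    const double hit_time = 4.0;       /* cycles for a cache hit   */
    const double miss_penalty = 200.0; /* cycles to go out to DRAM */
    const double hit_ratio[] = { 0.90, 0.95, 0.99, 0.999 };

    for (int i = 0; i < 4; i++) {
        /* AMAT = hit_time + miss_ratio * miss_penalty */
        double amat = hit_time + (1.0 - hit_ratio[i]) * miss_penalty;
        printf("hit ratio %.3f -> average access %.1f cycles\n",
               hit_ratio[i], amat);
    }
    return 0;
}
```

Even going from 99% to 99.9% hits cuts the average access time by roughly a third in this toy model, which is the point being made above.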
System Name | Good enough |
---|---|
Processor | AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge |
Motherboard | ASRock B650 Pro RS |
Cooling | 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30 |
Memory | 32GB - FURY Beast RGB 5600 MHz |
Video Card(s) | Sapphire RX 7900 XT - Alphacool Eisblock Aurora |
Storage | 1x Kingston KC3000 1TB, 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB, 1x Samsung 860 EVO 500GB |
Display(s) | LG UltraGear 32GN650-B + 4K Samsung TV |
Case | Phanteks NV7 |
Power Supply | GPS-750C |
I think you're underestimating the benefits of improving the cache hit ratio.
I'm pretty sure that the system is also constant enough. For Zen 3 it is JEDEC 3200 RAM etc. For the M1 Max it is whatever Apple decided to pair it with. Pretty simple.
IPC has never been a system-level metric, it has always been processor specific. You can search for papers regarding measurements of IPC and you'll never come across a system-level study of IPC because it doesn't make sense: the CPU is a constant, the system isn't. They always focus on isolating the characteristics of the CPU alone.
If the software you are using to calculate IPC happens to use the GPU, things get pretty complicated, but I don't see why it should not be used. IPC is always application specific.
Plus, to say that it's a system-level thing implies that everything should be measured together. What do we do if we want to measure FLOPS throughput? Do you count the GPU and all the various other accelerators in as well? After all, it's all on the same SoC, same system, right?
Upper bound isn't interesting, and neither is lower bound. Trying to write software to get the min and max of a given processor is a pointless exercise. I'm interested in IPC measured with software that is actually used by people to do something productive.
It results in more instructions being executed per clock some of the time; the upper and lower bounds of IPC and its behavior remain exactly the same.
System Name | RyzenGtEvo/ Asus strix scar II |
---|---|
Processor | Amd R5 5900X/ Intel 8750H |
Motherboard | Crosshair hero8 impact/Asus |
Cooling | 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK |
Memory | Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB |
Video Card(s) | Powercolour RX7900XT Reference/Rtx 2060 |
Storage | Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme |
Display(s) | Samsung UAE28"850R 4k freesync.dell shiter |
Case | Lianli 011 dynamic/strix scar2 |
Audio Device(s) | Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset |
Power Supply | corsair 1200Hxi/Asus stock |
Mouse | Roccat Kova/ Logitech G wireless |
Keyboard | Roccat Aimo 120 |
VR HMD | Oculus rift |
Software | Win 10 Pro |
Benchmark Scores | 8726 vega 3dmark timespy/ laptop Timespy 6506 |
I don't think I like your definition of IPC; thankfully we already have adequate ways to test.
I'm pretty sure that the system is also constant enough. For Zen 3 it is JEDEC 3200 RAM etc. For the M1 Max it is whatever Apple decided to pair it with. Pretty simple.
If the software you are using to calculate IPC happens to use the GPU, things get pretty complicated, but I don't see why it should not be used. IPC is always application specific.
Upper bound isn't interesting, and neither is lower bound. Trying to write software to get the min and max of a given processor is a pointless exercise. I'm interested in IPC measured with software that is actually used by people to do something productive.
edit: and for max, you can just read the wikichip page of a given processor, and check how many instructions it can dispatch every cycle. Is that something that relates to actual application performance? No.
System Name | Hotbox |
---|---|
Processor | AMD Ryzen 7 5800X, 110/95/110, PBO +150MHz, CO -7,-7,-20(x6) |
Motherboard | ASRock Phantom Gaming B550 ITX/ax |
Cooling | LOBO + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14 |
Memory | 32GB G.Skill FlareX 3200c14 @3800c15 |
Video Card(s) | PowerColor Radeon 6900XT Liquid Devil Ultimate, UC@2250MHz max @~200W |
Storage | 2TB Adata SX8200 Pro |
Display(s) | Dell U2711 main, AOC 24P2C secondary |
Case | SSUPD Meshlicious |
Audio Device(s) | Optoma Nuforce μDAC 3 |
Power Supply | Corsair SF750 Platinum |
Mouse | Logitech G603 |
Keyboard | Keychron K3/Cooler Master MasterKeys Pro M w/DSA profile caps |
Software | Windows 10 Pro |
It really isn't. There's no reason to expect an M1 in a Mac Mini to throttle under a single-core workload - it's neither thermally nor power constrained. And a 5950X under any type of reasonable cooling can maintain its max turbo (or even exceed it) in any ST workload indefinitely.
That is indeed an assumption. Irrespective of that, I still don't think their core architecture is that much better; virtually all of SECS's tests are known to rely heavily on memory ops, favoring either huge caches or fast system memory.
It won't - they don't even claim that. They claim ballpark 3080 mobile performance, but even in their vague and unlabeled graph they don't reach the same level. The Pro is compared to a mobile 3050 Ti, with the Max compared to the 3080 at 100 and 160W, beating the former and coming close to the latter.
If the Pro Max matches the 3080 in workloads it will be impressive... Mark Gurman said the desktop M1 Max is a 128-core GPU, that's 3090 desktop performance.
Nobody is saying that, we're just recognizing that they're pulling off some pretty astounding performance from a ~3GHz ARM core, matching or beating the fastest x86 cores at a fraction of the power and clock speed.
Sure, but if that's the case let's stop thinking that their chips are the greatest thing since sliced bread.
You'd be surprised. Not to mention the rich kids with $3000+ MBPs for school/studies etc. Though tbf, most of those have the Air or at worst the 13" M1 MBP, which are both cheaper and less powerful. These will sell like hotcakes to photographers, videographers, animators, journalists, musicians, all kinds of creative professionals, and a whackton of image-obsessed rich people.
I don't think anyone buys $3000+ laptops for office work, or if they do they're incredibly unintelligent. What Apple knows is that people want to use some of the professional software available for Mac, not necessarily their hardware.
Then why are AMD claiming a 15% IPC (in gaming workloads) bump from their stacked 3D cache? Given that those are among the most latency sensitive workloads, and a stacked, via-connected cache is a bit of a worst case scenario, that seems to contradict what you're saying here.
L3 caches are sometimes too slow to provide meaningful speed ups; it's usually the L1 caches that get hammered.
Only if it manages to keep latencies down while increasing them - that's part of why Rocket Lake has such weird performance characteristics.
Then it doesn't even make sense talking about IPC in that case, because any CPU will suddenly have higher IPC if it gets faster system memory or larger caches.
I think what they mean is that you can't measure IPC outside of the influence of the OS and its systems and workings - you need software to run and ways for that software to communicate with the CPU, after all. So in that way it is a system-level metric, as non-hardware changes can also affect it. The L3 latency bug in W11 would seem to noticeably lower Zen2/Zen3 IPC, for example.
IPC has never been a system-level metric, it has always been processor specific. You can search for papers regarding measurements of IPC and you'll never come across a system-level study of IPC because it doesn't make sense: the CPU is a constant, the system isn't. They always focus on isolating the characteristics of the CPU alone.
Plus, to say that it's a system-level thing implies that everything should be measured together. What do we do if we want to measure FLOPS throughput? Do you count the GPU and all the various other accelerators in as well? After all, it's all on the same SoC, same system, right?
I agree that it's about time PC CPU manufacturers start breaking down some walls, but you don't quite seem to appreciate the scope of Apple's engineering on their recent SoCs and CPU core architectures. I'd highly recommend reading AnandTech's M1/A14 deep dive, as it goes in depth (with self-written feature tests which are excellent illustrations) for everything from L1 cache behaviour to numbering the various execution ports in the CPU and estimating an overall layout of the CPU.
I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server-level system bandwidth with their SoC; yeah, it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.
On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end so that we can move to a new DDR standard in order to get more bandwidth.
It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and have a result. You can't do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests you use for the given DUT.
I don't think I like your definition of IPC; thankfully we already have adequate ways to test.
IPC is chip, no, core specific; everything else in a system is changeable.
And I do get your point, so do others; that's why reviews exist showing different application performance metrics.
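To put that definition in concrete terms, here is a minimal sketch of exactly that division - retired instructions over core cycles - read from hardware counters via Linux's perf_event_open; the busy-loop workload is just a hypothetical stand-in for whatever software you actually care about:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open one hardware counter on the calling thread. */
static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Hypothetical stand-in for "software people actually use". */
static volatile uint64_t sink;
static void workload(void)
{
    uint64_t acc = 0;
    for (uint64_t i = 0; i < 100000000ULL; i++)
        acc += i * 2654435761ULL;
    sink = acc;
}

int main(void)
{
    int instr_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cycle_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (instr_fd < 0 || cycle_fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(instr_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(cycle_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(instr_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycle_fd, PERF_EVENT_IOC_ENABLE, 0);

    workload();

    ioctl(instr_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycle_fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t instructions = 0, cycles = 0;
    read(instr_fd, &instructions, sizeof(instructions));
    read(cycle_fd, &cycles, sizeof(cycles));

    printf("instructions: %llu\ncycles: %llu\nIPC: %.2f\n",
           (unsigned long long)instructions, (unsigned long long)cycles,
           cycles ? (double)instructions / (double)cycles : 0.0);
    return 0;
}
```

Swap the workload for anything you like; the ratio you get back is whatever that particular software achieves on that particular setup, which is exactly why the result is application (and system) specific.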
It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and have a result. You can't do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests you use for the given DUT.
Yep, WPC is a lot more meaningful metric, and also always system specific, as well as application specific in the same way.
In the purest form, this is IPC.
But IPC is problematic across different instruction sets.
Let's say, in theory, you have a CISC instruction that loads a number, increments it by 1, then saves it into memory. It can do that across 3 cycles. On the other side, you have a RISC CPU that needs a load instruction, an increment instruction and a store instruction to do the same amount of work, but each takes 1 cycle to run. This means this CPU runs 3 times the number of instructions for the same amount of work. We could say it has 3x the IPC of the CISC CPU, but in the end nothing more was done.
This is why, in its purest form, IPC is only a good comparison within the same instruction set. And it's probably only really useful to compare two CPUs of the same manufacturer once you factor in the frequency they run at.
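Putting the hypothetical machines above into numbers (the 1-instruction/3-cycle CISC op and the 3 × 1-cycle RISC sequence are the made-up example values from this post, not real CPUs):

```c
#include <stdio.h>

int main(void)
{
    /* CISC: one load-increment-store instruction taking 3 cycles. */
    double cisc_instructions = 1, cisc_cycles = 3;
    /* RISC: load, add, store - three instructions, one cycle each. */
    double risc_instructions = 3, risc_cycles = 3;
    /* Both machines complete exactly one "unit of work". */
    double work = 1;

    printf("CISC: IPC = %.2f, work/clock = %.2f\n",
           cisc_instructions / cisc_cycles, work / cisc_cycles);
    printf("RISC: IPC = %.2f, work/clock = %.2f\n",
           risc_instructions / risc_cycles, work / risc_cycles);
    return 0;
}
```

The RISC machine reports 3x the IPC, yet work per clock comes out identical for both, which is the whole objection to comparing raw IPC across instruction sets.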
Also, the same processor can get a higher IPC at a lower frequency than at a higher one if it has to wait fewer cycles for I/O or memory. Waiting 60 ns for data to arrive costs fewer cycles at 2 GHz than the same wait at 5 GHz. This is why it's hard to extract IPC in its purest form from benchmarks.
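The 60 ns example in numbers (illustrative arithmetic only): the same wall-clock stall costs 2.5x more cycles at the higher clock, which is exactly how measured IPC drops as frequency rises while the core waits on memory.

```c
#include <stdio.h>

int main(void)
{
    const double wait_ns = 60.0;             /* same DRAM latency in both cases */
    const double clock_ghz[] = { 2.0, 5.0 };

    for (int i = 0; i < 2; i++) {
        /* GHz = cycles per nanosecond, so cycles lost = ns * GHz */
        double cycles_lost = wait_ns * clock_ghz[i];
        printf("%.1f GHz: %.0f cycles spent waiting on the same 60 ns miss\n",
               clock_ghz[i], cycles_lost);
    }
    return 0;
}
```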
What most people call IPC these days is mostly a somewhat standardized metric like the SPEC benchmark. It's no longer the amount of instructions per clock but the amount of work per clock. And in the end, that is what really matters.
But we should say WPC or something similar instead of IPC.
It's absolutely important to know the lower bound because that tells you what the worst-case scenario is.
Upper bound isn't interesting, and neither is lower bound.
Trying to write software to get the min and max of a given processor is a pointless exercise.
That's such a bizarre thing to say. OK, you find out that a CPU can achieve X IPC in a certain application. What can you do with that information? Absolutely nothing; people measure IPC to generalize what the performance characteristics are. If you are only interested in an application, as you say, then IPC measurements are pointless, you're actually just interested in the performance of that application.
I'm interested in IPC measured with software that is actually used by people to do something productive.
You'd be surprised. Not to mention the rich kids with $3000+ MBPs for school/studies etc.
Then why are AMD claiming a 15% IPC (in gaming workloads) bump from their stacked 3D cache? Given that those are among the most latency sensitive workloads, and a stacked, via-connected cache is a bit of a worst case scenario, that seems to contradict what you're saying here.
I agree that it's about time PC CPU manufacturers start breaking down some walls, but you don't quite seem to appreciate the scope of Apple's engineering on their recent SoCs and CPU core architectures. I'd highly recommend reading AnandTech's M1/A14 deep dive, as it goes in depth (with self-written feature tests which are excellent illustrations) for everything from L1 cache behaviour to numbering the various execution ports in the CPU and estimating an overall layout of the CPU.
Managing to design a CPU core this wide without significant performance or power penalties and managing to keep it fed is very impressive.
You can't just throw more bandwidth at a problem and expect it to go faster. Take way back when I used an i7 3820. Quad-channel DDR3-2133 gives impressive bandwidth numbers compared to a 2700K, but the reality is that the 3820 was only something like 5-8% faster at stock, and that difference wasn't the clock speeds, it was the extra 2MB of L3 that the 3820 had over the regular SB i7 chips. So bandwidth alone doesn't make a chip faster, otherwise the 290(X)/390(X) should have been insanely fast when in reality nVidia was doing the same with half the width.
I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server-level system bandwidth with their SoC; yeah, it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.
Wide memory interfaces for DRAM cost a lot of die space and power, and the traces for the memory chips make boards expensive to produce. It's not a good path forward for traditional DRAM. Now, I would agree with respect to HBM2 given its bandwidth and power characteristics, but it also comes with trade-offs in the sense that it's relatively expensive to produce. Apple is basically doing that with their DRAM, so they have the advantage of economy of scale.
On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end so that we can move to a new DDR standard in order to get more bandwidth. This is really the only concrete area that Intel and AMD can't do jack shit about, not by themselves anyway.
The lower bound is something stupidly low on modern processors, but exceedingly hard to achieve. You'll need to write a test that causes the most cache misses and is impossible to predict. Is that something that any proper SW ever experiences? No.
It's absolutely important to know the lower bound because that tells you what the worst-case scenario is.
I was talking about the min and max IPC. Trying to measure them is pointless.
Huh? You always aim to use the most out of a processor given whatever the time constraints are, I don't know what you are talking about.
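A sketch of the kind of deliberately hostile test being described - a random pointer chase over a buffer far larger than any cache, so nearly every load misses and the next address can't be prefetched or predicted (the sizes are arbitrary illustrative values). Running it under something like perf stat would show IPC collapsing well below 1:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N ((uint64_t)1 << 25)   /* 32M entries * 8 B = 256 MB, way past any cache */

int main(void)
{
    uint64_t *next = malloc(N * sizeof(*next));
    if (!next)
        return 1;

    /* Build one big random cycle (Sattolo's algorithm) so the chase
     * visits every slot once in an unpredictable order. */
    for (uint64_t i = 0; i < N; i++)
        next[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {
        uint64_t j = ((uint64_t)rand() * (uint64_t)rand()) % i;
        uint64_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    /* Every load depends on the previous one and almost always misses,
     * so the core spends nearly all its time stalled on memory. */
    uint64_t p = 0;
    for (uint64_t i = 0; i < N; i++)
        p = next[p];

    printf("%llu\n", (unsigned long long)p); /* keep the chase from being optimized out */
    free(next);
    return 0;
}
```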
It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in, and you didn't explain why that isn't pointless as well.
I was talking about the min and max IPC. Trying to measure them is pointless.
Not really, all it takes is one or two unfortunate instructions in a loop to have major performance implications. Happens all the time and you never know about it, you just hit compile and assume that that's just how it is. It can sometimes be as simple as choosing a 32-bit variable over a 64-bit one, which leads to some weird instructions under the hood that may run abnormally slow on some processors.
The lower bound is something stupidly low on modern processors, but exceedingly hard to achieve.
Nah, that is nowhere close to being at the lower bound. I updated some silly scenario that can be much, much worse.
It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in, and you didn't explain why that isn't pointless as well.
Not really, all it takes is one or two unfortunate instructions in a loop to have major performance implications. Happens all the time and you never know about it, you just hit compile and assume that that's just how it is. It can sometimes be as simple as choosing a 32-bit variable over a 64-bit one.
I wasn't talking about how relevant the hardware was, I was responding to you stating that you don't think anyone buys $3000+ laptops for "office work", and your arguments against Apple knowing their audience. It's pretty clear that they do (in part because they've been pissing off their core creative audience for years, and are now finally delivering something they would want).
I guess? It's still stupid, and that doesn't tell you anything about how relevant the hardware is if you argue that people who didn't actually need one would buy it anyway.
Well, that's why they also have massive L1 and L2 caches (and actually no CPU L3 at all, but a system-level LLC). But it contradicts what you said because you said (without any reservations) that "L3 caches are sometimes too slow to provide meaningful speed ups", which ... well, if they didn't meaningfully do so in real-world workloads it would be pretty odd for AMD to invest millions into stacked L3 cache tech, IMO.
I have no idea, but I fail to see why it contradicts anything that I said. I just said that more L3 cache doesn't always translate to much improved performance. If your code and data mostly reside in L1 cache then messing around with the L3 cache won't do anything. Obviously real-world workloads are a mixture of stuff that benefits more or less from different levels of caches, or from none of them at all.
...and? Is appreciating high-end engineering wrong? I haven't seen a single article that comes even close to the level of depth and quality of analysis of these articles. And nothing to contradict anything said either.
I read many of his articles and while they're very good I can't help but notice he has a particular affinity for everything Apple does.
You could say that, but only if you ignore the latencies and how they're keeping the cores fed. As I said, increasing cache size should balloon latency, yet theirs is lower than the competition despite 3-6x larger caches. And with that wide a core, you're really starting to push the boundaries of what can be effectively fed with conventional software - yet they're pulling it off. It would also be expected that this much larger die, even at lower clock speeds, would be rather power hungry for what it does - yet it isn't. This is no doubt largely down to granular power gating and the large caches saving them a lot of data shuffling (especially into/out of RAM), but that isn't the whole story.
A wide core with huge caches and, most importantly, a very conservative clock speed. That's why I am not impressed; trust me that if their chips ran at much higher clocks comparable to Intel's and AMD's chips while retaining the same characteristics then I'd be really impressed. But I know that's not possible, ultimately all they did is a simple clock for area/transistor budget tradeoff, because that way efficiency increases more than linearly.
Their cores are absolutely massive, that is absolutely true. But so what? They're still managing to use them in smartphones(!) and thin-and-light laptops. This mainly demonstrates that Apple is less margin-conscious on this level than AMD and Intel - which is very understandable. That clearly makes this core less suited for budget devices. But less impressive? Nah. A 5950X is a $750 CPU. If Apple sold these at retail they'd no doubt be more than that, but we're not comparing to budget devices, we're comparing to the best they're putting out.
I just cannot give them much credit when they outperform Intel and AMD in some metrics while using, who knows, maybe several times more transistors per core?
No, that's why we have industry-standard benchmarks based on real-world workloads. It's obvious that no such thing will ever be perfect, but it is a reasonable approximation of performance across a wide range of real-world usage scenarios.
It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in, and you didn't explain why that isn't pointless as well.
Crack on, show a link then. No one calls it a system measurement; IPC is not a system measurement.
It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and have a result. You can't do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests you use for the given DUT.
But it's exceedingly easy to stumble across; it doesn't have to be silly.
Nah, that is nowhere close to being at the lower bound.
It's ridiculously inefficient and it's not scalable. It's funny, they used to make fun of other smartphone manufacturers because "everyone can make something bigger". Now they do the same with their chips.
But so what?
I don't get it, what's so hard to understand about the word "sometimes"?
Well, that's why they also have massive L1 and L2 caches (and actually no CPU L3 at all, but a system-level LLC). But it contradicts what you said because you said (without any reservations) that "L3 caches are sometimes too slow to provide meaningful speed ups", which ... well, if they didn't meaningfully do so in real-world workloads it would be pretty odd for AMD to invest millions into stacked L3 cache tech, IMO.
It all has to do with the TLB and the size of the pages it uses to map the memory space. Basically Apple is using larger pages than on x86 platforms, so the search space for the same amount of memory is smaller, hence lower latencies.
But given that Intel's latest L1 cache size increase (24 to 32K, IIRC) came with a 1-cycle latency penalty, I can't quite see how they (or AMD) would suddenly pull a 3-6x increase in cache sizes out of their sleeves without also dramatically increasing latencies, which begs the question of whether others would even be able to make a similarly huge, wide, and cache-rich core without it being hobbled by slow cache accesses and thus not being fed. That seems to be the case, as we would otherwise most likely see much wider designs for servers and other markets where costs don't matter.
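One way to put rough numbers on the "smaller search space" point: TLB reach is simply entry count times page size, so for the same number of entries, 16KB pages cover four times as much memory as 4KB pages. The entry count below is a made-up round number for illustration, not any particular CPU's TLB:

```c
#include <stdio.h>

int main(void)
{
    const long entries = 2048;                   /* hypothetical L2 TLB size */
    const long page_sizes[] = { 4 * 1024, 16 * 1024 };

    for (int i = 0; i < 2; i++) {
        /* TLB reach = entries * page size */
        long reach_mib = entries * page_sizes[i] / (1024 * 1024);
        printf("%2ld KiB pages: %ld entries cover %ld MiB\n",
               page_sizes[i] / 1024, entries, reach_mib);
    }
    return 0;
}
```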
But exceedingly far from the actual lower bound IPC, which you said was somehow important to know.
But it's exceedingly easy to stumble across; it doesn't have to be silly.
Of course it's indicative of only the performance measured. I'd never generalize IPC measurements to overall performance.
Crack on, show a link then. No one calls it a system measurement; IPC is not a system measurement.
It's a core microarchitecture measurement, and is only indicative of performance, not demonstrative of all performance.
How is it "exceedingly" far? Even misaligned SSE/AVX loads or the odd floating-point division here and there can destroy the IPC; you don't even have to mess around with the code to find out how you can create the most cache misses or mispredictions. Plus you're not even testing the IPC at that point, you're just writing terrible code in general.
But exceedingly far from the actual lower bound IPC, which you said was somehow important to know.
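A sketch of how easily "the odd floating point division" drags throughput down: both loops below compute the same sum, but the first issues a divide every iteration, and hardware dividers are largely unpipelined, so it runs far slower. A compiler won't hoist the divide for you at default settings (without something like -ffast-math) because the two forms round differently; the constants are arbitrary:

```c
#include <stdio.h>

#define N 50000000

int main(void)
{
    double scale = 3.0;
    double slow = 0.0, fast = 0.0;

    /* One floating-point divide per iteration. */
    for (int i = 1; i <= N; i++)
        slow += (double)i / scale;

    /* Same arithmetic with the divide hoisted out of the loop. */
    double inv = 1.0 / scale;
    for (int i = 1; i <= N; i++)
        fast += (double)i * inv;

    /* Results differ only in the last bits of rounding. */
    printf("slow=%f fast=%f\n", slow, fast);
    return 0;
}
```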
Well, depends what type of efficiency you're looking for. It's not efficient per area or transistor, but per unit of power or clock speed? Massively so.
It's ridiculously inefficient and it's not scalable. It's funny, they used to make fun of other smartphone manufacturers because "everyone can make something bigger". Now they do the same with their chips.
Apparently equally hard as it is to understand that your "sometimes" in this case isn't particularly applicable, neither in this case nor in other relevant comparisons. That doesn't mean it's untrue, it just means it's not particularly relevant as an objection.
I don't get it, what's so hard to understand about the word "sometimes"?
Again: if it was that simple, why isn't everyone doing that? Given how many server chips Intel sells, if they could make a huge core like this for servers and deliver 50% higher IPC and ISO performance at half the power per core, they would do so, regardless of the area needed. You could always blame server vendors for not wanting to adopt such a system, but frankly I don't think that would be a problem. Google and Facebook would gobble them up for R&D purposes if nothing else, and they wouldn't care if the CPUs were $10,000 apiece. (Also, they use 16KB TLB pages, but they are also compatible with 4KB pages.)
It all has to do with the TLB and the size of the pages it uses to map the memory space. Basically Apple is using larger pages than on x86 platforms, so the search space for the same amount of memory is smaller, hence lower latencies.
Again: if it was that simple, why isn't everyone doing that?
It's not efficient per area or transistor, but per unit of power or clock speed? Massively so.
It's not that the end product isn't impressive, it's how they got there that isn't.
And that's the key here: the individual parts of what Apple is doing here might not be that impressive, but that they're managing to make all of this into a functional, well-balanced and highly performant and efficient core? That is impressive. Very much so.
But IPC is always application specific!!!
How is it "exceedingly" far? Even misaligned SSE/AVX loads or the odd floating-point division here and there can destroy the IPC; you don't even have to mess around with the code to find out how you can create the most cache misses or mispredictions. Plus you're not even testing the IPC at that point, you're just writing terrible code in general.
Then what's the point of measuring it? What do you do with that information?
But IPC is always application specific!!!
Terrible IPC always requires terrible code to go with it.
System Name | Primary|Secondary|Poweredge r410|Dell XPS|SteamDeck |
---|---|
Processor | i7 11700k|i7 9700k|2 x E5620 |i5 5500U|Zen 2 4c/8t |
Memory | 32GB DDR4|16GB DDR4|16GB DDR4|32GB ECC DDR3|8GB DDR4|16GB LPDDR5 |
Video Card(s) | RX 7800xt|RX 6700xt |On-Board|On-Board|8 RDNA 2 CUs |
Storage | 2TB m.2|512GB SSD+1TB SSD|2x256GBSSD 2x2TBGB|256GB sata|512GB nvme |
Display(s) | 50" 4k TV | Dell 27" |22" |3.3"|7" |
VR HMD | Samsung Odyssey+ | Oculus Quest 2 |
Software | Windows 11 Pro|Windows 10 Pro|Windows 10 Home| Server 2012 r2|Windows 10 Pro |
You are the one who insisted that knowing the lower bound of the IPC was important!!
Then what's the point of measuring it? What do you do with that information?