Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

R0H1T · Oct 19, 2021

You're exaggerating, it depends on the program & OS.

Aquinus · Oct 19, 2021

Vya Domus said:
That is indeed an assumption. Irrespective of that, I still don't think their core architecture is that much better; virtually all of SPEC's tests are known to rely heavily on memory ops, favoring either huge caches or fast system memory.

I think you're underestimating the benefits of improving the cache hit ratio. Most of the time in the environment I work in, caching performance determines a huge share of overall performance, since latency is otherwise dominated by reaching out to do I/O. Granted, this is caching at a different level of the memory hierarchy, but the idea is the same. Every time you improve the hit ratio, you improve performance, because you take a fraction of the time to do the same thing, not to mention that you come nowhere near as close to stalling the pipeline.

Just look at AMD. Infinity cache is serving a very important purpose and it's the same purpose as why Apple has a very large cache as well. More cache means better hit ratios which yields better performance. It might seem like an oversimplification, but it's really not.
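The hit-ratio argument can be made concrete with the usual average-access-time estimate. The sketch below is only illustrative: the function name and the 10 ns / 80 ns figures are assumptions, not measurements of any chip discussed in this thread.

```python
def average_access_time(hit_time_ns, miss_penalty_ns, hit_ratio):
    """Average access time: every access pays the hit time,
    and the fraction that misses also pays the miss penalty."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_time_ns + (1.0 - hit_ratio) * miss_penalty_ns


# Assumed, illustrative numbers: 10 ns for a hit, 80 ns extra for a miss.
for ratio in (0.80, 0.90, 0.95):
    avg = average_access_time(10, 80, ratio)
    print(f"hit ratio {ratio:.0%}: average access {avg:.1f} ns")
```

With these assumed inputs, every extra point of hit ratio removes the same slice of miss penalty from the average, which is the effect described above.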

Vya Domus · Oct 19, 2021

Aquinus said:
I think you're underestimating the benefits of improving the cache hit ratio.

I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server level system bandwidth with their SoC, yeah it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.

On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end so that we can move to a new DDR standard in order to get more bandwidth. This is really the only concrete area that Intel and AMD can't do jack shit about, not by themselves anyway.

Dredi · Oct 19, 2021

Vya Domus said:
IPC has never been a system level metric, it has always been processor specific. You can search for papers regarding measurements of IPC and you'll never come across a system level study of IPC because it doesn't make sense, the CPU is a constant, the system isn't. They always focus on isolating the characteristic of the CPU alone.

I’m pretty sure that the system is also constant enough. For Zen 3 it is JEDEC 3200 RAM etc. For the M1 Max it is whatever Apple decided to pair it with. Pretty simple.

Vya Domus said:
Plus to say that it's a system level thing implies that everything should be measured together, what do we do if we want to measure FLOPS throughput ? Do you count the GPU and all the various other accelerators in as well ? After all it's all on the same SoC, same system, right ?

If the software you are using to calculate IPC with happens to use the GPU things get pretty complicated, but I don’t see why it should not be used. IPC is always application specific.

Vya Domus said:
It results in more instructions being executed per clock some of the time; the upper and lower bounds of IPC and its behavior remain exactly the same.

Upper bound isn’t interesting, and neither is lower bound. Trying to write software to get the min and max of a given processor is a pointless exercise. I’m interested in IPC measured with software that is actually used by people to do something productive.

edit: and for max, you can just read the wikichip page of a given processor, and check how many instructions it can dispatch every cycle. Is that something that relates to actual application performance? No.

TheoneandonlyMrK · Oct 19, 2021

Dredi said:
I’m pretty sure that the system is also constant enough. For Zen 3 it is JEDEC 3200 RAM etc. For the M1 Max it is whatever Apple decided to pair it with. Pretty simple.

If the software you are using to calculate IPC with happens to use the GPU things get pretty complicated, but I don’t see why it should not be used. IPC is always application specific.

Upper bound isn’t interesting, and neither is lower bound. Trying to write software to get the min and max of a given processor is a pointless exercise. I’m interested in IPC measured with software that is actually used by people to do something productive.

edit: and for max, you can just read the wikichip page of a given processor, and check how many instructions it can dispatch every cycle. Is that something that relates to actual application performance? No.

I don't think I like your definition of IPC; thankfully we already have adequate ways to test.

IPC is chip, no, core specific; everything else in a system is changeable.

And I do get your point, and so do others; that's why reviews exist showing different application performance metrics.

Valantar · Oct 19, 2021

Vya Domus said:
That is indeed an assumption. Irrespective of that, I still don't think their core architecture is that much better; virtually all of SPEC's tests are known to rely heavily on memory ops, favoring either huge caches or fast system memory.

It really isn't. There's no reason to expect an M1 in a Mac Mini to throttle under a single-core workload - it's neither thermally nor power constrained. And a 5950X under any type of reasonable cooling can maintain its max turbo (or even exceed it) in any ST workload indefinitely.

Richards said:
If the Pro Max matches the 3080 in workloads it will be impressive... Mark Gurman said the desktop M1 Max has a 128-core GPU, that's desktop 3090 performance

It won't - they don't even claim that. They claim ballpark 3080 mobile performance, but even in their vague and unlabeled graph they don't reach the same level. The Pro is compared to a mobile 3050 Ti, with the Max compared to the 3080 at 100 and 160W, beating the former and coming close to the latter.

Vya Domus said:
Sure, but if that's the case let's stop thinking that their chips are the greatest thing since sliced bread.

Nobody is saying that, we're just recognizing that they're pulling off some pretty astounding performance from a ~3GHz ARM core, matching or beating the fastest X86 cores at a fraction of the power and clock speed.

Vya Domus said:
I don't think anyone buys 3000$+ laptops for office work, or if they do they're incredibly unintelligent. What Apple knows is that people want to use some of the professional software available for mac, not necessarily their hardware.

You'd be surprised. Not to mention the rich kids with $3000+ MBPs for school/studies etc. Though tbf, most of those have the air or at worst the 13" M1 MBP, which are both cheaper and less powerful. These will sell like hotcakes to photographers, videographers, animators, journalists, musicians, all kinds of creative professionals, and a whackton of image-obsessed rich people.

Vya Domus said:
L3 caches are sometimes too slow to provide meaningful speed ups, it's usually the L1 caches that get hammered.

Then why are AMD claiming a 15% IPC (in gaming workloads) bump from their stacked 3D cache? Given that those are among the most latency sensitive workloads, and a stacked, via-connected cache is a bit of a worst case scenario, that seems to contradict what you're saying here.

Vya Domus said:
Then it doesn't even make sense talking about IPC in that case, because any CPU will suddenly have higher IPC if it gets faster system memory or larger caches.

Only if it manages to keep latencies down while increasing cache size or memory speed - that's part of why Rocket Lake has such weird performance characteristics.

Vya Domus said:
IPC has never been a system level metric, it has always been processor specific. You can search for papers regarding measurements of IPC and you'll never come across a system level study of IPC because it doesn't make sense, the CPU is a constant, the system isn't. They always focus on isolating the characteristic of the CPU alone.

Plus to say that it's a system level thing implies that everything should be measured together, what do we do if we want to measure FLOPS throughput ? Do you count the GPU and all the various other accelerators in as well ? After all it's all on the same SoC, same system, right ?

I think what they mean is that you can't measure IPC outside of the influence of the OS and its systems and workings - you need software to run and ways for that software to communicate with the CPU, after all. So in that way it is a system-level metric, as non-hardware changes can also affect it. The L3 latency bug in W11 would seem to noticeably lower Zen2/Zen3 IPC, for example.

Vya Domus said:
I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server level system bandwidth with their SoC, yeah it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.

On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end so that we can move to a new DDR standard in order to get more bandwidth.

I agree that it's about time PC CPU manufacturers start breaking down some walls, but you don't quite seem to appreciate the scope of Apple's engineering on their recent SoCs and CPU core architectures. I'd highly recommend reading AnandTech's M1/A14 deep dive, as it goes in depth (with self-written feature tests which are excellent illustrations) for everything from L1 cache behaviour to numbering the various execution ports in the CPU and estimating an overall layout of the CPU.

In short, what it shows is that Apple is somehow managing L1 and L2 caches several times the size of the competition (6x the L1I size!) with lower latency - which is downright incredible, as conventional logic says that any cache size increase will increase latency too (which has been borne out over several generations of Intel and AMD CPUs, for example) - while also having re-order buffers 2-3x the size of Intel and AMD, an 8-wide (compared to 4-wide for both Intel and AMD) decoder, and 2-3x the execution ports, etc. Managing to design a CPU core this wide without significant performance or power penalties and managing to keep it fed is very impressive - and likely highly dependent on tightly integrated RAM, as well as those massive caches, but that doesn't take away from the performance results. The main drawback of ultra-wide core designs is clock speeds, but Apple seems to be doing decently there as well with >3GHz sustained and even 3GHz on the mobile A14.

Is this "the best CPU out there"? Not necessarily. That depends on your use case and software needs. But is it the most advanced architecture out there? Without a doubt. Do AMD and Intel have their work cut out for them to keep up, let alone catch up? Absolutely.

Me? I really hope this leads AMD to bet on more integrated APUs, and unified memory. I would love a balls-to-the-wall APU with heaps of LPDDR5 for my next laptop. 20-30CUs at low clocks? That would be amazing. It wouldn't be cheap, but it would be fantastic, as long as they can get unified memory working in Windows.

Dredi · Oct 19, 2021

TheoneandonlyMrK said:
I don't think I like your definition of IPC; thankfully we already have adequate ways to test.

IPC is chip, no, core specific; everything else in a system is changeable.

And I do get your point, and so do others; that's why reviews exist showing different application performance metrics.

It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and you have a result. You can’t do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests you use for the given DUT.
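As a minimal sketch of that division: `ipc` is a hypothetical helper and the counter values are invented for illustration. On Linux, `perf stat` reports both the instruction and the cycle counters for a real run.

```python
def ipc(instructions_retired: int, clock_cycles: int) -> float:
    """Instructions per clock for one measured run."""
    if clock_cycles <= 0:
        raise ValueError("clock_cycles must be positive")
    return instructions_retired / clock_cycles


# Hypothetical counter readings for the same program on the same CPU,
# once with slower and once with faster memory (numbers are invented).
runs = {
    "slower memory": (8_000_000_000, 5_000_000_000),
    "faster memory": (8_000_000_000, 4_000_000_000),
}
for label, (instructions, cycles) in runs.items():
    print(f"{label}: IPC = {ipc(instructions, cycles):.2f}")
```

The instruction count is the same in both invented runs; only the cycle count changes, which is why the measured figure depends on the rest of the system as well as the core.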

Punkenjoy · Oct 19, 2021

PCs have the drawbacks of their advantages, and these Apple M1* chips have the advantages of their drawbacks.

PC parts need to be supported in various systems, and they need to be upgradable (expanding memory, for example). This is an advantage over the M1, but the drawback is slower adoption of new standards and more latency, due to the fact that the memory isn't standard and is further away from the CPU. They also have less flexibility in memory design, since adding channels requires a new socket.

The M1 parts, on the other hand, are designed for specific form factors. The memory isn't upgradable and is soldered down close to the CPU. This design allows them to scale the memory bus up and down and adopt new standards rapidly, since they don't have to deal with a standard form factor for upgrades. It also lets them place the memory very close, for better latency and better energy efficiency. But if you want more memory because you didn't buy enough, you have to buy a new device. This is good for Apple, because people will tend to buy more than they need to avoid a costly upgrade later.

Apple is just pushing its advantages, since no one seems to care about the drawbacks of its platform. But if AMD and Intel did something similar, many PC enthusiasts wouldn't like it.

It still makes a lot of sense to do this on a laptop, since most of the time it will never be upgraded. Also, Apple owns its entire stack. If they want to add an accelerator, they can expose it through an API in the OS and make their compiler use it whenever needed.

In reality, I think they are where they are supposed to be regarding their own performance. The point isn't that they outperform now, it's that they sucked for two decades while being held back by Intel chips. They are just where a company that owns its full stack should be right now.

And the fun thing is that you can buy one if you want, and you can buy a PC if you prefer. The PC isn't dead.

Dredi said:
It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and you have a result. You can’t do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests you use for the given DUT.

In the purest form, this is IPC.

But IPC is problematic across different instruction sets.
Let's say, in theory, you have a CISC instruction that loads a number, increments it by 1, then saves it to memory. It can do that across 3 cycles. On the other side, you have a RISC CPU that needs a load instruction, an increment instruction, and a store instruction to do the same amount of work, but each takes 1 cycle to run. This means this CPU runs 3 times the number of instructions for the same amount of work. We could say it has 3x the IPC of the CISC CPU, but in the end nothing more was done.
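The thought experiment works out as follows. This is a sketch of the hypothetical CPUs described above, not of any real instruction set; the helper names are made up for illustration.

```python
def ipc(instructions, cycles):
    """Instructions per clock."""
    return instructions / cycles


def work_per_clock(work_units, cycles):
    """Units of useful work per clock."""
    return work_units / cycles


# One unit of work = "load a number, add 1, store it back".
# Hypothetical CPUs from the example: both need 3 cycles for that work.
cisc = {"instructions": 1, "cycles": 3}  # one fused instruction, 3 cycles
risc = {"instructions": 3, "cycles": 3}  # three 1-cycle instructions

for name, cpu in (("CISC", cisc), ("RISC", risc)):
    print(
        f"{name}: IPC = {ipc(cpu['instructions'], cpu['cycles']):.2f}, "
        f"work per clock = {work_per_clock(1, cpu['cycles']):.2f}"
    )
```

The RISC design reports three times the IPC, while work per clock is identical for both.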

This is why, in its purest form, IPC is only a good comparison within the same instruction set. And it's probably only really useful for comparing two CPUs from the same manufacturer once you factor in the frequency they run at.

Also, the same processor can get a higher IPC at a lower frequency than at a higher one if it has to wait fewer cycles for I/O or memory. Waiting 60 ns for data to arrive at 2 GHz loses fewer cycles than the same wait at 5 GHz. This is why it's hard to extract IPC in its purest form from a benchmark.
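To put numbers on the 60 ns example: the sketch below converts the wait into lost cycles and shows the effect on IPC for an invented workload. The workload figures (1000 instructions, 250 busy cycles, one miss) and the assumption that the core sits idle for the whole wait are simplifications, not measurements.

```python
def stall_cycles(latency_ns, clock_ghz):
    """Cycles lost while waiting: nanoseconds times cycles per nanosecond."""
    return latency_ns * clock_ghz


def effective_ipc(instructions, busy_cycles, misses, latency_ns, clock_ghz):
    """IPC when each miss stalls the core for the full memory latency
    (simplifying assumption: no overlap between stalls and useful work)."""
    total_cycles = busy_cycles + misses * stall_cycles(latency_ns, clock_ghz)
    return instructions / total_cycles


for ghz in (2.0, 5.0):
    lost = stall_cycles(60, ghz)
    result = effective_ipc(1000, 250, 1, 60, ghz)
    print(f"{ghz:.0f} GHz: {lost:.0f} cycles lost per miss, IPC = {result:.2f}")
```

The same 60 ns costs 120 cycles at 2 GHz and 300 cycles at 5 GHz, so the faster clock reports a lower IPC even though the core itself is unchanged.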

What most people call IPC these days is mostly a somewhat standardized metric like the SPEC benchmark. It's no longer the number of instructions per clock but the amount of work per clock. And in the end, that is what really matters.

But we should say WPC or something similar instead of IPC.

Dredi · Oct 19, 2021

Punkenjoy said:
In the purest form, this is IPC.

But IPC is problematic across different instruction sets.
Let's say, in theory, you have a CISC instruction that loads a number, increments it by 1, then saves it to memory. It can do that across 3 cycles. On the other side, you have a RISC CPU that needs a load instruction, an increment instruction, and a store instruction to do the same amount of work, but each takes 1 cycle to run. This means this CPU runs 3 times the number of instructions for the same amount of work. We could say it has 3x the IPC of the CISC CPU, but in the end nothing more was done.

This is why, in its purest form, IPC is only a good comparison within the same instruction set. And it's probably only really useful for comparing two CPUs from the same manufacturer once you factor in the frequency they run at.

Also, the same processor can get a higher IPC at a lower frequency than at a higher one if it has to wait fewer cycles for I/O or memory. Waiting 60 ns for data to arrive at 2 GHz loses fewer cycles than the same wait at 5 GHz. This is why it's hard to extract IPC in its purest form from a benchmark.

What most people call IPC these days is mostly a somewhat standardized metric like the SPEC benchmark. It's no longer the number of instructions per clock but the amount of work per clock. And in the end, that is what really matters.

But we should say WPC or something similar instead of IPC.

Yep, WPC is a much more meaningful metric, and it is also always system specific, as well as application specific in the same way.

IPC is a bit silly, as for example AVX-512 lowers IPC, but improves WPC.

Vya Domus · Oct 19, 2021

Dredi said:
Upper bound isn’t interesting, and neither is lower bound.

It's absolutely important to know the lower bounds because that tells you what the worst case scenario is.

Dredi said:
Trying to write software to get the min and max of a given processor is a pointless exercise.

Huh ? You always aim to get the most out of a processor given whatever the time constraints are, I don't know what you are talking about.

Dredi said:
I’m interested in IPC measured with software that is actually used by people to do something productive.

That's such a bizarre thing to say. OK, you find out that a CPU can achieve X IPC in a certain application. What can you do with that information ? Absolutely nothing, people measure IPC to generalize what the performance characteristics are. If you are only interested in an application, as you say, then IPC measurements are pointless, you're actually just interested in the performance of that application.

Valantar said:
You'd be surprised. Not to mention the rich kids with $3000+ MBPs for school/studies etc.

I guess ? It's still stupid and that doesn't tell you anything about how relevant the hardware is if you argue that people who didn't actually need one would buy it anyway.

Valantar said:
Then why are AMD claiming a 15% IPC (in gaming workloads) bump from their stacked 3D cache? Given that those are among the most latency sensitive workloads, and a stacked, via-connected cache is a bit of a worst case scenario, that seems to contradict what you're saying here.

I have no idea but I fail to see why it contradicts anything that I said. I just said that more L3 cache doesn't always translate to much improved performance. If your code and data mostly reside in L1 cache then messing around with the L3 cache won't do anything. Obviously real world workloads are a mixture of stuff that benefits more or less from different levels of caches or from none of them at all.

Valantar said:
I agree that it's about time PC CPU manufacturers start breaking down some walls, but you don't quite seem to appreciate the scope of Apple's engineering on their recent SoCs and CPU core architectures. I'd highly recommend reading AnandTech's M1/A14 deep dive, as it goes in depth (with self-written feature tests which are excellent illustrations) for everything from L1 cache behaviour to numbering the various execution ports in the CPU and estimating an overall layout of the CPU.

I read many of his articles and while they're very good I can't help but notice he has a particular affinity for everything Apple does.

Valantar said:
Managing to design a CPU core this wide without significant performance or power penalties and managing to keep it fed is very impressive

A wide core with huge caches and most importantly a very conservative clock speed. That's why I am not impressed, trust me that if their chips ran at much higher clocks comparable to Intel's and AMD's chips while retaining the same characteristics then I'd be really impressed. But I know that's not possible, ultimately all they did is a simple clock for area/transistor budget tradeoff because that way efficiency increases more than linearly. I just cannot give them much credit when they outperform Intel and AMD in some metrics while using who knows, maybe several times more transistors per core ?

Aquinus · Oct 19, 2021

Vya Domus said:
I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server level system bandwidth with their SoC, yeah it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.

You can't just throw more bandwidth at a problem and expect it to go faster. Take way back when I used an i7 3820. Quad channel DDR3-2133 gives impressive bandwidth numbers compared to a 2700k, but the reality is that the 3820 was only something like 5-8% faster at stock, but that difference wasn't the clock speeds, it was the extra 2MB of L3 that the 3820 had over the regular SB i7 chips. So bandwidth alone doesn't make a chip faster, otherwise the 290(x)/390(x) should have been insanely fast when in reality, nVidia was doing the same with half the width.
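For a sense of the raw numbers behind that comparison, theoretical peak DRAM bandwidth is channels times bus width times transfer rate. The sketch below assumes 64-bit channels and, purely for comparison, runs the dual-channel case at the same DDR3-2133 speed; the function name is made up and these are peak figures, not measured throughput.

```python
def peak_bandwidth_gb_s(channels, bus_width_bits, mega_transfers_per_s):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = channels * bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Assumption: 64-bit channels, both configurations at DDR3-2133.
quad = peak_bandwidth_gb_s(channels=4, bus_width_bits=64, mega_transfers_per_s=2133)
dual = peak_bandwidth_gb_s(channels=2, bus_width_bits=64, mega_transfers_per_s=2133)
print(f"quad channel DDR3-2133: {quad:.1f} GB/s peak")
print(f"dual channel DDR3-2133: {dual:.1f} GB/s peak")
```

On paper the quad-channel setup has double the peak bandwidth, far more than the 5-8% performance difference mentioned above.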

So to make a long story short, how the different levels of the memory hierarchy are built out really influences how it benefits the SoC as a whole. A huge LLC won't do you a whole lot of good if your L2 is absolutely tiny. So it's a bit more complicated than just throwing more of x, y, or z at a problem.

Vya Domus said:
On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end so that we can move to a new DDR standard in order to get more bandwidth. This is really the only concrete area that Intel and AMD can't do jack shit about, not by themselves anyway.

Wide memory interfaces for DRAM cost a lot of die space and power, and the traces for the memory chips make boards expensive to produce. It's not a good path forward for traditional DRAM. Now, I would agree with respect to HBM2 given its bandwidth and power characteristics, but it also comes with trade-offs in the sense that it's relatively expensive to produce. Apple is basically doing that with their DRAM, so they have the advantage of economy of scale.

Dredi · Oct 19, 2021

Vya Domus said:
It's absolutely important to know the lower bounds because that tells you what the worst case scenario is.

The lower bound is something stupidly low on modern processors, but exceedingly hard to achieve. You’ll need to write a test that causes the most cache misses and is impossible to predict. Is that something that any proper SW ever experiences? No.

Zen 3 with JEDEC 3200 has RAM latency of 80 ns or so. The worst possible IPC I can think of would require most program instructions to be fetched from RAM. That requires just a huge 3D LUT check and a goto based on that. So one RAM latency per two instructions, meaning an IPC of around 1/200. If the prediction logic can see through that, you’ll need to add some stupid instruction to do an address conversion that cannot easily be predicted (some hash function maybe, that has a single instruction in some extension) and you end up with an IPC of around 1/133.

edit: a cleaner solution would be to just write a simple routine that reads a byte at addr, then writes some hash (the processors must have some hash extension, so that it is simply one instruction) of the byte to addr and loops. That would produce an IPC of 1/100 or so.
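The back-of-the-envelope figures above can be reproduced as follows. The post does not state a clock speed; a 5 GHz clock is assumed here because it reproduces the quoted 1/200, 1/133 and 1/100 values, and the instruction counts per RAM access (2, 3 and 4) are inferred from those figures. The helper name is hypothetical.

```python
def worst_case_ipc(instructions_per_miss, ram_latency_ns, clock_ghz):
    """IPC when every group of instructions waits for one full RAM access."""
    stall_cycles = ram_latency_ns * clock_ghz  # cycles per RAM round trip
    return instructions_per_miss / stall_cycles


# Assumptions: 80 ns RAM latency (from the post), 5 GHz clock (not stated).
for count in (2, 3, 4):
    value = worst_case_ipc(count, ram_latency_ns=80, clock_ghz=5.0)
    print(f"{count} instructions per RAM access: IPC is about 1/{1 / value:.0f}")
```

A lower assumed clock would give proportionally higher IPC figures, since the same 80 ns would cost fewer cycles.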

Vya Domus said:
Huh ? You always aim to get the most out of a processor given whatever the time constraints are, I don't know what you are talking about.

I was talking about the min and max IPC. Trying to measure them is pointless.

Vya Domus · Oct 19, 2021

Dredi said:
I was talking about the min and max IPC. trying to measure them is pointless.

It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in and you didn't explain why that isn't pointless as well.

Dredi said:
The lower bound is something stupidly low on modern processors, but exceedingly hard to achieve.

Not really, all it takes is one or two unfortunate instructions in a loop to have major performance implications. Happens all the time and you never know about it, you just hit compile and assume that that's just how it is. It can sometimes be as simple as choosing a 32 bit variable over a 64 bit one which leads to some weird instructions under the hood that may run abnormally slow on some processors.

Dredi · Oct 19, 2021

Vya Domus said:
It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in and you didn't explain why that isn't pointless as well.

Not really, all it takes is one or two unfortunate instructions in a loop to have major performance implications. Happens all the time and you never know about it, you just hit compile and assume that that's just how it is. It can sometimes be as simple as choosing a 32 bit variable over a 64 bit one.

Nah, that is nowhere close to being at the lower bound. I added a silly scenario to my earlier post that can be much, much worse.

Valantar · Oct 19, 2021

Vya Domus said:
I guess ? It's still stupid and that doesn't tell you anything about how relevant the hardware is if you argue that people who didn't actually need one would buy it anyway.

I wasn't talking about how relevant the hardware was, I was responding to you stating that you don't think anyone buys $3000+ laptops for "office work", and your arguments against Apple knowing their audience. It's pretty clear that they do (in part because they've been pissing off their core creative audience for years, and are now finally delivering something they would want).

Vya Domus said:
I have no idea but I fail to see why it contradicts anything that I said. I just said that more L3 cache doesn't always translate to much improved performance. If your code and data mostly reside in L1 cache then messing around with the L3 cache won't do anything. Obviously real world workloads are a mixture of stuff that benefits more or less from different levels of caches or from none of them at all.

Well, that's why they also have massive L1 and L2 caches (and actually no CPU L3 at all, but a system-level LLC). But it contradicts what you said because you said (without any reservations) that "L3 caches are sometimes too slow to provide meaningful speed ups", which ... well, if they didn't meaningfully do so in real-world workloads it would be pretty odd for AMD to invest millions into stacked L3 cache tech, IMO.

Vya Domus said:
I read many of his articles and while they're very good I can't help but notice he has a particular affinity for everything Apple does.

...and? Is appreciating high-end engineering wrong? I haven't seen a single article that comes even close to the level of depth and quality of analysis of these articles. And nothing to contradict anything said either.

Vya Domus said:
A wide core with huge caches and most importantly a very conservative clock speed. That's why I am not impressed, trust me that if their chips ran at much higher clocks comparable to Intel's and AMD's chips while retaining the same characteristics then I'd be really impressed. But I know that's not possible, ultimately all they did is a simple clock for area/transistor budget tradeoff because that way efficiency increases more than linearly.

You could say that, but only if you ignore the latencies and how they're keeping the cores fed. As I said, increasing cache size should balloon latency, yet theirs is lower than the competition despite 3-6x larger caches. And with that wide a core, you're really starting to push the boundaries of what can be effectively fed with conventional software - yet they're pulling it off. It would also be expected that this much larger die, even at lower clock speeds, would be rather power hungry for what it does - yet it isn't. This is no doubt largely down to granular power gating and the large caches saving them a lot of data shuffling (especially into/out of RAM), but that isn't the whole story.

Vya Domus said:
I just cannot give them much credit when they outperform Intel and AMD in some metrics while using who knows, maybe several times more transistors per core ?

Their cores are absolutely massive, that is absolutely true. But so what? They're still managing to use them in smartphones(!) and thin-and-light laptops. This mainly demonstrates that Apple is less margin conscious on this level than AMD and Intel - which is very understandable. That clearly makes this core less suited for budget devices. But less impressive? Nah. A 5950X is a $750 CPU. If Apple sold these at retail they'd no doubt be more than that, but we're not comparing to budget devices, we're comparing to the best they're putting out.

But given that Intel's latest L1 cache size increase (24 to 32K, IIRC) came with a 1-cycle latency penalty, I can't quite see how they (or AMD) would suddenly pull a 3-6x increase in cache sizes out of their sleeves without also dramatically increasing latencies, which begs the question of whether others would even be able to make a similarly huge, wide, and cache-rich core without it being hobbled by slow cache accesses and thus not being fed. That seems to be the case, as we would otherwise most likely see much wider designs for servers and other markets where costs don't matter.

Vya Domus said:
It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in and you didn't explain why that isn't pointless as well.

No, that's why we have industry-standard benchmarks based on real-world workloads. It's obvious that no such thing will ever be perfect, but it is a reasonable approximation of performance across a wide range of real-world usage scenarios.

TheoneandonlyMrK · Oct 19, 2021

Dredi said:
It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and you have a result. You can’t do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests you use for the given DUT.

Crack on, show a link then. No one calls it a system measurement; IPC is not a system measurement.
It's a core microarchitecture measurement, and is only indicative of performance, not demonstrative of all performance.

Vya Domus · Oct 19, 2021

Dredi said:
Nah, that is nowhere close to being at the lower bound.

But it's exceedingly easy to stumble across, it doesn't have to be silly.

Valantar said:
But so what?

It's ridiculously inefficient and it's not scalable. It's funny, they used to make fun of other smartphone manufacturers because "everyone can make something bigger". Now they do the same with their chips.

Valantar said:
Well, that's why they also have massive L1 and L2 caches (and actually no CPU L3 at all, but a system-level LLC). But it contradicts what you said because you said (without any reservations) that "L3 caches are sometimes too slow to provide meaningful speed ups", which ... well, if they didn't meaningfully do so in real-world workloads it would be pretty odd for AMD to invest millions into stacked L3 cache tech, IMO.

I don't get it, what's so hard to understand about the word "sometimes"?

Valantar said:
But given that Intel's latest L1 cache size increase (32 to 48K, IIRC) came with a 1-cycle latency penalty, I can't quite see how they (or AMD) would suddenly pull a 3-6x increase in cache sizes out of their sleeves without also dramatically increasing latencies, which raises the question of whether others would even be able to make a similarly huge, wide, and cache-rich core without it being hobbled by slow cache accesses and thus not being fed. That seems to be the case, as we would otherwise most likely see much wider designs for servers and other markets where costs don't matter.

It all has to do with the TLB and the size of the pages it uses to map the memory space. Basically, Apple is using larger pages than x86 platforms do, so the search space for the same amount of memory is smaller, hence lower latencies.

Dredi · Oct 19, 2021

Vya Domus said:
But it's exceedingly easy to stumble across, it doesn't have to be silly.

But exceedingly far from the actual lower bound IPC, which you said was somehow important to know.

TheoneandonlyMrK said:
Crack on, show a link then. No one calls it a system measurement; IPC is not a system measurement.
It's a core microarchitecture measurement, and is only indicative of performance, not demonstrative of all performance.

Of course it’s indicative of only the performance measured. I’d never generalize IPC measurements to overall performance.

Just read from here: https://en.m.wikipedia.org/wiki/Instructions_per_cycle

“The number of instructions executed per clock is not a constant for a given processor; it depends on how the particular software being run interacts with the processor, and indeed the entire machine, particularly the memory hierarchy.”
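To make the definition concrete, here is a minimal sketch of the division described above. The counter values and workload names are made up for illustration; on a real machine they would come from hardware performance counters, which are not shown here.

```python
def ipc(instructions_retired: int, core_cycles: int) -> float:
    """Instructions per cycle: retired instructions divided by elapsed core cycles."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return instructions_retired / core_cycles


# Hypothetical counter readings for two workloads on the same core.
# Same clock cycles, very different instruction throughput.
samples = {
    "cache_friendly_loop": (4_000_000_000, 1_000_000_000),
    "dependent_cache_misses": (10_000_000, 1_000_000_000),
}

for name, (instructions, cycles) in samples.items():
    print(f"{name}: IPC = {ipc(instructions, cycles):.2f}")
```

The same function applied to the same core gives wildly different results depending on the workload, which is the point of the quoted passage.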

Vya Domus · Oct 19, 2021

Dredi said:
But exceedingly far from the actual lower bound IPC, which you said was somehow important to know.

How is it "exceedingly" far? Even misaligned SSE/AVX loads or the odd floating point division here and there can destroy the IPC, you don't even have to mess around with the code to find out how you can create the most cache misses or mispredictions. Plus you're not even testing the IPC at that point, you're just writing terrible code in general.

Valantar · Oct 19, 2021

Vya Domus said:
It's ridiculously inefficient and it's not scalable. It's funny, they used to make fun of other smartphone manufacturers because "everyone can make something bigger". Now they do the same with their chips.

Well, depends what type of efficiency you're looking for. It's not efficient per area or transistor, but per unit of power or clock speed? Massively so.

Vya Domus said:
I don't get it, what's so hard to understand about the word "sometimes"?

Apparently just as hard as it is to understand that your "sometimes" isn't particularly applicable, either in this case or in other relevant comparisons. That doesn't mean it's untrue, it just means it's not particularly relevant as an objection.

Vya Domus said:
It all has to do with the TLB and the size of the pages it uses to map the memory space. Basically, Apple is using larger pages than x86 platforms do, so the search space for the same amount of memory is smaller, hence lower latencies.

Again: if it was that simple, why isn't everyone doing that? Given how many server chips Intel sells, if they could make a huge core like this for servers and deliver 50% higher IPC and iso-performance at half the power per core, they would do so, regardless of the area needed. You could always blame server vendors for not wanting to adopt such a system, but frankly I don't think that would be a problem. Google and Facebook would gobble them up for R&D purposes if nothing else, and they wouldn't care if the CPUs were $10 000 apiece. (Also, they use 16KB pages, but they are also compatible with 4KB pages.)
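For a rough sense of what the 4KB vs 16KB page sizes change, here is a small sketch of two commonly cited effects: TLB reach (entries times page size) and the largest L1 that can be indexed purely from page-offset bits (page size times associativity, the usual constraint for a virtually indexed, physically tagged cache). The entry count and associativity below are hypothetical values chosen for illustration, not figures for any specific chip.

```python
KIB = 1024

def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory covered by a TLB without a miss: one page per entry."""
    return entries * page_size_bytes

def max_vipt_cache_bytes(page_size_bytes: int, ways: int) -> int:
    """Largest cache whose set index fits entirely within the page offset."""
    return page_size_bytes * ways

def pages_needed(working_set_bytes: int, page_size_bytes: int) -> int:
    """Number of pages (and thus translations) needed to map a working set."""
    return -(-working_set_bytes // page_size_bytes)  # ceiling division

ENTRIES = 256   # hypothetical TLB entry count
WAYS = 8        # hypothetical L1 associativity

for page_kib in (4, 16):
    page = page_kib * KIB
    print(
        f"{page_kib} KiB pages: "
        f"TLB reach {tlb_reach_bytes(ENTRIES, page) // KIB} KiB, "
        f"max VIPT L1 {max_vipt_cache_bytes(page, WAYS) // KIB} KiB, "
        f"{pages_needed(64 * KIB * KIB, page)} pages for a 64 MiB working set"
    )
```

Under these assumptions, quadrupling the page size quadruples both the TLB reach and the L1 size that can be indexed without waiting for translation. It says nothing about the rest of the design, which is the part being argued over here.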

And that's the key here: the individual parts of what Apple is doing here might not be that impressive, but that they're managing to make all of this into a functional, well balanced and highly performant and efficient core? That is impressive. Very much so. In the end, what matters at the user end is performance and power consumption, which are always in tension, especially in mobile and SFF use cases. The M1 (and upcoming siblings) manages to shift to an entirely different level in this balance, most likely delivering 5800X-level performance (if not higher) at half the power or less (a 5800X is ~140W under full boost and an all-core load after all, while these are 50-60W chips), while also containing either a mid-range or high-end dGPU-level iGPU. That is obviously impressive. Will it come with tradeoffs? Of course it will. Concurrent CPU and GPU loads will be power and/or thermally limited, as always, and they do spend an almost silly amount of silicon per chip. But does that matter when the laptop is comparably priced to competitors? No. And sure, you can no doubt find a comparable laptop for less. But a 5980HX+3080/Quadro RTX workstation isn't going to cost you any less than an M1 Max MBP, and both that and the cheaper consumer-focused version are going to be much bigger and heavier, and have terrible battery life. Making a product is, when it comes down to it, about the full package. These chips clearly have downsides, but they are downsides that are largely immaterial in the context of the overall package. And that's what makes them impressive.
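As a back-of-the-envelope check on the "half the power or less" figure, here is a small sketch of the arithmetic. It assumes equal performance on both chips, which is the post's estimate and not a measured result; the wattages are the approximate ones quoted above.

```python
def perf_per_watt_ratio(power_a_w: float, power_b_w: float,
                        perf_a: float = 1.0, perf_b: float = 1.0) -> float:
    """How many times more performance per watt chip B delivers than chip A."""
    return (perf_b / power_b_w) / (perf_a / power_a_w)

DESKTOP_W = 140  # ~5800X under all-core boost, per the post

# Assumed equal performance (perf_a == perf_b); only the power differs.
for soc_w in (50, 60):
    ratio = perf_per_watt_ratio(DESKTOP_W, soc_w)
    print(f"{soc_w} W vs {DESKTOP_W} W at equal performance: {ratio:.2f}x perf/W")
```

If the performance assumption is off in either direction, the ratio scales linearly with it via the `perf_a` and `perf_b` arguments.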

Vya Domus · Oct 19, 2021

Valantar said:
Again: if it was that simple, why isn't everyone doing that?

I guess it would need a lot of changes on many levels, I am not sure, I am not that well versed in these details to be able to tell. The point is there is nothing that incomprehensible about how they achieved these things.

Valantar said:
It's not efficient per area or transistor, but per unit of power or clock speed? Massively so.

But what's the point if, say, you achieve X times the efficiency using more than X times the area? The disadvantages will eventually outpace the advantages, and you'll be stuck with a design that's hard to change.

Valantar said:
And that's the key here: the individual parts of what Apple is doing here might not be that impressive, but that they're managing to make all of this into a functional, well balanced and highly performant and efficient core? That is impressive. Very much so.

It's not that the end product isn't impressive, it's how they got there that isn't.

A 400+ mm^2 SoC on the newest node with 400GB/s bandwidth that's really fast? Wow... I guess.

Dredi · Oct 19, 2021

Vya Domus said:
How is it "exceedingly" far? Even misaligned SSE/AVX loads or the odd floating point division here and there can destroy the IPC, you don't even have to mess around with the code to find out how you can create the most cache misses or mispredictions. Plus you're not even testing the IPC at that point, you're just writing terrible code in general.

But IPC is always application specific!!!
Terrible IPC always requires terrible code to go with it.

My example got to around one instruction per 100 clock cycles, and is likely the worst you can get to without disabling processor features. What is the IPC in your misaligned AVX loads?

Vya Domus · Oct 19, 2021

Dredi said:
But IPC is always application specific!!!
Terrible IPC always requires terrible code to go with it.

Then what's the point of measuring it? What do you do with that information?

r9 · Oct 19, 2021

Personally, when I first heard of the M1 I thought Apple was going to make a laptop-shaped phone that would be cheap to make, sell for the same price, and have poor software support. But after it was released I watched a few YouTube videos, and all I know is that people were very happy with it, especially with things like rendering, battery life, and having the same performance on battery as on mains power, something neither AMD nor Intel can ever do with x86. The video rendering discussion can be closed, as Intel/AMD won't be able to come even close to the Pro/Max. Also, macOS support for ARM is light-years ahead of Windows for ARM. ARM can be the future if Microsoft can make something as efficient as Rosetta. IMO 95% of people use only 10% of the instruction set, so why not reap all the benefits of a RISC chip for the majority of people? The other 5% can always go with Intel/AMD. So it makes much more sense for ARM to be the mainstream option, not the other way around. The problem is that the only way we get a proper PC/Windows ARM platform is for AMD and Intel to enter that market. And it will all depend on what Apple does with it, but Apple being Apple, they create their own markets and sell expensive laptops to only a very small portion of the global laptop market, so AMD and Intel will never be in a position where they have no choice but to switch to ARM.

Dredi · Oct 19, 2021

Vya Domus said:
Then what's the point of measuring it? What do you do with that information?

You are the one who insisted that knowing the lower bound of the IPC was important!!

I have never stated that knowing it is of any importance.

IPC in itself (or rather WPC) is a great way to compare different systems analytically. I.e. to understand differences in generic application performance and where the differences might come from.
