
Riding on the Success of the M1, Apple Readies 32-core Chip for High-end Macs

AMD Ryzen 5000 = Evolutionary
Apple M1 = Revolutionary

The M1 has made it obvious that ARM will dominate the desktop within the next 10 years. The average desktop user doesn't need an x86 CPU. x86 users are about to become a niche market. Efficiency has clearly won out over flexibility.

In terms of efficiency, the M1 is currently the best CPU and GPU on the market. Apple's more powerful chips will also be more efficient than any competing products.

A lot of PC users are already in denial over the M1's superiority, and they'll stay that way for a long time because they're stupid.

I agree 100%. I think x86 is here to stay, but it will be for gaming/HPC. Everyday ARM CPUs will have more than enough horsepower for your day-to-day tasks. I just got a 13" 11th-gen Intel laptop and I was so tempted to get a MacBook Air with the M1, simply because the battery life is OUTSTANDING and it can do anything I need to do on the x86 platform. It's also not just Apple who is making massive strides in ARM. Qualcomm may be a year behind, but they have no problem parrying Apple when it comes to ARM. I will be very interested in an ARM-based Windows platform in the future (assuming Windows is ready).
 
Everyday ARM CPUs will have more than enough horsepower for your day-to-day tasks.

I have literally heard this for over a decade now; it's always in conjunction with something else. Netbooks were supposed to be enough for everyday tasks, then tablets, Chromebooks, etc., all supposedly powered by cheap, efficient SoCs, yet God knows how many millions of laptops with dedicated x86 CPUs and GPUs continue being shipped every year.
 
So... I know that audio-professionals are super-interested in low-latency "big-cores". (Mostly because, I believe, those DSP programmers don't know how to take advantage of GPUs quite yet. GPUs are being used in 5GHz software-defined radios, I'm pretty sure your 44.1kHz audio-filters are easier to process... but I digress). Under current audio-programming paradigms, a core like the M1 is really, really good. You've got HUGE L1 cache to hold all sorts of looped-effects / reverb / instruments / blah blah blah, and you don't have any complicated latency issues to deal with. (A GPU would have microseconds of delay per kernel. It'd take an all-GPU design to negate the latency issue: more effort than current audio-engineers seem to have).

So I think a 32-core M1 probably would be a realtime audio-engineer's best platform. At least, until software teams figure out that 10 TFLOPS of GPU compute is a really good system to perform DSP math on, and rejig their kernels to work with the GPU's latency. (Microseconds of latency per kernel: small enough that it's good for real-time audio, but you don't really have much room to play around with. It'd have to be optimized down to just a few dozen kernel invocations to match the ridiculous latency requirements that musicians have.)
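
To make "current audio-programming paradigms" concrete, here's a minimal sketch (my own illustration, not anyone's product code) of the kind of per-buffer CPU processing I mean: all the effect state sits in ordinary arrays that ideally stay hot in cache, and the render loop simply has to return before the driver needs the next buffer.

```swift
// Illustrative only: a simple feedback delay processed on the CPU.
final class DelayEffect {
    private var delayLine = [Float](repeating: 0, count: 22_050) // ~0.5 s at 44.1 kHz
    private var writeIndex = 0

    func render(_ buffer: inout [Float]) {
        for i in 0..<buffer.count {
            let delayed = delayLine[writeIndex]
            let wet = buffer[i] + 0.4 * delayed   // mix input with delayed feedback
            delayLine[writeIndex] = wet
            writeIndex = (writeIndex + 1) % delayLine.count
            buffer[i] = wet
        }
    }
}

// In a real app this would run inside an Audio Unit / AVAudioSourceNode render
// callback with, say, a 64-sample buffer (~1.5 ms deadline at 44.1 kHz).
var block = [Float](repeating: 0, count: 64)
DelayEffect().render(&block)
```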
 
I have literally heard this for over a decade now; it's always in conjunction with something else. Netbooks were supposed to be enough for everyday tasks, then tablets, Chromebooks, etc., all supposedly powered by cheap, efficient SoCs, yet God knows how many millions of laptops with dedicated x86 CPUs and GPUs continue being shipped every year.
Very true. Someone will always want something as close as possible to desktop performance in a portable device, even though they never quite achieve it in the same sense. Still, today's desktop is tomorrow's laptop.
 
Assuming this is true, Apple aren't playing around. This could have consequences for the entire industry.
Of course, big changes don't happen overnight, and a big software ecosystem is one of those things that's particularly slow to change.
Windows on ARM, Nvidia's bid for ARM, rumors of AMD working on an ARM design. Intel?
x86 vs ARM. Interesting times ahead.
 
So... I know that audio-professionals are super-interested in low-latency "big-cores". (Mostly because, I believe, those DSP programmers don't know how to take advantage of GPUs quite yet. GPUs are being used in 5GHz software-defined radios, I'm pretty sure your 44.1kHz audio-filters are easier to process... but I digress). Under current audio-programming paradigms, a core like the M1 is really, really good. You've got HUGE L1 cache to hold all sorts of looped-effects / reverb / instruments / blah blah blah, and you don't have any complicated latency issues to deal with. (A GPU would have microseconds of delay per kernel. It'd take an all-GPU design to negate the latency issue: more effort than current audio-engineers seem to have).

I don't know if that's quite true; the larger the core, the worse the latency is, because of all that front-end pre-processing to figure out the best scheme for executing the micro-ops. If you want low latency you need a processor that's as basic as possible with a short pipeline. The M1 is, or will be, good probably because of dedicated DSPs.
 
I don't know if that's quite true; the larger the core, the worse the latency is, because of all that front-end pre-processing to figure out the best scheme for executing the micro-ops. If you want low latency you need a processor that's as basic as possible with a short pipeline. The M1 is, or will be, good probably because of dedicated DSPs.

Assuming 44.1 kHz, you have 22 microseconds to generate a sample. That's your hard limit: 22 microseconds per sample. A CPU task switch is on the order of ~10 microseconds. Reading from an SSD is ~1 microsecond (aka 100,000 IOPS). Talking with a GPU is ~5 µs. Etc., etc. You must deliver the sample, otherwise the audio will "pop", and DJs don't like that. You can batch samples up into 44- to 88-sample chunks (1 ms to 2 ms "delivered" to the audio driver at a time), but if you go too far beyond that, you'll start to incur latency, and DJs also don't like that.

So we're not talking about nanosecond-level latency (where microarchitecture decisions matter). There's still 22,000 nanoseconds per sample after all. But it does mean that if you fit inside of L1 vs L2, or maybe L2 vs L3... those sorts of things really matter inside the hundreds-of-microseconds timeframe.

Audio programs live within that area: the ~20-microsecond to 1000-microsecond range. Some things (ex: micro-op scheduling) are too fast: micro-op scheduling changes things at the 0.0005-microsecond (or half-a-nanosecond) level. That's not going to actually affect audio systems. Other things (ex: 5 µs per GPU kernel invocation) are serious uses of time and need to be seriously considered and planned around. (Which is probably why no GPU-based audio software exists yet: that's cutting it close and it'd be a risk.)
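
If it helps, here's the same arithmetic written out (just plugging in the rough figures above; treat the 5 µs kernel-launch cost as a ballpark, not a measurement):

```swift
// Back-of-the-envelope version of the numbers above (illustrative only).
let sampleRate = 44_100.0
let samplePeriod = 1.0 / sampleRate               // ≈ 22.7 µs per sample
let chunkSamples = 44.0                           // ~1 ms worth of samples
let chunkDeadline = chunkSamples * samplePeriod   // ≈ 1.0 ms to fill one chunk

let kernelLaunch = 5e-6                           // ~5 µs per GPU kernel invocation
let launchesPerChunk = chunkDeadline / kernelLaunch
// ≈ 200 launches would eat the whole 1 ms budget on launch overhead alone, which
// is why a GPU design would have to get by on a few dozen invocations and leave
// the rest of the deadline for the actual DSP work.
print(samplePeriod * 1e6, chunkDeadline * 1e3, launchesPerChunk)
```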

-------

The 128 kB L1 cache of Apple's M1 means the L1 cache fits the most "instrument data" (or so I've been told). I'm neither an audio-engineer, nor an audio-programmer, nor an audio-user / musician / DJ or whatever. But when I talk to audio-users, those are the issues they talk about.
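
For a rough sense of scale (my own numbers, assuming plain 32-bit mono samples), that 128 kB figure works out like this:

```swift
// How much raw audio state fits in a 128 kB L1 data cache, assuming ordinary
// 32-bit float samples? (Illustrative only; real "instrument data" is more varied.)
let l1Bytes = 128 * 1024
let samplesInL1 = l1Bytes / MemoryLayout<Float>.stride   // 32,768 samples
let secondsInL1 = Double(samplesInL1) / 44_100.0         // ≈ 0.74 s of mono audio
print(samplesInL1, secondsInL1)
```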
 
If they can, that's cool. But I don't know many users (including many ex-Mac users) who want to pay Apple prices for that tier of performance (and I highly doubt Apple wants to give up their margins).
 
Assuming 44.1 kHz, you have 22 microseconds to generate a sample. That's your hard limit: 22 microseconds per sample. A CPU task switch is on the order of ~10 microseconds. Reading from an SSD is ~1 microsecond (aka 100,000 IOPS). Talking with a GPU is ~5 µs. Etc., etc. You must deliver the sample, otherwise the audio will "pop", and DJs don't like that. You can batch samples up into 44- to 88-sample chunks (1 ms to 2 ms "delivered" to the audio driver at a time), but if you go too far beyond that, you'll start to incur latency, and DJs also don't like that.

So we're not talking about nanosecond-level latency (where microarchitecture decisions matter). There's still 22,000 nanoseconds per sample after all. But it does mean that if you fit inside of L1 vs L2, or maybe L2 vs L3... those sorts of things really matter inside the hundreds-of-microseconds timeframe.

Audio programs live within that area: the ~20-microsecond to 1000-microsecond range. Some things (ex: micro-op scheduling) are too fast: micro-op scheduling changes things at the 0.0005-microsecond (or half-a-nanosecond) level. That's not going to actually affect audio systems. Other things (ex: 5 µs per GPU kernel invocation) are serious uses of time and need to be seriously considered and planned around. (Which is probably why no GPU-based audio software exists yet: that's cutting it close and it'd be a risk.)

-------

The 128 kB L1 cache of Apple's M1 means the L1 cache fits the most "instrument data" (or so I've been told). I'm neither an audio-engineer, nor an audio-programmer, nor an audio-user / musician / DJ or whatever. But when I talk to audio-users, those are the issues they talk about.
All the sampling I need. The only thing I'd gripe about is that the sampling rate is a mere 44.1 kHz CD quality; I'm not sure how I can live with such lo-fi sound.
 
For 32 "big" cores, heck 16 cores & above, what they'll need is something close to, if not better than, IF (Infinity Fabric). As Intel have found out, they don't grow that glue on trees anymore :slap:
Apple had better have something, really anything, similar; otherwise it's going to be a major issue no matter how or where their top-of-the-line chips end up!
 
They're likely gonna need a node beyond 5 nm; the big-core cluster on their current chip is huge, and 32 cores would mean a ridiculously large chip, not to mention that they would also probably need to increase the size of that system cache. Of course, some of those cores could be like the ones inside the small-core cluster, so it would be "32-core" just in name, really.
Why don't you go away? Every time there is news or leaks about Apple's progress, you're in the first post sh***ing on it. I have never seen you give them even an inch of slack without mocking them in the same sentence. So I have to assume you must be a troll or just an extremely irrational Apple hater.

Apple has already proved that their own cores can stand up to those of Intel and AMD, and even surpass them in some applications. This successor to the M1 will definitely give Apple computers an enormous performance boost. This new chip is planned for the high-end desktop, mind you. So this is almost definitely for the new Mac Pro that is supposed to come out in 2022. Bloomberg already talked about it the day after Apple revealed the M1 for the first time last month.

Furthermore: the GPU core count in these future high-end MacBooks and iMacs is supposed to go up massively. The iGPU of the M1 has just 7 or 8 cores, depending on the model. Now they're talking about an increase to 64 or even 128 GPU cores. These chips are going to beat all the AMD and Nvidia dedicated GPUs that Apple is currently offering. This is written in the actual source for this news post.

"Apple Preps Next Mac Chips With Aim to Outclass Top-End PCs"
https://www.bloomberg.com/news/arti...hest-end-pcs?srnd=technology-vp&sref=51oIW18F

And let's not ignore what's going to happen. Even if they have to slash all their old prices to win over hearts, they will do it. Apple is already doing it with the first iteration of notebooks. They want to increase demand as much as possible, and generating hype with low prices and deals is Apple's joker card, one they have never really needed to use. Until now. They have everything lined up for the big win here. All Apple. Total domination and control of the market by one company. This is Apple's dream. If they want demand to go up by as much as I think they are aiming for here, prices will go down. It's that simple. I don't think Apple is trying to just compete with Intel and AMD. The goal here is obviously to crush them. The first iteration of the M1 already showed that in some sense. And they would be stupid to let up now.
 
You mean a 32-core CPU + 128-core GPU will beat, say, the A100 80GB outright? Yeah, even if it's Apple, I doubt they'll pull that off; first of all, the cooling on such a chip will have to be extreme unless they're clocking & actively limiting both components to unrealistically low levels!
 
You mean a 32-core CPU + 128-core GPU will beat, say, the A100 80GB outright? Yeah, even if it's Apple, I doubt they'll pull that off; first of all, the cooling on such a chip will have to be extreme unless they're clocking & actively limiting both components to unrealistically low levels!
I was talking about consumer cards. I hope for maybe a 16-core CPU + 64-core GPU iMac with better cooling. That should easily manage 4K above 60 fps in most games, even on high settings. The 128-core GPU in just 1 or 2 more years almost sounds too good to be true.
 
First an in-house CPU, now an in-house GPU. Apple is looking to ditch the entire backbone of the current PC industry. Feels like going back in time, TBH, to when Apple closed off everything. If the performance is there, it can totally justify Apple doing so. Weird when consoles from MS and Sony look like PCs and the Mac is turning into a console-like closed-off platform.

Maybe that is what Intel / AMD / Nvidia need to come out with ever better products. Apple has the advantage of positive consumer perception (among its fans).

Apple has the advantage of optimizing almost every aspect of their systems vertically, which no other company does, and this is why they are going to make the best-performing AI systems, regardless of their fans. The M1 is a fact.
 
this is why they are going to make the best-performing AI systems, regardless of their fans.

That's particularly hilarious, because apparently not even Apple themselves think their machine-learning accelerators are good. Check this out: https://developer.apple.com/documentation/coreml/mlcomputeunits

You can use the GPU and CPU explicitly, but not the NPU; you can only vaguely let the API "decide". If it was that good, why don't they let people use it? Hint: it probably isn't that good.
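
For anyone who doesn't want to dig through the docs, this is roughly what that choice looks like in code (a minimal sketch; "SomeModel" is just a stand-in for whatever class Core ML generates for your model):

```swift
import CoreML

// MLComputeUnits lets you pin Core ML to the CPU, or to CPU+GPU, or hand the
// decision entirely to the framework with .all. There is no case that forces
// work onto the Neural Engine.
let config = MLModelConfiguration()
config.computeUnits = .all          // "let the API decide": the ANE runs your model only if Core ML picks it
// config.computeUnits = .cpuOnly   // explicit, CPU only
// config.computeUnits = .cpuAndGPU // explicit, but the ANE is skipped entirely

// let model = try SomeModel(configuration: config)   // hypothetical generated model class
```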
 
That's particularly hilarious, because apparently not even Apple themselves think their machine-learning accelerators are good. Check this out: https://developer.apple.com/documentation/coreml/mlcomputeunits

You can use the GPU and CPU explicitly, but not the NPU; you can only vaguely let the API "decide". If it was that good, why don't they let people use it? Hint: it probably isn't that good.
Looks pretty decent with TensorFlow, even though support is only in Alpha. Maybe we need the software to mature a bit, but it sounds capable enough.


 
Looks pretty decent with TensorFlow, even though support is only in Alpha. Maybe we need the software to mature a bit, but it sounds capable enough.



You'd get the same results on any half-decent integrated GPU (apart from Intel's, I guess, but that shouldn't surprise anyone). The only reason it runs fast when the data is small is not because the GPU itself is amazing; it's simply because you don't need to wait for the data to be transferred across the PCIe connection, since it's using the same pool of memory as the rest of the system (and some pretty large caches). When the data set grows in size, that becomes less and less important and the M1 GPU gets crushed, not to mention that the 2080 Ti isn't even the fastest card around anymore. Anyway, GPUs are GPUs; not much differs between them. I am sure that a dedicated GPU of theirs with a million billion cores will be faster; it's really just a matter of who can make the biggest GPU.
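
That unified-memory point is easy to see in Metal, by the way (a rough sketch; nothing here is M1-specific beyond which storage mode ends up being the fast path):

```swift
import Metal

// On Apple silicon the CPU and GPU share one pool of memory, so a buffer created
// with .storageModeShared is visible to both sides with no PCIe copy at all.
let device = MTLCreateSystemDefaultDevice()!
var samples = [Float](repeating: 0, count: 1_000_000)
let shared = device.makeBuffer(bytes: &samples,
                               length: samples.count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

// On a discrete GPU the fast path is a .storageModePrivate buffer, meaning the
// same data has to be blitted across PCIe before a kernel can touch it; that
// transfer cost stops mattering once the data set is big enough that raw GPU
// throughput dominates.
let privateOnly = device.makeBuffer(length: samples.count * MemoryLayout<Float>.stride,
                                    options: .storageModePrivate)
```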

I was talking about the actual ML accelerator, which Apple chose not to explicitly expose; that's a sign they're not that confident in the one thing that could really set them apart. If you can't choose the NPU in their own API, I don't think TensorFlow will get support for it any time soon.


This guy is trying to get it to run code arbitrarily, and let's just say Apple goes out of their way to make that really, really hard.
 
I was talking about the actual ML accelerator, which Apple chose not to explicitly expose; that's a sign they're not that confident in the one thing that could really set them apart. If you can't choose the NPU in their own API, I don't think TensorFlow will get support for it any time soon.
Maybe there is a reason why they haven't exposed it. Maybe they've exposed the parts you need to know. I watched some of this video, and the thing that strikes me is that the guy doesn't know Swift and that he's trying to use disassembled APIs to interact with it, and a lot of the calls he was looking at (at least in the part I watched) were things I would expect the kernel to handle. With that said, I get the distinct impression that this is 4 hours of a guy trying to figure out the platform he's working on.

I've taken a brief look at Apple's documentation and he seems to be making it way harder than it has to be. Apple has simplified a lot of parts of model processing, which is why the API is so thin. I suspect that, between not understanding the platform or Swift while trying to reverse engineer system-level calls, he's probably going down the wrong rabbit hole.
You'd get the same results on any half-decent integrated GPU (apart from Intel's, I guess, but that shouldn't surprise anyone). The only reason it runs fast when the data is small is not because the GPU itself is amazing; it's simply because you don't need to wait for the data to be transferred across the PCIe connection, since it's using the same pool of memory as the rest of the system (and some pretty large caches). When the data set grows in size, that becomes less and less important and the M1 GPU gets crushed, not to mention that the 2080 Ti isn't even the fastest card around anymore.
Do I need to remind you that the M1 is literally Apple's entry level chip for the laptop/desktop market? Seems to do pretty well for an entry level product.
 
Maybe there is a reason why they haven't exposed it. Maybe they've exposed the parts you need to know. I watched some of this video, and the thing that strikes me is that the guy doesn't know Swift and that he's trying to use disassembled APIs to interact with it, and a lot of the calls he was looking at (at least in the part I watched) were things I would expect the kernel to handle. With that said, I get the distinct impression that this is 4 hours of a guy trying to figure out the platform he's working on.

They've exposed nothing, that's the point. He's trying to get the NPU to always execute the code he wants, which Apple does not allow; that's the problem he's trying to solve. Using the API calls is useless, since it will always fall back to the GPU or CPU and you have no control over that.
 
They've exposed nothing, that's the point. He's trying to get the NPU to always execute the code he wants, which Apple does not allow; that's the problem he's trying to solve. Using the API calls is useless, since it will always fall back to the GPU or CPU and you have no control over that.
Without knowing more about how Apple implemented the hardware, it's hard to say, but there very well could be reasons for that. It's very plausible that the AI circuitry consumes a lot more power than the CPUs or GPUs. It could be power management that dictates where it's run. Perhaps there are thermal reasons for it, or memory-pressure reasons, or task-complexity reasons. Maybe multiple tasks are running. Maybe it's a laptop in a low-power mode, forcing the work onto the low-power CPU cores, compared to, say, a Mac Mini, which would schedule the work differently with fewer power and thermal limitations. Apple probably suspects that they can better choose where the code needs to run than the developer, and that the software shouldn't be specifically tied to a particular hardware implementation either.

As a software engineer, when I see something like this, it makes me think that it was done for a reason, not just for the sake of blackboxing everything. I know that Apple tends to do that, but they do that when they think they can do it better for you. Honestly, that's not a bad thing. I don't want to have to think about which part of the SoC is best going to run my code in the state the machine is currently in. That's a decision best left to the OS, in my opinion, particularly when you're tightly integrating all of the parts of a pretty complicated SoC like Apple is doing with their chips.
 
Great news but it also makes me glad I got the last of the Windows supported MB Pros right when the 5600m launched.

It's no gaming powerhouse, but it's a nice way to have a little near-all-in-one of both worlds and flexibility on the go.

While I wish Windows support were there, sadly I bet GPU support would be there too. It's interesting to see where the scene is in a year's time. I was skeptical of the M1 but glad to see positive traction.

(And before folks ask: there's no other way to work in Sketch and flip over to relaxing by blasting-faces in Borderlands 3.)
 