
Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

Good for them. It would be nicer if it didn't come from Apple, as I have absolutely no intention of buying anything from them.
 
The M1 Max, at least on paper, makes every other CPU seem like a decade out of date... How can this be?

The only crazy thing I'm seeing so far is the high LPDDR5 bandwidth of 400GB/s.

I'm not really seeing anything else super-special about this actually. EDIT: 5nm is also cool, but that's largely TSMC + a function of Apple's money. TSMC is very advanced, and Apple can afford the best.
 
That's pretty big. I'm curious how this memory system works.

It's big enough that I'm instinctively thinking that's a typo. 400GB/s is huge for a CPU / iGPU. The only systems close to that are the Xbox / PS5 game consoles with GDDR graphics RAM.
The AMD 4700S has 256-bit GDDR6-14000, i.e. the PS5 APU recycled for the PC market.
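For what it's worth, the 400GB/s figure checks out on paper. A rough back-of-the-envelope sketch, assuming a 512-bit LPDDR5-6400 interface on the M1 Max (the configuration most coverage points to), with the 4700S's GDDR6 setup for comparison:

```swift
// Peak theoretical bandwidth = (bus width in bytes) * (transfer rate)
let m1MaxWidthBytes = 512.0 / 8.0     // assumed 512-bit LPDDR5 interface
let lpddr5Rate      = 6.4e9           // assumed LPDDR5-6400, transfers per second
print(m1MaxWidthBytes * lpddr5Rate / 1e9)   // ≈ 409.6 GB/s, i.e. Apple's "400GB/s"

let gddr6WidthBytes = 256.0 / 8.0     // 4700S / PS5-class 256-bit bus
let gddr6Rate       = 14.0e9          // GDDR6-14000, 14 Gbps per pin
print(gddr6WidthBytes * gddr6Rate / 1e9)    // = 448 GB/s, so 400GB/s isn't a typo, just console-class
```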
 
Oh shiiiiiiiiiiit ....
The M1 was running Witcher 3 (x86) at 30fps; I really want to see what this monster can do with 4x the GPU power on games that run natively.
 
We know how the M1 performs - in terms of IPC it trounces both Intel and AMD, matching or beating their peak single-core performance at roughly 3/5-2/3 of the clock speed (3.1GHz vs 4.8-5.3-ish). These chips more than double the core counts, and double/quadruple the memory bandwidth to feed the cores. Also, Apple's recent chips have absolutely insane amounts of cache (at equally insane latencies). This will be a beast; it just needs software to make use of the power, which it likely will have (Adobe CS etc. are already native).

The interfaces are 256-bit (M1 Pro) and 512-bit (M1 Max). Probably a bit power hungry, sure, but they are mounted extremely close to the SoC, on the same package, so they've likely optimized for that. Plus, these are 40-60W SoCs. The memory power isn't going to be an issue.
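To put a rough number on the IPC claim - a sketch using the clocks quoted above and assuming roughly equal single-thread performance, which is what the SPEC results show:

```swift
// If single-thread performance is roughly equal, perf ≈ IPC × clock means
// the IPC advantage is roughly the inverse of the clock ratio.
let m1Clock = 3.1                  // GHz, M1 Firestorm (figure from the post above)
let x86Boost = [4.8, 5.3]          // GHz, roughly Zen 3 / Intel single-core boost
for clock in x86Boost {
    print(clock / m1Clock)         // ≈ 1.55 and ≈ 1.71, i.e. ~55-70% higher IPC at parity
}
```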
Raphael and Alder Lake-S will destroy the overrated M1
 
Oh shiiiiiiiiiiit ....
The M1 was running Witcher 3 (x86) at 30fps; I really want to see what this monster can do with 4x the GPU power on games that run natively.
Easy: look at the Xbox Series S or PS5; the slide says 10.4 TFLOPS.
 
Oh shiiiiiiiiiiit ....
The M1 was running Witcher 3 (x86) at 30fps; I really want to see what this monster can do with 4x the GPU power on games that run natively.

30 fps, 1080p, and lowest settings. Maybe you will get 30 fps, 1080p, and max settings.
 
CPUs have to transfer data to the GPU all the time (and occasionally the other way, a GPU -> CPU transfer). One of the key advantages of an SoC is that this "data transfer" can take place in L3 cache instead of over system memory.

I find it hard to believe that Microsoft would design an SoC like the Xbox Series X and ignore this simple and useful optimization. I see that Microsoft is playing cute games with its 10+6 GB layout, but I'm pretty sure the point is just that CPUs use less memory bandwidth, so the 10GB of fast RAM is intended for the GPU and the 6GB of slow RAM for the CPU. But both the CPU and GPU should have access to both halves.

If for no other reason than to enable the "no copy" methodology for CPU -> GPU data transfers (why ever copy data when the GPU can simply read the RAM itself?). In the dGPU world, you need to transfer the data over PCIe because the VRAM is physically a different chip. But in Xbox Series X land, VRAM and RAM are literally the same chips, so no copying is needed.
1. For games, shared memory usage is relatively minor. PCs have Resizable BAR (ReBAR), which enables the CPU to directly access the entire GPU's VRAM. The CPU wouldn't be able to keep up with a dGPU's large-scale scatter-gather capability anyway.

2. Shared memory has its downsides with contention and context-switch overheads: CPU IO access can gimp the GPU's burst-mode IO access, e.g. frame buffer burst IO access shouldn't be disturbed.

The late-1980s Amiga's Chip RAM is shared memory between the CPU and the iGPU (the custom chips).
 

$6K, with what sounds like a $3.5K minimum spend for 32 GB of RAM, and in some cases it's said (look around, I'm not posting links to other tech sites) to be beaten by the outgoing Intel 9th-gen chips, so there's that.
Go on their site; I just priced it out. $4,200 for the Max, 64GB of memory, and a 2TB drive, which is about what I paid (sans discounts I can get) for mine in my specs. That's really not bad considering what you're getting if you're comparing it to the previous 16". In that respect, Apple has kept pricing consistent, but has theoretically given it an absolutely massive performance uplift within the same power constraints.

Edit: Mind you that these are US prices in USD.
 
Unified is exactly like the PS5 and Xbox.
One pool of memory for any use.
So Apple clearly wasn't first and is doing something similar.
The GPU or CPU can make memory calls in those.
Though inevitably the MMU is going to be on the edge of the SoC, on a bus.
The 1985-era Amiga 1000 had a shared memory design.
 
Go on their site; I just priced it out. $4,200 for the Max, 64GB of memory, and a 2TB drive, which is about what I paid (sans discounts I can get) for mine in my specs. That's really not bad considering what you're getting if you're comparing it to the previous 16". In that respect, Apple has kept pricing consistent, but has theoretically given it an absolutely massive performance uplift within the same power constraints.

Edit: Mind you that these are US prices in USD.

I wonder how people can seriously consider buying that crap when they know, or at least should know, what they do with their anti-consumer BS?
 
I wonder how people can seriously consider buying that crap when they know, or at least should know, what they do with their anti-consumer BS?
For being so anti-consumer, they sure do make a good machine for work and play if you can afford it.
 
Interesting they are still based on the A14 platform and not A15.
 
Man, does Apple make me laugh with their closed, limited OSes and their potato mobile processors.

Oh yeah, it's just a mobile SoC built using the most advanced node available in the world that can probably beat every other mobile SoC around singlehandedly. Nothing worth a second of actual interest. /s

:rolleyes:
 
Interesting they are still based on the A14 platform and not A15.
An A15-based chip will probably be called M2 or something. I think Apple is still trying to find out what happens when these chips are scaled up.
 
Oh yeah, it's just a mobile SoC built using the most advanced node available in the world that can probably beat every other mobile SoC around singlehandedly. Nothing worth a second of actual interest. /s

:rolleyes:
Proof or it didn't happen.
 
Interesting they are still based on the A14 platform and not A15.
They probably designed this based on the A14 while another team was working on the A15.
 
It's a bit strange for you to bring up the Epyc/TR comparison just to then say it's not a valid comparison once people get into why this is likely to be more efficient.
That's because, off hand, I can't think of any other chip(s) that move such vast amounts of data between massive cores (in Apple's case it's now also the GPU cores) and pay a heavy (energy) price for that. Moving (lots of) data quickly is the next big hurdle in computing, and the SoC approach for now seems to be more efficient. The reason it isn't directly comparable is that even now the top-end server chips should beat Apple in most tasks they're actually designed for, but they're also generally less efficient. The SoC approach isn't really scalable beyond low double-digit CPU cores, especially if you're putting such a massive GPU in there!
 
Seems everybody should drop EPYC processors for their servers, they are obsolete :roll:
It's been a long time since I had such a good laugh.
No wonder Apple makes millions; you guys would believe just about anything they say.
 
Right, no one's saying that unless you meant some other poster?

The EPYC/TR way is meant for massive numbers of CPU cores, which Apple doesn't seem to need right now. That's also in part due to the dedicated accelerators they're using for a lot of tasks. IIRC Zen 4 (5?) will introduce similar on-die accelerators, probably courtesy of their Xilinx acquisition. My biggest curiosity then would be how efficient their monolithic (APU) dies would be vs. the M1 & now the M1 Pro & Max.
 
Oh yeah, it's just a mobile SoC built using the most advanced node available in the world that can probably beat every other mobile SoC around singlehandedly. Nothing worth a second of actual interest. /s

:rolleyes:
They wouldn't beat AMD on the same node, though. Zen 4 on 5nm will crush this expensive chip.
 
Or AMD is going to be doing that with Zen 4/RDNA3. The consoles' APUs are custom designs, not straight-up Zen 2 designs. They have features not in the desktop APUs.
And? Unified memory isn't just a hardware feature, it's a hardware+OS feature. And there's no indication that either the XSX or the PS5 has truly unified memory.
What's the difference? Is the memory "truly unified" only if memory access is governed by a single MMU for both CPU and GPU?
No, it must also be accessible to the entire system without the need for copying.
I mean... it's called the PS5 / Xbox Series X.

I'm pretty sure they have unified memory. Hell, CUDA + CPU / OpenCL + CPU have unified memory; it's just emulated over PCIe. The PS5 / Xbox Series X literally have the same RAM working for both the iGPU side and the CPU side.
It's still walled off, and needs copying, thus it isn't actually unified.
Unified is exactly like the PS5 and Xbox.
One pool of memory for any use.
So Apple clearly wasn't first and is doing something similar.
The GPU or CPU can make memory calls in those.
Though inevitably the MMU is going to be on the edge of the SoC, on a bus.
See above. It is only truly unified if every component has full access to RAM, which is what Apple is claiming here. No PC or current x86-based platform has that.
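As an aside, this is roughly what Apple's claim looks like from the programmer's side. A minimal Metal sketch (Swift, Apple Silicon), where a single .storageModeShared allocation is addressed directly by both the CPU and GPU with no staging copy; the buffer size and contents here are just for illustration:

```swift
import Metal

// Minimal sketch: one allocation, visible to both CPU and GPU on Apple Silicon.
let device = MTLCreateSystemDefaultDevice()!
let count  = 1024
let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!   // unified, CPU-visible memory

// The CPU writes straight into the same memory the GPU will later read.
let values = buffer.contents().bindMemory(to: Float.self, capacity: count)
for i in 0..<count { values[i] = Float(i) }

// A compute encoder would bind `buffer` directly from here.
// On a dGPU the equivalent flow needs a private (VRAM) buffer plus a blit/copy over PCIe.
```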
CPUs have to transfer data to the GPU all the time (and occasionally the other way, a GPU -> CPU transfer). One of the key advantages of an SoC is that this "data transfer" can take place in L3 cache instead of over system memory.

I find it hard to believe that Microsoft would design an SoC like the Xbox Series X and ignore this simple and useful optimization. I see that Microsoft is playing cute games with its 10+6 GB layout, but I'm pretty sure the point is just that CPUs use less memory bandwidth, so the 10GB of fast RAM is intended for the GPU and the 6GB of slow RAM for the CPU. But both the CPU and GPU should have access to both halves.

If for no other reason than to enable the "no copy" methodology for CPU -> GPU data transfers (why ever copy data when the GPU can simply read the RAM itself?). In the dGPU world, you need to transfer the data over PCIe because the VRAM is physically a different chip. But in Xbox Series X land, VRAM and RAM are literally the same chips, so no copying is needed.
But copying is needed for those, as the CPU and GPU have discrete areas of memory set aside for them.
Isn't that the case with every Intel and AMD processor with integrated graphics? At least since Haswell for Intel (AnandTech) and since Kaveri for AMD (Wikipedia).
No, iGPUs have system memory set aside for them - some static, some dynamic. This memory is not accessible to the CPU, and regular system memory is not accessible to the iGPU, necessitating copying data between the two.
Anandtech is speculating it’s probably 64MB on the Max, 32MB on the Pro. They are looking at the actual die shots (provided in the presentation, interestingly), not the illustrative diagram Apple used in the presentation.
That's lower than I would have expected, but then diagrams are always misleading. I wonder if that judgement is correct though, as the new SLC blocks look much bigger than on the M1, which had 16MB. On the M1 the SLC block is slightly larger than two GPU "cores", on the M1P/M it's larger than four. Of course, not all of this is actually cache, and a lot of it is likely interconnects and other stuff, but 2x16MB still seems low to me.
Yeah, it's not a new feature at all.



But as Wirko has pointed out: this isn't new at all. Intel / AMD chips have been doing zero-copy transfers on Windows for nearly a decade now on their iGPUs.

Yes, that is even on Windows 10, which is Hyper-V virtualized for security purposes. (The most secure parts of Windows start up in a separate VM these days, so that not even a kernel-level hack can reach those secrets... unless it also includes a VM escape of some kind.)

Now don't get me wrong: the Xbox Series X has a weird / complicated memory scheme going on. But I'd still expect that this extremely strange memory scheme is unified, much akin to AMD's Kaveri or the Intel iGPU stuff that you'd find on any typical iGPU from the past decade.
It clearly isn't, when they wall off sections of RAM for the OS, CPU software and GPU software. Discrete memory regions imply that copying is needed between them, which means it isn't unified.
The M1 Max, at least on paper, makes every other CPU seem like a decade out of date... How can this be?
Money, mainly. Apple can afford to outspend everyone on R&D, by a huge margin.
1. For games, shared memory usage is relatively minor. PCs have Resizable BAR (ReBAR), which enables the CPU to directly access the entire GPU's VRAM. The CPU wouldn't be able to keep up with a dGPU's large-scale scatter-gather capability anyway.

2. Shared memory has its downsides with contention and context-switch overheads: CPU IO access can gimp the GPU's burst-mode IO access, e.g. frame buffer burst IO access shouldn't be disturbed.

The late-1980s Amiga's Chip RAM is shared memory between the CPU and the iGPU (the custom chips).
ReBAR doesn't have anything to do with this - it allows the CPU to write to the entire VRAM rather than smaller chunks, but the CPU still can't work off of VRAM - it needs copying to system RAM for the CPU to work on it. You're right that shared memory has its downsides, but with many times the bandwidth of any x86 CPU (and equal to many dGPUs) I doubt that will be a problem, especially considering Apple's penchant for massive caches.
That's because, off hand, I can't think of any other chip(s) that move such vast amounts of data between massive cores (in Apple's case it's now also the GPU cores) and pay a heavy (energy) price for that. Moving (lots of) data quickly is the next big hurdle in computing, and the SoC approach for now seems to be more efficient. The reason it isn't directly comparable is that even now the top-end server chips should beat Apple in most tasks they're actually designed for, but they're also generally less efficient. The SoC approach isn't really scalable beyond low double-digit CPU cores, especially if you're putting such a massive GPU in there!
Yes, but that's precisely why I pointed out that the M1P/M being monolithic allows for huge power savings, as they don't need off-die interfaces for most of this. Keeping data on silicon is a massive power saving. Of course they're working with 10 (8+2) CPU cores and a 16- or 32-"core" GPU, not a 32-64-core CPU, so the interfaces can also be much, much simpler.
They wouldn't beat AMD on the same node, though. Zen 4 on 5nm will crush this expensive chip.
That's debatable. Apple's architecture team has been doing some incredible work these past years. Their cache architecture (which is something that doesn't gain that much from node changes) is far superior to anything else (look at the cache access benchmarks in the AnandTech article I linked above), and their huge CPU cores have a >50% IPC lead over both Intel and AMD, matching their performance at much lower clocks (in part thanks to those huge, low-latency caches, but not only that). A higher-core-count chip from AMD will still likely win in a 100% MT workload, but the power difference is likely to be significant.
Proof or it didn't happen.
Here's Anandtech's SPEC2006 and SPEC2017 testing of the M1. Those are industry standard benchmarks for ST performance, and the M1 rivals the 5950X at a fraction of the power, and much lower clocks. These chips use the same architecture but with more cache, more RAM, and a much higher power budget.
 