Or AMD is going to be doing that with Zen 4/RDNA3. The consoles' APUs are custom designs, not straight-up Zen 2 designs. They have features not found in the desktop APUs.
And? Unified memory isn't just a hardware feature, it's a hardware+OS feature. And there's no indication that either the XSX or the PS5 has truly unified memory.
What's the difference? Is the memory "truly unified" only if memory access is governed by a single MMU for both CPU and GPU?
No, it must also be accessible to the entire system without the need for copying.
I mean... it's called the PS5 / Xbox Series X.
I'm pretty sure they have unified memory. Hell, CUDA + CPU / OpenCL + CPU has unified memory. It's just emulated over PCIe. The PS5 / Xbox Series X actually have the same, literal RAM working for both the iGPU side and the CPU side.
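For anyone who hasn't seen it, this is roughly what that emulation looks like (a minimal sketch; the kernel and buffer size are made up for illustration): cudaMallocManaged hands out a single pointer valid on both CPU and GPU, and on a PCIe dGPU the runtime keeps up that illusion by migrating pages over the bus on demand.

```cuda
// Sketch of CUDA "unified" (managed) memory on a discrete GPU.
// One pointer works on both sides, but on a PCIe card the runtime
// services that illusion by migrating pages over the bus.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // one allocation, one pointer
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes: pages live on the host
    doubleAll<<<(n + 255) / 256, 256>>>(data, n); // GPU touches them: pages migrate over PCIe
    cudaDeviceSynchronize();
    printf("%f\n", data[0]);                      // CPU reads: pages migrate back
    cudaFree(data);
    return 0;
}
```

On a console-style shared-RAM design, that same single-pointer model needs no page migration at all, which is the whole point of the comparison.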
It's still walled off, and needs copying, thus it isn't actually unified.
Unified is exactly like the PS5 and Xbox.
One pool of memory for any use.
So Apple clearly wasn't first and is doing something similar.
The GPU or CPU can make memory calls in those.
Though inevitably the MMU is going to be on the edge of the SoC, on a bus.
See above. It is only truly unified if every component has full access to RAM, which is what Apple is claiming here. No PC or current x86-based platform has that.
CPUs have to transfer data to the GPUs all the time (and occasionally, though rarely, there's a GPU->CPU transfer). One of the key advantages of an SoC is that this "data transfer" can take place in L3 cache instead of over system memory.
I find it hard to believe that Microsoft would design an SoC like the Xbox Series X and ignore this simple and useful optimization. I see that Microsoft is playing cute games with its 10+6 GB layout, but I'm pretty sure they're just saying that CPUs use less memory bandwidth, so the 10 GB of fast RAM + 6 GB of slow RAM split is intended for the CPU to use the slow RAM and the GPU the fast RAM. But both CPU and GPU should have access to both halves.
If for no other reason than to enable the "no copy" methodology for CPU -> GPU data transfers (why ever copy data when GPUs can simply read the RAM themselves?). In the dGPU world, you need to transfer the data over PCIe because the VRAM is physically a different chip. But in Xbox Series X land, VRAM and RAM are literally the same chips, so no copying is needed.
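Here's a rough sketch of that no-copy idea in CUDA terms (illustrative kernel and sizes, not production code): pin a host buffer as mapped memory and the GPU reads the same physical pages directly, with no cudaMemcpy anywhere. On a PCIe dGPU those reads still cross the bus; on shared-RAM hardware like the consoles' APUs, the equivalent mapping is genuinely zero-copy.

```cuda
// Sketch of the "no copy" path: pin a host buffer, hand the GPU an
// alias to the same physical pages, and never call cudaMemcpy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumAll(const float *in, float *out, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += in[i]; // GPU reads the host buffer in place
    *out = s;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost); // required on older setups
    const int n = 1024;
    float *in_h, *out_h, *in_d, *out_d;
    cudaHostAlloc(&in_h,  n * sizeof(float), cudaHostAllocMapped); // pinned, GPU-visible
    cudaHostAlloc(&out_h, sizeof(float),     cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&in_d,  in_h,  0); // an alias, not a copy
    cudaHostGetDevicePointer((void **)&out_d, out_h, 0);
    for (int i = 0; i < n; ++i) in_h[i] = 1.0f;
    sumAll<<<1, 1>>>(in_d, out_d, n);
    cudaDeviceSynchronize();
    printf("sum = %f\n", *out_h); // visible on the CPU with no explicit transfer
    cudaFreeHost(in_h);
    cudaFreeHost(out_h);
    return 0;
}
```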
But copying is needed for those, as the CPU and GPU have discrete areas of memory set aside for them.
Isn't that the case with every Intel and AMD processor with integrated graphics? At least since Haswell for Intel (AnandTech) and since Kaveri for AMD (Wikipedia).
No, iGPUs have system memory set aside for them - some static, some dynamic. This memory is not accessible to the CPU, and regular system memory is not accessible to the iGPU, necessitating copying data between the two.
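For contrast with the mapped-memory sketch above, that carve-out model is essentially the classic discrete-memory flow. In CUDA terms (used here only as an analogue, with placeholder sizes), it's two separate allocations and an explicit staging copy, because neither side can work out of the other's pool:

```cuda
// The carve-out model: separate pools, so data must be staged across.
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1024 * sizeof(float);
    float *host = (float *)malloc(bytes); // CPU-only pool
    float *dev  = nullptr;
    cudaMalloc(&dev, bytes);              // GPU-only pool
    // ... CPU fills host[] ...
    cudaMemcpy(dev, host, bytes,
               cudaMemcpyHostToDevice);   // the copy a truly unified pool removes
    cudaFree(dev);
    free(host);
    return 0;
}
```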
AnandTech is speculating it's probably 64 MB on the Max, 32 MB on the Pro. They are looking at the actual die shots (provided in the presentation, interestingly), not the illustrative diagram Apple showed.
www.anandtech.com
That's lower than I would have expected, but then diagrams are always misleading. I wonder if that judgement is correct though, as the new SLC blocks look much bigger than on the M1, which had 16MB. On the M1 the SLC block is slightly larger than two GPU "cores"; on the M1P/M it's larger than four. Of course, not all of this is actually cache, and a lot of it is likely interconnects and other stuff, but 2x16MB still seems low to me.
Yeah, it's not a new feature at all.
But as Wirko has pointed out: this isn't new at all. Intel and AMD chips have been doing zero-copy transfers on their iGPUs under Windows for nearly a decade now.
Yes, even on Windows 10, which is Hyper-V virtualized for security purposes. (The most secure parts of Windows start up in a separate VM these days, so that not even a kernel-level hack can reach those secrets... unless it also includes a VM break of some kind.)
Now don't get me wrong: the Xbox Series X has a weird, complicated memory scheme going on. But I'd still expect that this extremely strange memory scheme is unified, much akin to AMD's Kaveri or the Intel iGPU setups you'd find on any typical iGPU of the past decade.
It clearly isn't, when they wall off sections of RAM for the OS, CPU software and GPU software. Discrete memory regions imply that copying is needed between them, which means it isn't unified.
The M1 Max, at least on paper, makes every other CPU seem a decade out of date... How can this be?
Money, mainly. Apple can afford to outspend everyone on R&D, by a huge margin.
1. For games, the shared-memory use case is relatively minor. PCs have Resizable BAR (ReBAR), which enables the CPU to directly access the entire GPU VRAM. The CPU wouldn't be able to keep up with a dGPU's large-scale scatter-gather capability anyway.
2. Shared memory has its downsides, with context-switch overheads. CPU I/O access can hamper the GPU's burst-mode I/O; e.g., frame-buffer burst accesses shouldn't be disturbed.
The late-1980s Amiga's Chip RAM was shared memory between the CPU and the iGPU (the custom chips).
ReBAR doesn't have anything to do with this - it allows the CPU to write to the entire VRAM rather than through a small window, but the CPU still can't work off of VRAM - data needs copying to system RAM for the CPU to work on it. You're right that shared memory has its downsides, but with many times the bandwidth of any x86 CPU (and equal to many dGPUs), I doubt that will be a problem, especially considering Apple's penchant for massive caches.
That's because, offhand, I can't think of any other chip that moves such vast amounts of data between massive cores (in Apple's case, now the GPU cores too) and pays a heavy energy price for it. Moving lots of data quickly is the next big hurdle in computing, and the SoC approach currently seems more efficient. That's also why it isn't directly comparable: even now, top-end server chips should beat Apple in most tasks they're actually designed for, but they're also generally less efficient. The SoC approach isn't really scalable beyond low-double-digit CPU core counts, especially if you're putting such a massive GPU in there!
Yes, and that's precisely why it's worth pointing out that the M1P/M are monolithic: it allows for huge power savings, as they don't need off-die interfaces for most of this. Keeping data on silicon is a massive power saving. Of course, they're working with 10 (8+2) CPU cores and a 16- or 32-"core" GPU, not a 32-64-core CPU, so the interfaces can also be much, much simpler.
They wouldn't beat AMD on the same node though... Zen 4 on 5 nm will crush this expensive chip.
That's debatable. Apple's architecture team has been doing some incredible work in recent years. Their cache architecture (which is something that doesn't gain that much from node changes) is far superior to anything else (look at the cache-access benchmarks in the AnandTech article I linked above), and their huge CPU cores have a >50% IPC lead over both Intel and AMD, matching their performance at much lower clocks (in part thanks to those huge, low-latency caches, but not only that). A higher-core-count chip from AMD will still likely win in a 100% MT workload, but the power difference is likely to be significant.
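As a back-of-the-envelope check with round, illustrative numbers (per-core performance scales roughly as IPC times clock, and the M1's big cores run around 3.2 GHz versus ~5 GHz x86 boost clocks):

```latex
% perf per core ~ IPC x clock (IPC relative to x86 = 1.0)
\mathrm{perf} \propto \mathrm{IPC} \times f_{\mathrm{clock}}:
\qquad 1.5 \times 3.2\,\mathrm{GHz} = 4.8
\quad \text{vs.} \quad
1.0 \times 5.0\,\mathrm{GHz} = 5.0
```

Which is roughly parity, exactly what the benchmarks show.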
Here's AnandTech's SPEC2006 and SPEC2017 testing of the M1. Those are industry-standard benchmarks for ST performance, and the M1 rivals the 5950X at a fraction of the power and much lower clocks. The new chips use the same architecture, but with more cache, more RAM, and a much higher power budget.