Wednesday, July 14th 2021
AMD Zen 4 Desktop Processors Likely Limited to 16 Cores, 170 W TDP
We have recently seen several reputable rumors confirming that AMD's Zen 4 Raphael desktop processors will be limited to 16 cores across two compute chiplets. There were previous rumors of a 24-core model with three chiplets, however that now seems unlikely. While the core counts won't increase, some SKUs may see a TDP increase up to 170 W, which should offer some performance uplift. AMD is expected to debut its 5 nm Zen 4 Raphael desktop processors in 2022, which will come with support for PCIe 5.0 and DDR5. The processors will switch to the new AM5 LGA1718 socket and will compete with Intel's Alder Lake-S successor, Raptor Lake, which could feature 24 cores.
Source:
@patrickschur_
75 Comments on AMD Zen 4 Desktop Processors Likely Limited to 16 Cores, 170 W TDP
Looks like AMD went from "moar cores" to "moar cache" and that's good. 16 MB L3 on the 11900K is a laughing stock.
With its 128 kB L1 cache, double the execution width, and a far larger reorder buffer (~700 instructions vs ~300ish on AMD Zen / Intel Skylake), Apple's M1 proves that there's still a market for "fewer, better cores". I reject discussion points about the x86 decoder width (ARM had a smaller decoder width than x86 for years; the reason no one made an 8-wide decoder was that no one thought there was a market for one, IMO. If Intel / AMD wanted to do it, I'm pretty sure they could).
Wider cores vs more cores vs SIMD-width is an interesting problem. There's lots of different ways to configure a CPU-core, and this competition is quite exciting. We're seeing different designs again, after years of stagnation.
Adding L3 cache (the "stacked" SRAM) to 96 MB per chiplet will likely improve instructions-per-clock even if the cores are unchanged. In fact, Apple's M1 chip is said to have some of the best "uncore" features (features / benchmarks from outside of a core), such as ARM's relaxed memory model PLUS support for total store order for those x86 emulators (Rosetta). Apple's chip also seems to have among the best latency to/from its DRAM modules.
So even if "cores" are stagnating, there are many ways to improve a chip. AMD's I/O chip is clearly a bottleneck (that solved other bottlenecks). I'm sure that future advancements in that I/O chip will have dramatic improvements to Ryzen / Threadripper / EPYC, even if the cores themselves remain mostly unchanged. And even then: I expect AMD will also be working on improving those cores. The march of progress never stops in the tech world.
I mostly use the M1 as proof that single-threaded performance can get better. Doubling the execution pipelines and decoder width is the "obvious" way to improve single-threaded performance... and could very well apply to Intel / AMD if they had the will to do it. I'm not sure if the tradeoff is worth it, however: having 8 cores of Zen 3 size vs 4 cores of M1 size is... probably more beneficial to the 8-core side.
But I don't expect the consumer market to go beyond 16 cores yet. Having 32 cores of Zen 3 size vs 16 cores of M1 size... well... that probably favors the M1 side, because not even x265 scales well above 16 threads.
There's a bit of Amdahl's law going on, and a bit of Gustafson's law going on.
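To put rough numbers on that, here's a minimal sketch; the 95% parallel fraction is an assumed figure, not a measurement of x265.

def amdahl(p, n):
    # Fixed-size speedup: the serial fraction (1 - p) caps scaling.
    return 1.0 / ((1.0 - p) + p / n)

def gustafson(p, n):
    # Scaled speedup: the problem is allowed to grow with the core count.
    return (1.0 - p) + p * n

p = 0.95  # assumed parallel fraction
for n in (8, 16, 32):
    print(f"{n:2d} cores: Amdahl {amdahl(p, n):5.2f}x, Gustafson {gustafson(p, n):5.2f}x")

With p = 0.95, doubling from 16 to 32 cores only lifts the fixed-size speedup from about 9.1x to about 12.6x, which matches the x265 observation above; Gustafson's view only helps if the workload itself grows with the core count.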
And the obvious.
On the other hand, I would gladly take much better single-core performance; I have use cases where it's more important!
My CPU was clocked at 4 GHz. I also have an i7-920 that I swapped in for testing, and with the same setup I had a much lower frame rate and drops; you clearly felt the two missing cores. That is from my first-hand experience, and you're going to tell me I'm wrong?
AMD, in 2 generations, made a bigger % change than Intel did in 5 generations.
AMD got us from 4 cores average to 8 cores average pretty fast, and then focused on making the cores faster. That works for me (and most people) because the majority of software has yet to catch up and use all those extra threads (which is why 6-core chips like the 5600X are still amazing for gaming).
Honestly, what normal person, even a power user, would want more than 16 cores? They make Threadripper for a reason; you pay to play if you need a workstation-class CPU.
Even as someone who runs fluid sims and such, my next CPU will be a 12-core 6900X.
And that also proves my point: you talk about resolution and graphics. Like I said, things that can be easily parallelized are being run on accelerators like GPUs.
It's true that you can increase the difficulty of the work in each section that can run independently of the others, but you will still be limited by how fast you can run the single-threaded portion of the code. For example, in your chess example, you still have to determine which moves you want to go with. You also still have to allocate all the moves to the cores so they can calculate them without doing duplicate work.
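A minimal sketch of that split, with hypothetical generate_moves() and evaluate() stand-ins (not from any real engine): the per-move evaluation parallelizes nicely, but generating the moves and picking the winner stay serial.

from multiprocessing import Pool

def generate_moves(position):
    # Serial portion: one thread walks the rules of the game.
    return ["e4", "d4", "Nf3", "c4"]  # placeholder move list

def evaluate(task):
    # Parallel portion: each worker scores one candidate move.
    position, move = task
    return move, sum(map(ord, move)) % 100  # placeholder score

def best_move(position, workers=8):
    moves = generate_moves(position)                  # serial
    with Pool(workers) as pool:                       # parallel fan-out, no duplicate work
        scored = pool.map(evaluate, [(position, m) for m in moves])
    return max(scored, key=lambda s: s[1])[0]         # serial again: pick the winner

if __name__ == "__main__":
    print(best_move("startpos"))

However wide the pool gets, the serial steps at the top and bottom are exactly the Amdahl limit mentioned earlier.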
In the end, the workloads that CPUs will continue to run effectively in the future are code that tends to be more branch-dependent, or N+1 problems. Problems that are easily parallelized will be accelerated using accelerators like GPUs, fixed-function hardware like Quick Sync, or wide SIMD like AVX-512 (which can be widened even further).
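As a rough illustration of that divide (NumPy standing in for "wide SIMD", the stateful loop standing in for branch-dependent code; the data and thresholds are arbitrary):

import numpy as np

data = np.random.rand(100_000)

# Easily parallelized: one big data-parallel reduction, a natural fit
# for GPU offload or wide SIMD such as AVX-512.
vector_result = float(np.sum(data[data > 0.5]))

# Branch-dependent: each iteration depends on state carried over from
# the previous one, so it resists vectorization and GPU offload.
def branchy(values):
    total, state = 0.0, 0
    for v in values:
        if state == 0 and v > 0.5:
            total += v
            state = 1
        elif state == 1 and v < 0.25:
            total -= v
            state = 0
        else:
            total += 0.1 * v
    return total

scalar_result = branchy(data)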
Most movie renderers remain CPU renderers today, because the GBs of texture + vertex data do not fit in a GPU's measly 20 GB or so of memory. Ex: the Moana scene is roughly 93 GB base + 130 GB for animations (www.disneyanimation.com/resources/moana-island-scene/). We all know raytracing is best on raytracing-accelerated GPUs, but all that special hardware doesn't matter if the scene literally doesn't even fit in its RAM.
And Moana was rendered over 5 years ago. Today's movies are bigger and more detailed.
And before you say it: yeah, I know about the NVIDIA DGX + NVSwitch. But that'd still require "remote access" of RAM if you were to distribute the scenes out to that architecture. It'd probably work, but I don't think any such renderer exists yet. There are some fun blog posts about people trying to make the CPU+GPU team work across movie-scale datasets like the Moana scene (the CPU acts as a glorified RAM box, passing the needed data to the GPU; the GPU renders the scene in small pieces that fit inside 8 GB or 16 GB chunks). But that's the stuff of research, not practice.
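Back-of-the-envelope version of that chunking idea (the scene sizes are the Moana figures quoted above; the 16 GB VRAM budget is an assumption):

import math

scene_gb = 93 + 130   # base geometry/textures + animation data (Moana figures)
vram_gb = 16          # assumed per-chunk GPU memory budget

chunks = math.ceil(scene_gb / vram_gb)
print(f"{scene_gb} GB scene -> at least {chunks} uploads of <= {vram_gb} GB each")

That's roughly 14 PCIe round trips before any re-fetching for rays that cross chunk boundaries, which is part of why the CPU-as-RAM-box approach is still research rather than production.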