Thursday, January 14th 2021
AMD Talks Zen 4 and RDNA 3, Promises to Offer Extremely Competitive Products
AMD is always in development mode: just as the company launches a new product, it is already gearing up for the next generation of devices. Just a few months ago, back in November, AMD launched its Zen 3 core, and today we get to hear about the next steps the company is taking to stay competitive and grow its product portfolio. From an AnandTech interview with Dr. Lisa Su, and a The Street interview with Rick Bergman, the EVP of AMD's Computing and Graphics Business Group, we have gathered information about AMD's plans for Zen 4 core development and RDNA 3 performance targets.
Starting with Zen 4, AMD plans to migrate to the AM5 platform, bringing the new DDR5 and USB 4.0 protocols. The current aim for Zen 4 is to be extremely competitive against rival products and to bring many IPC improvements. Just as Zen 3 combined many small advances in cache structures, branch prediction, and pipelines, Zen 4 aims to achieve a similar result at its debut. The mature x86 architecture offers little room for improvement in any single area; however, when advancements are made in many places, they add up quite well, as we saw with Zen 3's 19% IPC improvement over the previous-generation Zen 2 core. As the new core will use TSMC's advanced 5 nm process, there is a possibility of even more cores inside the CCX/CCD complexes. We expect to see Zen 4 sometime close to the end of 2021.

When it comes to RDNA 3, the company plans to offer an architecture with high performance-per-watt. Just as AMD improved the performance-per-watt of RDNA 2, it plans to do the same with RDNA 3, putting the efficiency of the architecture first while making it very high-performance for any possible task.
Sources:
AnandTech, The Street, via WCCFTech
62 Comments on AMD Talks Zen 4 and RDNA 3, Promises to Offer Extremely Competitive Products
Can you prove that the decoder is a limitation on x86 processors and that it is indeed a problem? If it were, it would have been impossible to increase IPC by widening the back end of these CPUs, which is exactly what AMD and Intel are doing.
On the other hand, ARM inherently needs higher decode throughput because, for example, it lacks complex addressing modes.
/sarcasm

Not really.
AMD expects clock speeds to decrease over the next nodes, so don't expect much there.
But even if materials allowed significantly higher clock speeds, the current bottlenecks would just become more apparent. Pretty much all non-SIMD workloads scale toward cache misses and branch mispredictions. Cache misses have a nearly fixed time cost (the memory latency), so increasing the CPU clock actually increases the relative cost of a cache miss. The cost of a branch misprediction in isolation is fixed in clock cycles, but it can cause secondary cache misses. My point is, if we don't reduce these issues, performance will be greatly hindered even if we are able to increase clock speeds significantly.
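A back-of-the-envelope model makes the point. The numbers below (1% miss rate, 80 ns memory latency, base CPI of 1) are illustrative assumptions, not measurements of any real chip:

```python
def runtime_seconds(instructions, cpi_base, miss_rate, mem_latency_ns, clock_ghz):
    # Memory latency is roughly fixed in nanoseconds, so the cache-miss
    # penalty measured in *cycles* grows with the clock speed.
    miss_penalty_cycles = mem_latency_ns * clock_ghz
    effective_cpi = cpi_base + miss_rate * miss_penalty_cycles
    total_cycles = instructions * effective_cpi
    return total_cycles / (clock_ghz * 1e9)

base = runtime_seconds(1e9, 1.0, 0.01, 80, 4.0)
fast = runtime_seconds(1e9, 1.0, 0.01, 80, 8.0)
print(base / fast)  # well under 2x, despite doubling the clock
```

With these assumptions, doubling the clock from 4 GHz to 8 GHz speeds the workload up by only about 14%, because the miss penalty in cycles doubles along with the clock.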
The changes we know Intel will bring are more of the same as both Sunny Cove and Zen 3 brought us: larger instruction windows, larger uop caches, larger register files, and possibly more execution ports. IPC gains are certainly still possible; I believe further generational jumps in the 20-30% range are achievable. But these changes will obviously have diminishing returns, and there are limits to how much parallelization can be extracted without rewriting the code and/or extending the ISA. I know Intel is working on something called "threadlets", which may solve a lot of the pipeline flushes and stalls. If that is successful, we could easily be looking at a 2-3x performance increase.

Geekbench is showcasing Apple's accelerators. It is a benchmark of various implementations, not something that translates into generic performance; Geekbench is useless for anything but showcasing special features.

You are right about this not using special acceleration, but you're missing that it is a "pure" ALU and FPU benchmark. There is nothing preventing an ARM design from having comparable ALU or FPU performance. It only stresses a small part of the CPU, while the front end and the caches are not stressed at all. Such benchmarks are fine for showcasing various strengths and weaknesses of architectures, but they don't translate into real-world performance. The best benchmark is always real workloads, and even if synthetic workloads are used, there should be a large variety of them, preferably algorithms rather than single unrealistic workloads.

Comparing IPC across ISAs means you don't even know what IPC means.
I don't believe there is anything preventing microarchitectures from Intel or AMD from having more than 4 decoders; decoding can be pipelined, and there is a good chance that the upcoming Sapphire Rapids / Golden Cove actually will.
In contrast, x86 chips have 4-wide decoders and only a roughly 350-entry out-of-order window.
-----
The "weakness" of the M1 chip is its absurd size. The M1 is far larger than the Zen chiplets but delivers only four high-performance cores. Apple is betting on HUGE cores for maximum IPC / single-threaded performance, above and beyond what x86 (either Intel or AMD) delivers.

Wait, why not?
It's a question of corporate will, not technical feasibility. You could make a 16-wide decoder if you wanted. The question is whether making the decoder 4x bigger is worth the 4x larger area (and yes, I'm pretty sure parallel decoding is an O(n) problem in terms of chip size; it's non-obvious, but I can discuss my proof if you're interested). It isn't necessarily 4x faster either (indeed, a 16-wide decoder wouldn't help at all in small loops that are fewer than 16 instructions long).
AMD is clearly a "many cores" kind of company. It pushed for more cores, instead of bigger cores, with Bulldozer and even Zen. AMD could have made a bigger core than Intel's for better IPC, but instead chose to go with more cores. All Apple has done is show that there's room for bigger cores.
Just having larger cores, with a larger front end to feed them, will increase IPC. There is still improvement to be made in branch prediction, and there are still optimizations possible in feeding the cores with data and instructions: prefetch algorithms to improve cache hit rates, cache management algorithms, and so on.
Just because Intel took baby steps with its tick-tock strategy doesn't mean we are anywhere near the end of IPC improvement. It's a matter of design choices: very large cores, more cores, more cache, etc. You need to balance everything and make the right choices to get the best overall performance.
CPU design is all about trade-offs.
But, as it stands right now, they have far more orders from existing customers than they can reasonably be expected to fulfill in a timely manner, which I guess is a somewhat good problem to have, in most respects...
I think this is wrong, look closer:
1800X - 2 March 2017
2700X - 19 April 2018, 413 days after the launch before
3700X - 7 July 2019, 444 days after the launch before
5600X - 5 November 2020, 487 days after the launch before
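The gap arithmetic above is easy to sanity-check with a few lines of Python, taking 7 July 2019 as the 3700X launch date:

```python
from datetime import date

# Launch dates from the list above (3700X: 7 July 2019).
launches = [
    ("1800X", date(2017, 3, 2)),
    ("2700X", date(2018, 4, 19)),
    ("3700X", date(2019, 7, 7)),
    ("5600X", date(2020, 11, 5)),
]

for (prev_name, prev_day), (name, day) in zip(launches, launches[1:]):
    # Subtracting two dates yields a timedelta; .days is the gap.
    print(f"{name}: {(day - prev_day).days} days after the {prev_name}")
```

This prints gaps of 413, 444, and 487 days, so the cadence really has been stretching out with each generation.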
Given that the latest Ryzen models are the most competitive of their time, and that AMD has never had such a hard time meeting demand as it does now, I'd be surprised if we see any major launch from AMD this year. Boring -XT models don't count.
The 7 nm production shortage would be one reason to move faster, but I'm not sure about 5 nm production capacity either.
Rocket Lake is supposed to launch within less than 12 months of Comet Lake, but it is also the first departure from Skylake, and I'd guess Intel has to speed things up right now (even if it still isn't below 14 nm).

The Pro versions are easier to find, although I have no info about availability where you live. Here in Europe they're available, though.
I'd guess that the 5000 APUs will be both OEM and retail. I remember last year AMD talked about launching some other retail APUs at a later date.
Welcome to TPU! :toast:
2. x86 instructions can decode into two RISC-like micro-ops (one for the AGU, one for the ALU), hence IPC comparisons are not one-to-one when comparing the x86 instruction set to a RISC ISA's atomic instructions.
Ryzen fetches instructions from 4 decoder units, or 4 to 8 instructions from the op cache. The op cache as an instruction source could be replaced by extra decoder units, but that would increase the bandwidth requirements on the L0/L1/L2 caches.
en.wikichip.org/wiki/intel/microarchitectures/coffee_lake
Intel Coffee Lake still has 5 x86 decoders.
www.extremetech.com/computing/318020-flaw-current-measurements-x86-versus-apple-m1-performance
However, I don't think the industry is doing it wrong. I agree that single-threaded benchmarks cannot fully utilize the cores from AMD and Intel, and are thus not a completely accurate way of showing per-core performance, while the M1 doesn't suffer from the same problem. Single-threaded software doesn't care about this, though (and usually neither does the end user), and as the article you linked also stated, this is actually a weakness of current x86 architecture cores.
If I had the choice, for the same money and total performance, between a core that can reach 100% of its potential performance in a single thread, and a core that needs more than one thread to reach that performance level, I'd choose the first every time. Many tasks still just don't scale to multiple threads, and would thus simply run slower on the second option.
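Amdahl's law quantifies why the first option wins for such tasks: the serial fraction of a workload caps the total speedup no matter how many threads are available. A minimal sketch, with an assumed 50% parallelizable workload:

```python
def amdahl_speedup(parallel_fraction, n_threads):
    # Amdahl's law: overall speedup is limited by the serial fraction,
    # regardless of how many threads you throw at the problem.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)

# A task that is only 50% parallelizable never runs more than 2x faster,
# even with an effectively unlimited thread count.
for n in (2, 4, 1_000_000):
    print(n, amdahl_speedup(0.5, n))
```

So for the 50%-parallel task, a core that delivers its full performance on one thread beats a core that needs several threads to get there, because half the work can only ever run on one thread.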
This is why the M1 is such an achievement, and even though I do agree that single-threaded benchmarks don't show the full potential of the x86 cores, it doesn't matter for the most part. This is just how they perform in the real world. For multithreaded performance we've got multithreaded benchmarks, with their own set of explanations, gotchas, and peculiarities. Properties which, for the most part, also don't matter to software and the end user.
How would you go about setting up a test that would accurately compare both cores on equal ground?
Edit: which also directly reflects how they are actually used by end-users.
So even though 3D rendering is multithreaded, a lot of the "bone scripts" that 3D modelers write in Python (or whatever scripting language your 3D program supports) are single-thread bound. They could be written to be multithreaded, but much of this code is written by 3D artists, not necessarily expert programmers. Rigging, import/export scripts, game animations, etc. A lot of these things end up on a single thread.
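As a sketch of the pattern (with a made-up `solve_bone` standing in for a real per-bone rig calculation; no actual 3D package API is used here), the plain loop is what most artist-written scripts look like, while spreading the work over a process pool takes extra plumbing that rarely gets written:

```python
import math
from multiprocessing import Pool

def solve_bone(bone_id):
    # Stand-in for an expensive per-bone IK/constraint computation.
    return sum(math.sin(bone_id * i) for i in range(10_000))

if __name__ == "__main__":
    bones = range(64)

    # How most artist-written rigging/export scripts run: one core, one thread.
    serial = [solve_bone(b) for b in bones]

    # The same work spread across cores; identical results, more boilerplate.
    with Pool() as pool:
        parallel = pool.map(solve_bone, bones)

    assert serial == parallel
```

The `if __name__ == "__main__":` guard is required for `multiprocessing` on platforms that spawn fresh interpreters, which is exactly the kind of detail a non-programmer script author is unlikely to reach for.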
People want both: single-threaded and multithreaded power.