Thursday, January 14th 2021

AMD Talks Zen 4 and RDNA 3, Promises to Offer Extremely Competitive Products

AMD is perpetually in development mode: as soon as the company launches a new product, it is already gearing up for the next generation of devices. Just a few months ago, back in November, AMD launched its Zen 3 core, and today we get to hear about the next steps the company is taking to stay competitive and grow its product portfolio. From AnandTech's interview with Dr. Lisa Su and The Street's interview with Rick Bergman, the EVP of AMD's Computing and Graphics Business Group, we have gathered information about AMD's plans for Zen 4 core development and RDNA 3 performance targets.

Starting with Zen 4, AMD plans to migrate to the AM5 platform, bringing support for the new DDR5 and USB 4.0 protocols. The aim for Zen 4 is to be extremely competitive against rival products and to bring substantial IPC improvements. Just as Zen 3 combined many small advances in cache structures, branch prediction, and pipelines, Zen 4 is aiming to achieve a similar result at its debut. The mature x86 architecture offers little room for improvement in any single area; however, when advancements are made in many places, they add up quite well, as we saw with the 19% IPC improvement of Zen 3 over the previous-generation Zen 2 core. As the new core will use TSMC's advanced 5 nm process, there is a possibility of even more cores inside the CCX/CCD complexes. We are expecting to see Zen 4 sometime close to the end of 2021.
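As a rough illustration of how several small per-subsystem gains compound into a figure like 19%, consider the sketch below; the individual percentages are hypothetical and only the compounding arithmetic is the point.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical per-subsystem IPC gains (not AMD's actual numbers):
       e.g. cache structures, branch prediction, pipeline tweaks. */
    const double gains[] = { 0.06, 0.05, 0.04, 0.03 };
    const int n = sizeof gains / sizeof gains[0];

    double total = 1.0;
    for (int i = 0; i < n; i++)
        total *= 1.0 + gains[i];   /* small gains multiply, not add */

    /* Prints ~1.19x, roughly the uplift Zen 3 showed over Zen 2. */
    printf("Compound IPC uplift: %.2fx\n", total);
    return 0;
}
```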
When it comes to RDNA 3, the company plans to offer an architecture with high performance-per-watt. Just as AMD improved the performance-per-watt of RDNA 2, it plans to do the same with RDNA 3, making efficiency the architecture's first priority while keeping performance high enough for any possible task.
Sources: AnandTech, The Street, via WCCFTech

62 Comments on AMD Talks Zen 4 and RDNA 3, Promises to Offer Extremely Competitive Products

#26
Patr!ck
Yeah, interesting. My eyes are on Alder Lake coming later this year.
Posted on Reply
#27
HABO
Vya DomusProbably there is some of that; however, Geekbench itself is without question tuned for their chips.



No it doesn't, show me an example where an x86 processor is decode bound.



There isn't a single reason to believe they would generate the same number of instructions. And that wouldn't even mean anything; the problem is with the optimizations that the compilers themselves apply.



IPC fluctuates according to architecture; in fact, it even fluctuates within the same architecture. A processor never has a constant IPC; that's quite literally impossible.

You can come up with an "average IPC" but that wouldn't mean much either.
M1 has 50% more IPC than the latest Ryzen, can you see that bound? There is no way the next Ryzen will have more than 4 decoders, and no way it will have 50% more IPC, can you see that bound? This is the main change M1 has.
Posted on Reply
#28
Vya Domus
HABOM1 has 50% more IPC than the latest Ryzen, can you see that bound? There is no way the next Ryzen will have more than 4 decoders, and no way it will have 50% more IPC, can you see that bound? This is the main change M1 has.
I don't think you understood the question.

Can you prove that the decoder is a limitation on x86 processors and that it is indeed a problem? If it were, it would have been impossible to increase IPC by widening the back end of these CPUs, which is exactly what AMD and Intel have been doing.

On the other hand, ARM inherently needs higher decode throughput, partly because of its lack of complex addressing modes.
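To illustrate that point, here is a minimal sketch; the assembly in the comments is typical compiler output shown for illustration, not captured from any specific compiler version.

```c
#include <stdint.h>

/* One C statement, two very different instruction counts. */
int64_t scaled_load(const int64_t *base, int64_t idx) {
    /* x86-64 can fold base + idx*8 + 16 into a single load:
     *     mov rax, [rdi + rsi*8 + 16]
     * A load/store RISC ISA typically needs separate steps, e.g. AArch64:
     *     add  x8, x0, x1, lsl #3   ; base + idx*8
     *     ldr  x0, [x8, #16]        ; load
     * Same work, but more instructions to decode per unit of work,
     * hence the pressure for wider decode on ARM designs.
     */
    return base[idx + 2];
}
```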
Posted on Reply
#29
olymind1
With a price increase of only +50 $/€ MSRP, our new 6-core CPUs will be available at only 400 $/€

/sarcasm

Not really.
Posted on Reply
#30
efikkan
HD64GI cannot "see" any big IPC improvements from now on. 10% max from gen to gen is my prediction. Zen 3 made a huge jump. Clocks and efficiency will determine the progress until new transistor materials allow big clock jumps (graphene, anyone?).
Upcoming Sapphire Rapids / Golden Cove is a major architectural overhaul, greatly extending the CPU front-end beyond Sunny Cove. I would be surprised if AMD didn't attempt something comparable.

AMD expects clock speeds to decrease over the next nodes, so don't expect much there.

But even if materials allowed significantly higher clock speeds, the current bottlenecks would just become more apparent. Pretty much all non-SIMD workloads scale towards cache misses and branch mispredictions. Cache misses have a nearly fixed time cost (the memory latency), so increasing the CPU clock actually increases the relative cost of a cache miss. The cost of a branch misprediction in isolation is fixed in clock cycles, but it can cause secondary cache misses. My point is, if we don't reduce these issues, performance will be greatly hindered even if we are able to increase clock speeds significantly.
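A quick back-of-the-envelope sketch of that effect, assuming a typical ~80 ns DRAM round trip (the exact figure varies by platform and is only an assumption here):

```c
#include <stdio.h>

int main(void) {
    /* Memory latency is roughly fixed in nanoseconds, so a cache miss
       costs more *cycles* as the core clock rises. */
    const double miss_ns = 80.0;                  /* assumed DRAM latency */
    const double clocks_ghz[] = { 4.0, 5.0, 8.0 };
    const int n = sizeof clocks_ghz / sizeof clocks_ghz[0];

    for (int i = 0; i < n; i++) {
        double cycles_lost = miss_ns * clocks_ghz[i]; /* ns * GHz = cycles */
        printf("%4.1f GHz: one miss stalls ~%.0f cycles\n",
               clocks_ghz[i], cycles_lost);
    }
    return 0;
}
```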

The changes we know Intel will bring are more of the same as what both Sunny Cove and Zen 3 brought us: larger instruction windows, larger uop caches, larger register files, and possibly more execution ports. IPC gains are certainly still possible; I believe further generational jumps in the 20-30% range are still achievable. But these changes will obviously have diminishing returns, and there are limits to how much parallelization can be extracted without either rewriting the code and/or extending the ISA. I know Intel is working on something called "threadlets", which may solve a lot of the pipeline flushes and stalls. If that is successful, we could easily be looking at a 2-3x performance increase.
ncrsSo you're saying Apple has magic technology that makes general purpose code run on fixed-function hardware accelerators? Or did they tune their chip specifically for GeekBench? ;)
Geekbench is showcasing Apple's accelerators. This is a benchmark of various implementations, not something that translates into generic performance. Geekbench is useless for anything but showcasing special features.
DrediIPC is instructions per clock. If the performance of the new M1 in some application or benchmark, say darktable or SPEC, is the same as on the latest intel tiger lake processor, the amount of instructions in the executables is similar and the clock speed of M1 is less than that of the tiger lake processor, it means that the IPC is higher on M1 than on tiger lake.

What am I missing here?

edit: for example take these results:

<snip>
And to clarify, the above test makes no use of fixed function accelerators.
You are right about this not using special acceleration, but you're missing that this is a "pure" ALU and FPU benchmark. There is nothing preventing an ARM design from having comparable ALU or FPU performance. This only stresses a small part of the CPU, while the front-end and the caches are not stressed at all. Such benchmarks are fine to showcase various strengths and weaknesses of architectures, but they don't translate into real-world performance. The best benchmark is always real workloads, and even if synthetic workloads are used, there should be a large variety of them, and preferably algorithms, not single unrealistic workloads.
HABOM1 has 50% more IPC than the latest Ryzen, can you see that bound? There is no way the next Ryzen will have more than 4 decoders, and no way it will have 50% more IPC, can you see that bound? This is the main change M1 has.
Comparing IPC across ISAs means you don't even know what IPC means.

I don't believe there is anything preventing microarchitectures from Intel or AMD from having more than 4 decoders; you can pipeline many of them, and there is a good chance that the upcoming Sapphire Rapids / Golden Cove actually will.
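For reference, the classic "iron law" of processor performance makes the cross-ISA objection concrete:

$$\text{Execution time} \;=\; \underbrace{\frac{\text{instructions}}{\text{program}}}_{\text{ISA-dependent}} \times \underbrace{\frac{\text{cycles}}{\text{instruction}}}_{1/\text{IPC}} \times \underbrace{\frac{\text{seconds}}{\text{cycle}}}_{1/f}$$

Since the instruction count for the same program differs between ISAs, a higher IPC on one ISA does not by itself imply higher performance than another.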
Posted on Reply
#31
dragontamer5788
TheLostSwedeSorry, but that's not IPC. The actual CPU cores aren't that fast; the reason these SoCs keep up is the help of lots and lots of accelerators that speed up tasks where the CPU cores are too slow to keep up. By utilising task-specific co-processors (as do almost all ARM CPUs), it's possible to offer good system performance without having great IPC.
The M1 chips have an 8-wide decoder on the front end and an out-of-order window over 700 entries deep.

In contrast, x86 chips have 4-wide decoders with only a 350-ish entry out-of-order window.

-----

The "weakness" of the M1 chip is the absurd size. The M1 is far larger than the Zen chiplets but only delivers 4-cores. Apple is betting on HUGE cores for maximum IPC / single threaded performance, above and beyond what x86 (either Intel or AMD) delivers.
HABOThere is no way the next Ryzen will have more than 4 decoders
Wait, why not?

It's a question of corporate will, not technical feasibility. You can make a 16-wide decoder if you want to. The question is whether making the decoder 4x bigger is worth the 4x larger area (and yes, I'm pretty sure that parallel decoding is an O(n) problem in terms of chip size. It's non-obvious, but I can discuss my proof if you're interested). It isn't necessarily 4x faster either (indeed, a 16-wide decoder wouldn't be helpful at all for small loops that are less than 16 instructions long).
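A toy sketch of the boundary problem behind that area cost, using a made-up 1-to-3-byte ISA rather than real x86 encodings:

```c
#include <stdio.h>

/* Toy variable-length ISA: the first byte encodes the instruction
 * length (1..3 bytes). Real x86 is far messier, but the dependency
 * is the same: you only learn where instruction N+1 starts after
 * sizing instruction N. */
static int insn_len(unsigned char first_byte) {
    return 1 + (first_byte % 3);  /* hypothetical length rule */
}

int main(void) {
    unsigned char code[] = { 0x03, 0x10, 0x22, 0x05, 0x31, 0x40 };
    int n = sizeof code;

    /* Serial decode is an inherently sequential pointer chase. A wide
     * decoder sidesteps this by tentatively decoding at *every* byte
     * offset in parallel and keeping only the offsets that turn out
     * to be real boundaries; that speculative hardware grows with
     * decode width, which is the area cost discussed above. */
    for (int pc = 0; pc < n; pc += insn_len(code[pc]))
        printf("instruction starts at offset %d, length %d\n",
               pc, insn_len(code[pc]));
    return 0;
}
```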

AMD is clearly a "many core" kind of company. They pushed for more cores, instead of bigger cores, with Bulldozer and even with Zen. AMD could have made a bigger core than Intel for better IPC, but instead chose to go with more cores. All Apple has done is show that there's room for bigger cores.
Posted on Reply
#32
mechtech
Step 1 - get ample supply on shelves
Posted on Reply
#33
Punkenjoy
There are plenty of ways to improve IPC in the future; we are very far from perfection right now.

Just having larger cores, with a larger front end to feed them, will increase IPC. There are still improvements to be made in branch prediction, in feeding the cores with data and instructions, in prefetch algorithms to improve cache hit rates, in cache management algorithms, etc.

It's not because Intel took baby steps with their tick-tock strategy that we are anywhere close to the end of IPC improvements. It's a matter of design choices: very large cores, more cores, more cache, etc. You need to balance everything and make the right choices to get the most overall performance.

CPU design is all about trade-offs.
Posted on Reply
#34
bonehead123
HD64GTo sum it up, TSMC's manufacturing progress and capacity will be the limiting factor for the PC sector's performance progress.
Although I agree on both counts, I believe capacity is the biggest limitation right now and for the foreseeable future, at least until they can get some more fabs built & producing. I don't think that their progress is an issue, since they have publicly stated their intentions to move forward with 5, 3 & 2nm nodes as fast as possible.

But, as it stands right now, they have way more orders from existing customers than they can reasonably be expected to fulfill in a timely manner, which I guess is a somewhat good problem to have in most respects....
Posted on Reply
#35
Jomale
Zen 4 sometime close to the end of 2021

I think this is wrong, look closer:
Posted on Reply
#36
Minus Infinity
If x86 can only support 4 decoders, how did Skylake manage to have 5 decoders and why did they get rid of the 5th?
Posted on Reply
#37
Space Lynx
Astronaut
None of this matters, they won't be in stock anyway.
Posted on Reply
#38
SL2
Patr!ckZen 4 sometime close to the end of 2021

I think this is wrong, look closer:
Yeah, AMD has been launching its Ryzen models more than 12 months apart. I just looked at the review dates here, as they are usually published when the NDA ends (I'd assume).

1800X - 2 March 2017

2700X - 19 April 2018, 413 days after the launch before

3700X - 7 July 2019, 445 days after the launch before

5600X - 5 November 2020, 487 days after the launch before


Given that the latest Ryzen models are the most competitive for their time, and that AMD has never had such a hard time meeting demand as they do now, I'd be surprised if we see any major launch from AMD this year. Boring -XT models don't count.

The 7 nm production shortage would be one reason to move faster, but I'm not sure about 5 nm production capacity anyway.
Posted on Reply
#39
Space Lynx
Astronaut
MatsYeah, AMD has been launching its Ryzen models more than 12 months apart. I just looked at the review dates here, as they are usually published when the NDA ends (I'd assume).

1800X - 2 March 2017

2700X - 19 April 2018, 413 days after the launch before

3700X - 7 July 2019, 445 days after the launch before

5600X - 5 November 2020, 487 days after the launch before


Given that the latest Ryzen models are the most competitive for their time, and that AMD has never had such a hard time meeting demand as they do now, I'd be surprised if we see any major launch from AMD this year. Boring -XT models don't count.

The 7 nm production shortage would be one reason to move faster, but I'm not sure about 5 nm production capacity anyway.
Agree with this 100%, companies at some point have to milk their leading products, because eventually node shrinks just won't cut it
Posted on Reply
#40
HueSplat
Would love a new APU that isn't OEM only.
Posted on Reply
#41
SL2
lynx29Agree with this 100%, companies at some point have to milk their leading products, because eventually node shrinks just won't cut it
Besides, developing new generations costs a lot of money, shrink or no shrink. I'd guess launching less than 12 months apart isn't feasible in the long run, although there are exceptions.
Rocket Lake is supposed to launch within less than 12 months of Comet Lake, but that's also the first departure from Skylake, and I'd guess Intel has to speed things up right now (even if it still isn't below 14 nm).
fiftofarWould love a new APU that isn't OEM only.
The Pro versions are easier to find, although I have no info for where you live. Here in Europe they're available, though.

I'd guess that the 5000 APUs will be both OEM and retail. I remember that last year AMD was talking about launching some other retail APUs at a later date.
Welcome to TPU! :toast:
Posted on Reply
#42
ValenOne
DrediYou are incorrect in your assumption. The M1 performs as well as intel’s fastest in most single threaded tasks even without the accelerators. Read the anandtech article about it: www.anandtech.com/show/16252/mac-mini-apple-m1-tested
What the accelerators do enable is truly excellent performance per watt in select use cases like watching videos or video calls, or encoding stuff.
For the M1, Apple added hardware features to improve x86 translation; in that sense it's approaching the x86 post-RISC hybrid designs.
HABOM1 has 50% more IPC than the latest Ryzen, can you see that bound? There is no way the next Ryzen will have more than 4 decoders, and no way it will have 50% more IPC, can you see that bound? This is the main change M1 has.
1. M1 has faster on-chip memory to support wider instruction issue rates. Extra decoders can be gimped when memory bandwidth is the limiting factor.

2. An x86 instruction can generate two RISC-style instructions, one involving the AGU and one the ALU, hence IPC comparisons are not 1-to-1 between the x86 instruction set and a RISC's atomic instruction set.

Ryzen fetches its instructions from 4 decoder units, or 4 to 8 instructions from the op cache. The op cache as an instruction source could be replaced by extra decoder units, but that would increase the bandwidth required from the L0/L1/L2 caches.
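A small illustration of point 2; the micro-op split in the comments is the usual textbook decomposition, not vendor-documented behavior for any specific core:

```c
#include <stdint.h>

/* One x86 instruction can carry two RISC-style units of work. */
int64_t sum_array(const int64_t *a, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; i++) {
        /* The add below typically compiles to a single instruction
         * with a memory operand, e.g.:
         *     add rax, [rdi + rcx*8]
         * which the core cracks into two micro-ops internally:
         *     1. AGU/load: tmp <- mem[rdi + rcx*8]
         *     2. ALU:      rax <- rax + tmp
         * A load/store RISC ISA expresses the same work as two
         * architectural instructions, so raw instructions-per-clock
         * counts different amounts of work on each side. */
        s += a[i];
    }
    return s;
}
```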
Minus InfinityIf x86 can only support 4 decoders, how did Skylake manage to have 5 decoders and why did they get rid of the 5th?
en.wikichip.org/wiki/intel/microarchitectures/coffee_lake
Intel Coffee Lake still has 5 x86 decoders.
Posted on Reply
#43
thesmokingman
MathraghGenerally in the (review) industry, IPC is used roughly to mean "average amount of work done per clock cycle", at least AFAIK. It is in that spirit that I was using the term, and I assumed you were as well. If you want to limit the use of the term "IPC" to just the actual instructions a CPU core on average can decode and process per clock, then that's fine with me :) Not sure which use is best for the discussion at hand, however.

Whichever way you look at it, seeing the M1 run software that's not even compiled for the architecture, and doing it this quickly, shows me that increasing the processing per clock at least isn't an impossibility. x86 makers might need to further virtualize their decoding hardware/stack to reach that state, however.

PS: Regarding the single-core performance, decoders and instructions, does your view somewhat relate to this story I just found? Exclusive: Why Apple M1 Single "Core" Comparisons Are Fundamentally Flawed (With Benchmarks) (wccftech.com)
The industry is doing it wrong or completely missing the nuance. You cannot compare single-threaded performance directly because of the architectural differences. Either way it will be unfair to one or the other, but in the race to get those clicks, the truth falls by the wayside.

www.extremetech.com/computing/318020-flaw-current-measurements-x86-versus-apple-m1-performance
Posted on Reply
#44
Mathragh
thesmokingmanThe industry is doing it wrong or completely missing the nuance. You cannot compare single-threaded performance directly because of the architectural differences. Either way it will be unfair to one or the other, but in the race to get those clicks, the truth falls by the wayside.

www.extremetech.com/computing/318020-flaw-current-measurements-x86-versus-apple-m1-performance
That article did help me understand the actual difference you're trying to convey, thanks.
However, I don't think the industry is doing it wrong. I agree that single-threaded benchmarks cannot fully utilize the cores from AMD and Intel, and are thus not a completely accurate way of showing per-core performance, while the M1 doesn't suffer from the same problem. Single-threaded software doesn't care about this though (and usually neither does the end user), and as the article you linked also stated, this is actually a weakness of current x86 architecture cores.

If I had the choice, for the same money and total performance, between a core that can reach 100% of its potential performance in a single thread, or a core that needs more than one thread to reach that performance level, I'd choose the first every time. Many tasks still just don't scale to multiple threads, and would thus simply run slower on the second option.

This is why the M1 is such an achievement, and even though I do agree that single-threaded benchmarks don't show the full potential of the x86 cores, it doesn't matter for the most part. This is just how they perform in the real world. For multithreaded performance we've got multithreaded benchmarks, with their own set of explanations, gotchas, and peculiarities. Properties which, for the most part, also don't matter to software and the end user.
Posted on Reply
#45
thesmokingman
MathraghThat article did help me understand the actual difference you're trying to convey, thanks.
However, I don't think the industry is doing it wrong. I agree that single-threaded benchmarks cannot fully utilize the cores from AMD and Intel, and are thus not a completely accurate way of showing per-core performance, while the M1 doesn't suffer from the same problem. Single-threaded software doesn't care about this though (and usually neither does the end user), and as the article you linked also stated, this is actually a weakness of current x86 architecture cores.

If I had the choice, for the same money and total performance, between a core that can reach 100% of its potential performance in a single thread, or a core that needs more than one thread to reach that performance level, I'd choose the first every time. Many tasks still just don't scale to multiple threads, and would thus simply run slower on the second option.

This is why the M1 is such an achievement, and even though I do agree that single-threaded benchmarks don't show the full potential of the x86 cores, it doesn't matter for the most part. This is just how they perform in the real world. For multithreaded performance we've got multithreaded benchmarks, with their own set of explanations, gotchas, and peculiarities. Properties which, for the most part, also don't matter to software and the end user.
That's not real world though. Real world would be like setting up an encoding job, timing it, and seeing which finishes first. These tests are very flawed because they ignore architectural differences that come into play, making said tests irrelevant.
Posted on Reply
#46
Mathragh
thesmokingmanThat's not real world though. Real world would be like setting up an encoding job, timing it, and seeing which finishes first. These tests are very flawed because they ignore architectural differences that come into play, making said tests irrelevant.
Sorry, maybe I should've been more specific. By real world in this case I mean real-world usage of the CPU core in its day-to-day functioning, simulated by running benchmarks that try to replicate some aspect of that real-world usage (e.g., encoding a movie, converting some images, running down a decision tree, simulating some physics, running a game).

How would you go about setting up a test that would accurately compare both cores on equal ground?
Edit: which also directly reflects how they are actually used by end-users.
Posted on Reply
#47
Vya Domus
MathraghSingle threaded software doesn't care about this though
There is no such thing as single threaded software these days, practically everything is written to use multiple threads.
Posted on Reply
#48
thesmokingman
MathraghSorry, maybe I should've been more specific. By real world in this case I mean real-world usage of the CPU core in its day-to-day functioning, simulated by running benchmarks that try to replicate some aspect of that real-world usage (e.g., encoding a movie, converting some images, running down a decision tree, simulating some physics, running a game).

How would you go about setting up a test that would accurately compare both cores on equal ground?
Edit: which also directly reflects how they are actually used by end-users.
Ok, you seem to be rather obsessed with cores and are still missing the point. You can't directly compare the cores, let me repeat that again, ya can't compare the cores. What we can do is compare how they perform real-world tasks, measure, and compare. This hasn't changed ever since people compared Macs to PCs decades ago. For example, check comparisons like Puget's.

Posted on Reply
#49
Mathragh
thesmokingmanOk, you seem to be rather obsessed with cores and are still missing the point. You can't directly compare the cores, let me repeat that again, ya can't compare the cores. What we can do is compare how they perform real-world tasks, measure, and compare. This hasn't changed ever since people compared Macs to PCs decades ago. For example, check comparisons like Puget's.

How odd, you're trying to make the exact point I had the feeling I was trying to make. I think something is getting lost in translation here. Oh well, I think we're actually mostly in agreement, so I'll leave it at that for now.
Posted on Reply
#50
dragontamer5788
Vya DomusThere is no such thing as single threaded software these days, practically everything is written to use multiple threads.
Pretty much all GUI applications run their scripts etc. on the main thread, though.

So even though 3d rendering is multithreaded, a lot of the "bone scripts" that 3d modelers write in Python (or whatever scripting language your 3d program supports) are single-thread bound. They could be written multithreaded, but these are 3d artists who are writing a lot of this stuff, not necessarily expert programmers. Rigging, import/export scripts, game animations, etc. etc. A lot of these things end up on a single thread.

People want both: single thread and multithreaded power.
Posted on Reply