Tuesday, September 29th 2020

First Signs of AMD Zen 3 "Vermeer" CPUs Surface, Ryzen 7 5800X Tested

AMD is preparing to launch its next generation of desktop CPUs, codenamed Vermeer and based on the new Zen 3 core. On October 8th, AMD will hold a presentation to unveil the latest advancements for its desktop platform. The new CPUs will be branded as the 5000 series, skipping the 4000 series name that would logically follow the previous 3000 series generation. The new Zen 3 core should bring modest IPC gains. It will be manufactured on TSMC's 7 nm+ node, which offers further improvements in power efficiency and transistor density.

Today, thanks to the popular hardware leaker TUM_APISAK, we have the first benchmark of AMD's upcoming Ryzen 7 5800X, compared against Intel's latest and greatest, the Core i9-10900K. The AMD processor is an eight-core, sixteen-thread model, while the Intel processor is a 10C/20T part. We do not know the final clocks of the AMD CPU, and since an engineering sample was likely used, retail chips could perform even better. Below you can see the performance of the CPU and how it compares to Intel. Going by these numbers, AMD could possibly become the new gaming king, as the results are very close to Intel's. The average batch result for the Ryzen 7 5800X was 59.3 FPS, and its CPU framerate was 133.6 FPS. Intel's best averaged 60.3 FPS, with a CPU framerate of 114.8 FPS. Both systems were tested with NVIDIA GeForce RTX 2080 GPUs.
Source: @TUM_APISAK (Twitter)

82 Comments on First Signs of AMD Zen 3 "Vermeer" CPUs Surface, Ryzen 7 5800X Tested

#51
Turmania
Just waiting for an 8-core, 16-thread CPU that can boost to a minimum of 5 GHz on all cores and stay there on air cooling without an overclock. I would prefer it to be AMD, since they have the performance advantage per clock. Until then, I see no point in upgrading, at least for my main system.
Posted on Reply
#52
Rahnak
SteevoI remember the debate, as no one uses 720 unless we are comparing low power CPUs in tablets.

So feel free to use a resolution that's unused for a comparison if you feel better about it. But it's like comparing which jet fighter is better at being a submarine. Or which sports car does best off-road. Or which network switch makes the best cricket bat.
If you remember the debate, then surely you remember the reason why tests at lower resolutions are relevant. And again, I said 1080p and 1440p.
SteevoAgain, you addressed one point, that is meaningless. Any thoughts on core counts, memory latency, power consumption? Typically AMD gets you more overall performance for the dollar.
I addressed the only point relevant in the article. It's a gaming benchmark. At 4K. Which says very little of CPU performance and that was my point. It doesn't say anything about Zen 3's memory latency, power consumption or prices, so I have no opinion on those.
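To make the GPU-limit point concrete, here is a minimal C++ sketch. It assumes a deliberately simplified model in which the delivered frame rate is capped by the slower of the CPU and the GPU; `deliveredFps` and the 60 FPS GPU ceiling are hypothetical, chosen only to resemble the article's numbers, and this is not how the benchmark itself computes its results.

```cpp
#include <algorithm>

// Simplified bottleneck model (an assumption for illustration only):
// the frame rate you actually see is limited by the slower stage.
double deliveredFps(double cpuFps, double gpuFps) {
    return std::min(cpuFps, gpuFps);
}
```

Under that assumption, `deliveredFps(133.6, 60.0)` and `deliveredFps(114.8, 60.0)` both return 60.0, which is why a CPU lead can vanish in a GPU-bound 4K run.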
Posted on Reply
#53
ZoneDymo
Freaky_Snuke*RDNA 1 and Zen 3 naming scheme consumer confusion incoming*
But by the time those cpus come out RDNA 1 is gonna be mostly irrelevant anyway.

If that turns out to be true an all AMD truly high-end gaming rig might become a reality for the first time since like... decades this end of year.
Well in fairness an Intel or Nvidia truly high end gaming rig has never been a reality.
Posted on Reply
#54
RandallFlagg
RahnakThe fact that the benchmark was run at 4K rather than 1440p or 1080p is a little suspicious. And the fact that while having much higher cpu frames, it was still marginally behind in actual framerate in 2 out of 3.
That - why run a game at 4k as a CPU test as that gets GPU limited - and also that Ashes was developed in partnership with AMD, and originally ran much better on R9 290s than on any of that generation's Nvidia cards.

Then you have that the 10900k only lost in synthetic 'cpu framerate', it won in 2 out of 3 on actual framerate (which is what you'd actually see)...

This really looks more like a planned marketing stunt than an objective benchmark to me. We will know in a few weeks either way.
Posted on Reply
#55
InVasMani
DemonicRyzen666monolithic die
www.pcgamesn.com/amd/ryzen-5-4650g-benchmark-3600

www.tomshardware.com/reviews/amd-ryzen-7-pro-4750g-renoir-review/5


Lower latency


www.techpowerup.com/269223/amd-ryzen-7-4700ge-memory-benchmarked-extremely-low-latency-explains-tiny-l3-caches

None of it seems to do anything for Ryzen.
Yeah, it's an odd discrepancy at first glance, 16GB vs 32GB. It would seem that TUM_APISAK might've chosen that comparison to show AMD's performance with a higher density module in play, to not only highlight the higher performance of the AMD chip, but also give a glimpse into memory latency playing a role in it. The highest density RAM modules often require looser latency, which could be what is being represented here. If the performance advantage of the new Ryzen chip portrayed here is coming from the larger RAM density, that would be the worst case scenario and a bit unlikely, but with a limited number of benchmarks to compare between both chips paired with that GPU, that could perhaps be the case. This could simply be the closest comparison available to the leaker at present; tough to say.
RandallFlaggThat - why run a game at 4k as a CPU test as that gets GPU limited - and also that Ashes was developed in partnership with AMD, and originally ran much better on R9 290s than on any of that generation's Nvidia cards.

Then you have that the 10900k only lost in synthetic 'cpu framerate', it won in 2 out of 3 on actual framerate (which is what you'd actually see)...

This really looks more like a planned marketing stunt than an objective benchmark to me. We will know in a few weeks either way.
My take on it is this: 4K is actually more CPU-intensive than 1080p, but it's harder and less exciting to benchmark and account for. It would be interesting perhaps to place a 30FPS/45FPS/60FPS GPU limit and do some PhysX testing assigned to the CPU across 1080p up through 8K, and see what the scaling ends up like and whether it's linear or non-linear. I don't see how it could be linear; it seems it would vary and fluctuate a lot depending on the type of scene. It would be rather insightful and interesting to see which things present more bottlenecks in the CPU design for PhysX as well. Seeing just how much multi-core performance impacts PhysX would be cool as well; that might show an upside to AMD's design if heavy use of PhysX can be exploited by developers. If there are advantages to the multi-core approach for stuff like PhysX, it just goes to show you AMD's approach should only continue to blossom further in those areas moving forward, especially since Intel has followed suit in order to try to keep pace with it. If anything that's a clear indicator that Intel knows the vital importance of the multi-core design approach, and if they had simply stuck with a quad core they'd already be left in the dust. In fact I want to see how Intel's chips perform limited to 4c/8t versus AMD's latest Ryzen chips; let's just see where Intel would be if they didn't grudgingly glue sh*t together at 14nm+++++++++++++++ today because of AMD.
Posted on Reply
#56
Bansaku
CmdrLawWatching this intently.



Did a build for a friend recently, they wanted to go Intel and the 10850K OCing very easily @5.3 All core on 10 cores was a hell of an incentive to switch back to blue.
Does your friend pay their own power bill? Because at that clock speed the CPU is pulling well over 300W! And at that speed, what does it REALLY do for his gaming experience? Gaming @ 1440P/4K 60Hz I saw little to no performance difference between my old i7 3770K and my new 3700X, despite 4x the benchmark scores.
Posted on Reply
#57
arbiter
InVasManiYeah it's a odd discrepancy at first glance 16GB vs 32GB. It would seem that TUP APISAK might've chosen that comparison to show AMD's performance with a higher density module in play to not only highlight the higher performance of the AMD chip, but also glean into memory latency playing a role with it. The highest density ram modules often require looser latency which could what is being represented here. If the performance advantages on the new Ryzen chip being portrayed here is coming from the larger ram density that would be the worst case scenario and a bit unlikely, but with a limited amount of benchmarks to compare between both chips paired with that GPU module could perhaps be the case. This could simply be the closest comparison that could be compared at present by the leaker tough to say.
That assumes they wouldn't use the most expensive RAM for their side and the cheapest brand for the other. AMD has pulled shenanigans with their benchmark releases in the past, so I would say this isn't outside the realm of possibility. The benchmark doesn't tell us what timings were used or what MHz the RAM is running at.
Posted on Reply
#58
InVasMani
arbiterThat assumes they wouldn't use most expensive ram for their side and cheapest brand for the other. AMD has in the history pull shenanigans with their benchmark releases so i would say this isn't outside the realm of possible to happen. The benchmark doesn't tell us what timings used and mhz the ram is running at so.
It's an unofficial benchmark comparison, so it really doesn't matter at this point, and pricing for both could change at any point between now and launch. I get what you're alluding to, and yeah, obviously memory latency and density can skew perceptions, and AMD has pulled shenanigans, as have Intel and Nvidia. It's a common industry trend; they all do it. Wait until things are verified and the dust settles. I'm sure I'll be satisfied with Zen 3; to be honest it certainly can't be any worse than Zen 2, which itself isn't bad.
Posted on Reply
#59
Unregistered
efikkanJust because a piece of software runs better on one CPU doesn't mean it's optimized for it, it could be that a the hardware just handles the workload better due to resource balancing and advantages of that architecture, advantages which usually are hard or impossible to exploit directly from software.
You can target the strengths of one architecture and the program will run faster on it.
This guy made different workloads and ran them on a Phenom and an 8th gen i7; even though the Phenom is so old, it's still faster in some:
I find it hard to believe that game engines don't do that at least to some extent.
Posted on Reply
#60
RandallFlagg
Xex360You can target the strengths of one architure and the program will run faster on it.
This guy made different workloads and run them on a Phenom and i7 8th gen, even though the phenom is so ol it's still faster in some:
I find it hard to believe that game engines don't do that at least to some extent.
Yep, and it's also possible to do the reverse - design hardware to run specific instructions or even a specific sequence of instructions very quickly. You could target your CPU to a use case where you have multiple threads doing the exact same thing to different parts of a large data set where said threads did not need to interact with each others data set much.

For example, Cinebench.
Posted on Reply
#61
arbiter
RandallFlaggYep, and it's also possible to do the reverse - design hardware to run specific instructions or even a specific sequence of instructions very quickly. You could target your CPU to a use case where you have multiple threads doing the exact same thing to different parts of a large data set where said threads did not need to interact with each others data set much.

For example, Cinebench.
The game they used had direct AMD funding for a lot of it, and when you look at player charts that game only gets 60-70 players on average, so a game no one plays is really not a good metric to use. As the other guy said, you can code things for a certain CPU and get great results. Apple used to do that same thing back when they used PowerPC processors, to make them look better than x86 PCs.
Posted on Reply
#63
RandallFlagg
arbiterThe game they used had direct AMD funding for a lot of it and when you look at player charts that game only gets 60-70 players avg so really not good metric to use a game no one plays. As other guy said you can code things for a certain cpu and get great results. Apple used to do that same thing back when they used PowerPC processors to make them look better then PC x86 machines.
I know, I agree 100%. For people who know the history of Ashes (I was one of the pre-release buyers) it is one of the most suspect benchmarks. What was particularly embarrassing for AMD in regards to Ashes was how despite their partnership in creating the game and the use of the AMD Vulkan API, when Pascal (10xx series) came out they got obliterated in Ashes anyway.

Looking beyond the surface and clickbait article titles of this "leak" - if Zen 3 is so good, why is an AMD co-sponsored title being used at 4k for pre-release hype and still losing in actual FPS 2/3 of the time? And why are both recent leaks - one on 5700U a week ago and now this one on 5800X - for that *same* AMD sponsored title which very few play regularly? Why not use something a bit more mainstream at settings that don't go GPU limited? Hmmm.....
Posted on Reply
#64
InVasMani
Parallel single threading seems entirely plausible: phase the clock skew peaks and dips on two chips and synchronize oscillation switching between one and the other. You should get a 100% increase in performance with two chips like that in theory, but clock skew frequency oscillation is always in constant motion, so you move from peaks to dips; with the switching in mind to maximize both, you end up at 50% in the best case scenario, though the synchronizing and sequencing might not be 100% perfect, so it could be closer to 48%. I don't know if they can execute it perfectly in practice, but in theory it's definitely within the scope of possibilities. You can actually mimic that with a pair of music sequencers; it's functionally possible.

I mentioned the concept in the Intel bigLITTLE TPU thread not that far back: you can basically manipulate clock skew or duty cycles in a clever manner, in theory, to get more performance, in a similar fashion to what was done by MOS Technology with the SID chip, where arpeggios simulated playing chords with polyphony; it was a clever hardware trick at the time. It seems far-fetched and somewhat unimaginable to actually be applied, but innovation always is; you have to think outside the box or you'll always be stuck in a box.


This is a quadruple LFO; what is allegedly being done is a twin LFO. If you look at the intersection points, that's half a duty cycle of rising and falling voltages/frequencies. If you look at the blue and green or yellow and purple, they intersect perfectly. What's being done is switching at the intersection point, so you've got two peaks closer together and the base of the mountain, so to speak, isn't as far down. That's assuming this is in fact being done and put into practice by AMD. I see it within the oscilloscope of possibilities for certain. That's basically what DDR memory did in practice. The big question is whether they can pull it off within the dynamic complexity of software. Then again, why can't they!? I can't see why they can't divert it like a railroad track at that intersection point. That nets you a roughly 50% performance gain; with 4 chips the valley dips would be reduced further and the peaks would happen more routinely, and you'd end up with 100% more performance. I think that's what DDR5 is supposed to do on the data rate, actually, hence the phrase quad data rate.


Thinking about it further I really don't see a problem with the I/O die managing that type of load switching in real time quickly and the data would already be present in the CPU memory it's not like it gets flushed instantly. Yeah maybe it could become a bit of a materialized reality. If not now certainly later. I have to think AMD will incorporate a I/O for the GPU soon as well if they want to pursue multi-chip GPU's.
Posted on Reply
#65
PanicLake
arbiterThe game they used had direct AMD funding for a lot of it and when you look at player charts that game only gets 60-70 players avg so really not good metric to use a game no one plays. As other guy said you can code things for a certain cpu and get great results. Apple used to do that same thing back when they used PowerPC processors to make them look better then PC x86 machines.
Yes but if you compare the 5800X with the "old" 3800X, it is still a big improvement...
Crazy 4K Batch | Ryzen 7 5800X | Ryzen 7 3800X | Core i9-10900K
Normal | 167 fps | 125 fps | 136 fps
Medium | 135 fps | 111 fps | 119 fps
Heavy | 110 fps | 87 fps | 96 fps
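For reference, the relative gaps in the table can be worked out with a small C++ helper. `upliftPercent` is a hypothetical name introduced for this sketch; the inputs are simply the figures from the table above.

```cpp
#include <cmath>

// Percentage uplift of `newer` over `older`, rounded to one decimal place.
double upliftPercent(double newer, double older) {
    return std::round((newer / older - 1.0) * 1000.0) / 10.0;
}
```

With the table's values, the 5800X comes out about 33.6% (Normal), 21.6% (Medium) and 26.4% (Heavy) ahead of the 3800X.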
Posted on Reply
#66
RandallFlagg
PanicLakeYes but if you compare the 5800X with the "old" 3800X, it is still a big improvement...
Crazy 4K Batch | Ryzen 7 5800X | Ryzen 7 3800X | Core i9-10900K
Normal | 167 fps | 125 fps | 136 fps
Medium | 135 fps | 111 fps | 119 fps
Heavy | 110 fps | 87 fps | 96 fps
That should be "CPU Framerate" not "FPS".

If this were a car, what you are doing would be like calculating 0-60 time based on engine HP and car weight, while ignoring *actual* 0-60 time. No one does that. In real FPS Ryzen 3800X loses all 3 and 5800X loses 2 out of 3.
Posted on Reply
#67
arbiter
RandallFlaggThat should be "CPU Framerate" not "FPS".

If this were a car, what you are doing would be like calculating 0-60 time based on engine HP and car weight, while ignoring *actual* 0-60 time. No one does that. In real FPS Ryzen 3800X loses all 3 and 5800X loses 2 out of 3.
Even though all those numbers say AMD is faster, when you look at "avg (all batches)" Intel still wins the CPU frame rate and has a 5900 score vs 5800 for AMD, IN a benchmark that is known to favor AMD. So to me those numbers in general mean NOTHING. They need to use a benchmark that isn't slanted, instead of a game that is pretty much a glorified tech demo for their hardware.
Posted on Reply
#68
DemonicRyzen666
RandallFlaggThat should be "CPU Framerate" not "FPS".

If this were a car, what you are doing would be like calculating 0-60 time based on engine HP and car weight, while ignoring *actual* 0-60 time. No one does that. In real FPS Ryzen 3800X loses all 3 and 5800X loses 2 out of 3.
They already do that for cars when they build them; it's called estimated 0-60 times, and they do it with computer simulations.

Hell some cars are so fast they don't even do 0-60 mph anymore they do 0-100mph.
arbiterEven all those numbers say amd is faster but then you look at "avg (all batches)" Intel win's cpu frame rate still and has a 5900 score vs 5800 of amd IN a benchmark that is known to favor AMD. So to me those numbers in general mean NOTHING. They need to get a Benchmark that isn't slanted instead of a game that is pretty much a glorfied tech demo for their hardware.
How is it AMD glorified if intel is winning ?

Where do you see 5900x ?, this 5800X 8 core 16 thread vs 10 core 20 thread.

The game is supposed to be really good at using multiple threads; it even shows the Threadripper 3960X is quite good at it.
Posted on Reply
#69
efikkan
Xex360You can target the strengths of one architure and the program will run faster on it.
This guy made different workloads and run them on a Phenom and i7 8th gen, even though the phenom is so ol it's still faster in some:
I find it hard to believe that game engines don't do that at least to some extent.
Sure, individual instructions can be slightly faster or slower on various architectures. In my tests, I've seen some cases where Haswell is slower than Sandy Bridge, but in most cases it's faster. The problem here is that this is a benchmark of a single operation in a loop; it is a synthetic test case which will exaggerate the real world difference. The reason why he runs the loop 1.000.000.000.000 times is to get a measurable difference. Also, it's not like these operations are different alternatives to solve the same problem. It's not unlikely that you can find older architectures which can do certain simple operations like this faster, while modern architectures are optimized for saturating several execution ports and doing a mix of various types of operations. This is why such benchmarks can be very misleading.

When doing real optimization of code, it's common to benchmark whole algorithms or larger pieces of code to see the real world difference of different approaches. It's very rare that you'll find a larger piece of code that performs much better on Skylake and a competing alternative which performs much better on let's say Zen 2. Any difference that you'll find for single instructions will be less important than the overall improvements of the architecture. And it's not like there will be an "Intel optimization", Intel has changed the resource balancing for every new architecture, so has AMD.

Interestingly the sample code in that video scales poorly with many cores, but should be able to scale nearly linearly if the work queue is implemented smarter.
InVasManiParallel single threading seems entirely plausible phase the clock skew peaks and dips on two chips and synchronize oscillation switching between one and the other. <snip>
Instruction level parallelism is already heavily used, there is no need to spread the ALUs, FPUs, etc. across several cores, the distance would make a synchronization nightmare. We should expect future architectures to continue to scale their superscalar abilities. But I don't doubt that someone will find a clever way to utilize "idle transistors" in some of these by manipulating clock cycles etc.

The problem with superscalar scaling is keeping execution units fed. Both Intel and AMD currently have four integer pipelines. Integer pipelines are cheap (both in transistors and power usage), so why not double or quadruple them? Because they would struggle to utilize them properly. Both of them have been increasing instruction windows with every generation to try to exploit more parallelism, and Intel's next gen Sapphire Rapids/Golden Cove is allegedly featuring a massive 800 entry instruction window (Skylake has 224, Sunny Cove 352 for comparison). And even with these massive CPU front-ends, execution units are generally under-utilized due to branch mispredictions and cache misses. Sooner or later the ISA needs to improve to help the CPU, which should be theoretically possible, as the compiler has much more context than is passed on through the x86 ISA, as well as eliminating more branching.
Posted on Reply
#70
dragontamer5788
efikkanSooner or later the ISA needs to improve to help the CPU, which should be theoretically possible, as the compiler has much more context than is passed on through the x86 ISA, as well as eliminating more branching.
I'm not sure how much a compiler can help:

if(blah()){
    foo();   // runs when blah() returns true
} else {
    bar();   // runs otherwise
}

The above is the easy case. There's lots of pattern matching and heuristics that help the pipelines figure out if foo() needs to be shoved into the pipelines, or if bar() needs to be shoved into the pipelines (while calculating blah() in parallel).

Now consider the following instead:

for(int i=0; i<array.size(); i++){
    array[i]->virtualFunctionCall();   // call target depends on each element's runtime class
}


You simply can't "branch predict" the virtualFunctionCall() much better than what we're doing today. Today, there are ~4 or 5 histories stored into the Branch Target Buffer (BTB), so the most common 3 or 4 classes will have their virtualFunctionCall() successfully branch-predicted without much issue. There are also 3 levels of branch predictor pattern-matchers running in parallel, giving the CPU three different branch targets (L1 branch predictor is fastest but least accurate. L3 branch predictor is most accurate but almost the slowest: only slightly faster than a mispredicted branch).

This demonstrates the superiority of runtime information (if there's only 2 or 3 classes in the array[], the CPU will branch predict the virtualFunctionCall() pretty well). The compiler cannot make any assumptions about the contents of array.
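A self-contained version of the loop above might look like the following C++ sketch. The `Shape`, `Triangle` and `Square` classes and `totalSides` are hypothetical stand-ins for whatever `array` holds; the point is only that the target of each virtual call depends on the element's runtime type, which the compiler generally cannot know from this function alone.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical class hierarchy, for illustration only.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};
struct Triangle : Shape { int sides() const override { return 3; } };
struct Square   : Shape { int sides() const override { return 4; } };

// Each iteration performs an indirect call through the vtable; which
// override runs is decided by the dynamic type of shapes[i].
int totalSides(const std::vector<std::unique_ptr<Shape>>& shapes) {
    int total = 0;
    for (std::size_t i = 0; i < shapes.size(); i++) {
        total += shapes[i]->sides();
    }
    return total;
}
```

If the vector only ever holds two or three concrete classes, the indirect call sees few distinct targets at run time, which is the situation described above as predicting well.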

---------


By the way: most "small branches" are compiled into CMOV sequences on x86, no branch at all.

--------------

The only things being done grossly different seem to be the GPU architectures, which favor no branch prediction at all, and instead just focus on wider-and-wider SMT to fill their pipelines (and non-uniform branches are very, very inefficient because of thread divergence. Uniform branches are efficient on both CPUs and GPUs, because CPUs will branch-predict a uniform branch while GPUs will not have any divergence). Throughput vs Latency strikes again: GPUs can optimize throughput but CPUs must optimize latency to be competitive.
Posted on Reply
#71
efikkan
dragontamer5788You simply can't "branch predict" the virtualFunctionCall() much better than what we're doing today.
Of course not, you will never be able to do that, that's not what I meant.
I was thinking of branching logic inside a single scope, like a lot of ifs in a loop. Compilers already turn some of these into branchless alternatives, but I'm sure there is more potential here, especially if the ISA could express dependencies so the CPU could do things out of order more efficiently and hopefully some day limit the stalls in the CPU. As you know, with ever more superscalar CPUs, the relative cost of a cache miss or branch misprediction is growing.
Ideally code should be free of unnecessary branching, and there are a lot of clever tricks with and without AVX, which I believe we have discussed previously.
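As a rough illustration of turning an if inside a loop into a branchless alternative, consider this C++ sketch. Both function names are made up for the example, and an optimizing compiler may well emit the same branch-free machine code for both versions, so treat it as a source-level illustration rather than a guaranteed speedup.

```cpp
#include <cstddef>
#include <vector>

// Branchy form: a conditional jump decides whether to increment.
std::size_t countAboveBranchy(const std::vector<int>& v, int threshold) {
    std::size_t count = 0;
    for (int x : v) {
        if (x > threshold) {
            count++;
        }
    }
    return count;
}

// Branchless form: the comparison yields 0 or 1, which is added directly,
// so the loop body contains no data-dependent branch at source level.
std::size_t countAboveBranchless(const std::vector<int>& v, int threshold) {
    std::size_t count = 0;
    for (int x : v) {
        count += static_cast<std::size_t>(x > threshold);
    }
    return count;
}
```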

But about your virtual function calls. If your critical path is filled with virtual function calls and multiple levels of inheritance, you're pretty much screwed performance wise, no compiler will be able to untangle this at compile time. And in most cases (at least how most programmers use OOP), these function calls can't be statically analysed, inlined or dereferenced at compile time.
Posted on Reply
#72
arbiter
DemonicRyzen666How is it AMD glorified if intel is winning ?
Look up the history of the game; it was funded by AMD. That means it will over-perform on AMD hardware vs. what would happen in other games that aren't coded for one side.
DemonicRyzen666Where do you see 5900x ?, this 5800X 8 core 16 thread vs 10 core 20 thread.
Read what I said; I never said 5900X. Go back to the OP images where it shows the 2 CPUs on the right side with the summary. There are 2 numbers labeled Score: the Intel CPU scored 5900 points and the AMD CPU scored 5800. How could AMD win with higher fps but a lower score?
Posted on Reply
#73
DemonicRyzen666
@ arbiter oh I missed that, because everyone was comparing Cpu frame rates.

@ efikkan I kept hearing him talk about switching in that video. I remember something about that being why AMD multi-threading always ended up feeling more responsive than Intel. It was something about hitting ALT-TAB in Windows while gaming; it just seems to be quicker at odd stuff like that.

@dragontamer5788 There are some benches that show there is some bottleneck with Zen 2. Everyone says it's the Infinity Fabric. The best way to get around the Infinity Fabric bottleneck would be to add another link, if it's only one link, because sometimes you get that lowly 3300X landing in between things like the 3900X and 3950X. We know that is usually because it's a single CCX. Then again, if the 3900X is ahead, that would put it down to it having a larger cache-to-core ratio.
Posted on Reply
#74
InVasMani
efikkanSure, down to single instructions can be slightly faster or slower on various architectures. In my tests, I've seen some cases where Haswell is slower than Sandy Bridge, but in most cases it's faster. The problem here is that this is a benchmark of a single operation in a loop, this is a synthetic test case which will exaggerate the real world difference. The reason why he runs the loop 1.000.000.000.000 times is to get a measurable difference. Also, it's not like these operations are different alternatives to solve the same problem. It's not unlikely that you can find older architectures which can do certain simple operations like this faster, while modern architectures are optimized for saturating several execution ports and doing a mix of various types of operations. This is why such benchmarks can be very misguiding.

When doing real optimization of code, it's common to benchmark whole algorithms or larger pieces of code to see the real world difference of different approaches. It's very rare that you'll find a larger piece of code that performs much better on Skylake and a competing alternative which performs much better on let's say Zen 2. Any difference that you'll find for single instructions will be less important than the overall improvements of the architecture. And it's not like there will be an "Intel optimization", Intel has changed the resource balancing for every new architecture, so has AMD.

Interestingly the sample code in that video scales poorly with many cores, but should be able to scale nearly linearly if the work queue is implemented smarter.


Instruction level parallelism is already heavily used, there is no need to spread the ALUs, FPUs, etc. across several cores, the distance would make a synchronization nightmare. We should expect future architectures to continue to scale their superscalar abilities. But I don't doubt that someone will find a clever way to utilize "idle transistors" in some of these by manipulating clock cycles etc.

The problem with superscalar scaling is keeping execution units fed. Both Intel and AMD currently have four integer pipelines. Integer pipelines are cheap (both in transistors and power usage), so why not double or quadruple them? Because they would struggle to utilize them properly. Both of them have been increasing instruction windows with every generation to try to exploit more parallelism, and Intel's next gen Sapphire Rapids/Golde Cove is allegedly featuring a massive 800 entry instruction window (Skylake has 224, Sunny Cove 352 for comparison). And even with these massive CPU front-ends, execution units are generally under-utilized due to branch mispredictions and cache misses. Sooner or later the ISA needs to improve to help the CPU, which should be theoretically possible, as the compiler has much more context than is passed on through the x86 ISA, as well as eliminating more branching.
Couldn't AMD take chip dies and use the I/O die modulate them much like system memory for double data rate or quadruple data rate to speed up single thread performance. They'd each retain their own cache so that itself is a perk of modulating between them in synchronized way controlled thru the I/O die to complete single thread task load. For all intents and purposes the CPU would behave as if it's a single faster chip. It could basically fill the L1 cache on one then swap to the next die and same with the L2 and L3 caches. In fact they synchronize each much like numerous latency timings. On top of that if you need multi-thread performance it could have some type of first serve access priority possibly based on condition criteria. It could be a bit like the windows setting for foreground/background tasks with time slices between single thread performance and multi-threaded performance that the I/O die manages and takes advantage of when it really need the multi-threaded performance.

The cache misses definitely are harsh when they happen, but wouldn't automatically cycling between the individual L1/L2/L3 caches in different chip dies through the I/O die get around that? Cycle between the ones available, basically. Perhaps they'd only do it with the larger L2/L3 caches, though; maybe it doesn't make enough practical sense with the L1 cache being so small, given switch times and such. Perhaps in a future design at some level or another, I don't know.

Something else on the I/O die doing modulation switching between cores or dies: at the core level in particular, it could poll the chips and select whichever can precision boost the highest for single thread performance, then poll again after a set period and select whichever core gave the best results, and keep repeating that approach. Basically, no matter what, it could always try to select the highest boost speed to optimize single thread performance. Perhaps it does that between cores and dies as well, so if one gets a little hot, let it cool off while making use of the coolest die, though switching between those might be less frequent.
Posted on Reply
#75
dragontamer5788
efikkanOf course not, you will never be able to do that, that's not what I meant.
I was thinking of branching logic inside a single scope, like a lot of ifs in a loop. Compilers already turn some of these into branchless alternatives, but I'm sure there is more potential here, especially if the ISA could express dependencies so the CPU could do things out of order more efficiently and hopefully some day limit the stalls in the CPU. As you know, with ever more superscalar CPUs, the relative cost of a cache miss or branch misprediction is growing.
Ideally code should be free of unnecessary branching, and there are a lot of clever tricks with and without AVX, which I believe we have discussed previously.
Possibly, I think we have talked about this issue before.

Dependency management on today's CPUs and compilers is a well solved problem: "xor rax, rax" cuts a dependency, allocates a new register from the reorder buffer, and starts a parallel calculation that takes advantage of super-scalar CPUs. It's a dirty hack, but it works, and it works surprisingly well. I'm not convinced that a new machine-code format (with more explicit dependency matching) is needed for speed.

I think the main advantage to a potential "dependency-graph representation" would be power-consumption and core-size. Code with more explicit dependencies encoded could have smaller decoders that use less power, leveraging information that the compiler already calculated (instead of re-calculating it from scratch, so to speak).

Modern ROBs are 200+ long already, meaning the CPU can search ~200 instructions looking for instruction-level parallelism. And apparently these reorder buffers are only going to get bigger (300+ for Icelake).
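A source-level cousin of the dependency-cutting idea, sketched in C++ under the assumption that the compiler does not already restructure the loop on its own: splitting one long chain of dependent additions into several independent accumulators gives the out-of-order core separate chains to overlap. This is not the xor zeroing idiom itself, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition depends on the result of the previous one.
std::uint64_t sumOneChain(const std::vector<std::uint32_t>& v) {
    std::uint64_t sum = 0;
    for (std::uint32_t x : v) {
        sum += x;
    }
    return sum;
}

// Four accumulators: four independent dependency chains that the CPU can
// execute in parallel, combined once at the end.
std::uint64_t sumFourChains(const std::vector<std::uint32_t>& v) {
    std::uint64_t a = 0, b = 0, c = 0, d = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
        d += v[i + 3];
    }
    for (; i < v.size(); i++) {  // leftover elements
        a += v[i];
    }
    return a + b + c + d;
}
```

Both functions return the same sum; whether the second is actually faster depends on the compiler and the core, so it would need measuring on the hardware in question.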
Posted on Reply