Friday, December 2nd 2022
AMD Readies 16-core, 12-core, and 8-core Ryzen 7000X3D "Zen 4" Processors
AMD is firing on all cylinders to release a new line of Ryzen 7000-series "Zen 4" Socket AM5 desktop processors featuring 3D Vertical Cache. Faced with a significant drop in demand due to the slump in the PC industry, and renewed competition from Intel in the form of its 13th Gen Core "Raptor Lake" processors, the company is looking to launch the Ryzen 7000X3D desktop processors in January 2023, with a product unveiling expected at AMD's 2023 International CES event. The 3D Vertical Cache technology had a profound impact on the gaming performance of the older "Zen 3" architecture, bringing it up to levels competitive with those of the 12th Gen Core "Alder Lake" processors; and while the gaming performance of the Ryzen 7000 "Zen 4" processors launched to date matches or beats "Alder Lake," it falls behind that of the 13th Gen "Raptor Lake," which is exactly what AMD hopes to remedy with the Ryzen 7000X3D series.
In a report, Korean tech publication Quasar Zone states that AMD is planning to release 16-core/32-thread, 12-core/24-thread, and 8-core/16-thread SKUs in the Ryzen 7000X3D series. These would use one or two "Zen 4" chiplets with stacked 3D Vertical Cache memory. A large amount of cache memory operating at the same speed as the on-die L3 cache is made contiguous with it and stacked on top of the region of the CCD (chiplet) that has the L3 cache, while the region with the CPU cores has structural silicon that conveys heat to the surface. On "Zen 3," the 32 MB on-die cache is appended with 64 MB of stacked cache memory operating at the same speed, giving the processor 96 MB of L3 cache that's uniformly accessible by all CPU cores on the CCD. This large cache memory positively impacts gaming performance on the Ryzen 7 5800X3D in comparison to the 5800X, and a similar uplift is expected for the 7000X3D series over their regular 7000-series counterparts.

The naming of these 7000X3D series SKUs is uncertain. It's possible that the 16-core part will be called the 7950X3D and the 12-core part the 7900X3D, while the 8-core part may be called either the 7700X3D or the 7800X3D. Quasar Zone also posted some theoretical performance projections for the 7950X3D based on the kind of performance uplifts 3DV Cache yielded for "Zen 3" in the 5800X3D. According to these, the theoretical 7950X3D would easily match or beat the gaming performance of the Core i9-13900K, which begins to explain why Intel is scrambling to launch the faster Core i9-13900KS with a boost frequency of 6.00 GHz or higher. The report also confirms that there won't be a 6-core/12-thread 7600X3D as previously thought.
Source:
harukaze5719 (Twitter)
153 Comments on AMD Readies 16-core, 12-core, and 8-core Ryzen 7000X3D "Zen 4" Processors
The 5600 is, IMO, the best choice for 99% of gamers out there - you need a serious GPU for it to ever even be a limit. That's because it helps gaming and server applications; there's very little overlap.
Because the 5800X3D does run at lower clocks (overall mine runs 4.45 GHz all-core vs. 4.6 GHz in AVX workloads, soooo much slower), it'd be hard to advertise it as an all-purpose product when it'd have a deficit in some commonly used setups.
What'd be amazing is if AMD had the pull Intel does with Microsoft, and could release a CPU with one 3D-stacked die and others without - the 3D die becomes the P-cores, and the others do the higher-wattage, boring workloads.
With fast games pushing tick rates of 120 Hz (8.3 ms) or higher, individual steps in the pipelined simulation can have a performance budget of 1 ms or much less, which leaves very small margins for delays caused by threading. At this fine level, splitting a task across threads may actually hurt performance, especially when it comes to stutter. It may even cause simulation glitches or crashes, as we've seen in some games. For this reason, even heavily multithreaded games usually run separate tasks on separate threads, e.g. ~2-3+ threads for GPU interaction, usually 1 for the core game simulation, 1 for audio, etc. Multithreading has been common in games since the early 2000s, and game engines still to this day have control over what to run on which cores. Current graphics APIs certainly can accept commands from multiple threads, but to what purpose? This just leaves the driver with the task of organizing the commands, and they're shoved into a single queue of operations internally either way. The purpose of multiple threads with a GPU context is to execute independent queues, e.g. multiple render passes, compute loads, asset loading, or possibly multiple viewports (e.g. split screen).
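To make that "dedicated thread per task" layout concrete, here's a minimal sketch of my own (not taken from any particular engine): one simulation thread on a fixed ~8.3 ms tick, one render-submission thread, and one audio thread, each running independently instead of splitting a single tick across a pool. All function names are illustrative placeholders.
[CODE]
// Minimal sketch of the "one dedicated thread per task" layout described above.
// Names (runSimulation, runRenderSubmit, runAudio) are placeholders, not real engine code.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using namespace std::chrono;

std::atomic<bool> running{true};

void runSimulation() {                        // core game simulation, fixed ~120 Hz tick
    const auto tick = microseconds(8333);     // 8.3 ms budget per tick
    while (running) {
        const auto start = steady_clock::now();
        // ... advance game state here ...
        std::this_thread::sleep_until(start + tick);
    }
}

void runRenderSubmit() {                      // builds and submits GPU work, paced by the driver/vsync
    while (running) {
        // ... record and submit command buffers here ...
        std::this_thread::sleep_for(milliseconds(8));
    }
}

void runAudio() {                             // fills audio buffers on its own cadence
    while (running) {
        // ... mix the next audio block here ...
        std::this_thread::sleep_for(milliseconds(10));
    }
}

int main() {
    std::thread sim(runSimulation), render(runRenderSubmit), audio(runAudio);
    std::this_thread::sleep_for(milliseconds(100));   // pretend the game runs for a moment
    running = false;
    sim.join(); render.join(); audio.join();
    std::puts("all task threads joined");
}
[/CODE]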
Usually, when a game's rendering is CPU bottlenecked, the game is either not running the API calls effectively or using the API "incorrectly". In such cases, the solution is to move the non-rendering code out of the rendering thread instead of creating more rendering threads. Data cache line hits in L3 from other threads are (comparatively) rare, as the entire cache is overwritten every few thousand clock cycles, so the window for a hit here is very tiny.
Instruction cache line hits from other threads in L3 are more common, however, e.g. if multiple threads execute the same code but on different data. Actually, this misconception is why we see so little gain from 3D V-Cache. The entire L1/L2/L3 hierarchy is overwritten many times in the lifecycle of a single frame, so there is no possibility of benefits here across frames.
We only see some applications and games benefit significantly from massive L3 caches, and it's usually not the ones which are the most computationally intensive; this is because of instruction cache hits, not data. When software is sensitive to instruction cache, it's usually a sign of unoptimized or bloated code. For this reason I'm not particularly excited about extra L3 cache, as this mainly benefits "poor" code. But when the underlying technology eventually is used to build bigger cores with varying features, now that's exciting.
The data proves what I'm saying. Most games today just map the processor topology for things like CCDs (to schedule everything on one CCD on a Ryzen 9, for example), E-cores, SMT, etc. They don't really care whether a job runs on core X or core Y. Otherwise, they would have to code their game for every variation of CPU: what if there is no SMT, what if there are only 4 cores, and so on.
And still, the main thread remains the bottleneck in most cases. Otherwise you would see 20-25% gains going from 6 cores to 8 cores in a CPU-limited scenario, and that is still not the case. Also, if threads were assigned statically by the engine, games would run really poorly on newer-gen CPUs with fewer cores (like the 7600X). But even today, that CPU kicks the ass of many other CPUs with more cores, because core count doesn't matter; only raw performance does. The driver doesn't decide what to do on its own. The CPU still has to send commands to the driver, and this is the part that is now more easily multithreadable with modern APIs.
Beyond that, all drivers are more or less multithreaded. One of the main advantages Nvidia had with older APIs, where it was harder to write multithreaded code, was that their drivers were more multithreaded, though they also had higher CPU overhead. AMD caught up in June with its older DirectX and OpenGL drivers. But anyway, once the driver has done its job, all the rest is done by the GPU and the number of cores doesn't really matter. But is it? This is an oversimplification. It can be true in some cases, but not always. This is a discourse I hate to read. It's just plain false, and most people don't really understand what a game is doing.
If we take your reasoning to its conclusion, it means we could have a game with perfect physics, perfect AI, and photorealism with an unlimited number of assets, and if the game doesn't run properly, it's just because the game doesn't use the API "correctly". It's not a misconception at all. It may be true that L1/L2 are overwritten many times in the lifecycle of a single frame, but that is what they are intended for: they need to cache data for a very short amount of time, and many cycles can still happen in that time.
The fact that caches are overwritten is not an issue, as long as they aren't overwritten while they are still required. But anyway, CPUs have mechanisms to predict what needs to be in cache, what needs to stay in cache, and so on. They look ahead at the code and see which portions of memory will be needed. One misconception we may have is to think in terms of cache and data: caches don't cache data, they cache memory space.
The CPU detects which memory regions it needs to continue working and prefetches them. There are other mechanisms in the CPU that decide whether some memory regions need to remain cached because they are likely to be reused.
Having more cache allows you to be more flexible and balance those things quite easily. Also, a CPU might still reuse some data hundreds of times before it gets purged from the L3 cache.
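As a toy illustration of that "caches hold memory space, fetched ahead of use" idea (my own sketch, nothing from this thread), the loop below walks an array far larger than any L3 and requests the line a few dozen elements ahead of the current one, which is roughly what the hardware prefetcher does on its own; __builtin_prefetch is a GCC/Clang builtin, and the distance of 64 elements is an arbitrary choice.
[CODE]
// Toy software-prefetch example: pull future cache lines toward the core before they're needed.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<std::uint64_t> data(1 << 24, 1);    // ~128 MB, far larger than any L3
    const std::size_t ahead = 64;                   // prefetch distance, in elements (arbitrary)
    std::uint64_t sum = 0;

    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + ahead < data.size())
            __builtin_prefetch(&data[i + ahead]);   // hint: bring this memory line into cache
        sum += data[i];                             // the actual work touches lines already fetched
    }
    std::printf("sum = %llu\n", static_cast<unsigned long long>(sum));
}
[/CODE]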
The working set for a single frame must remain small anyway. In an ideal scenario, let's say you have 60 GB/s of bandwidth and you want 60 FPS: that means that even if you could access all your data instantaneously, you cannot read more than 1 GB of data during that frame (unless, indeed, you cache it).
If you add the wait for each memory access on top of that (a typical access is a few bytes, and you wait 50-70 ns to get it), it's probably more like 200-300 MB per frame.
You can see that with a larger cache, you can now cache a significant portion of it.
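Spelling that back-of-the-envelope math out with the numbers above (the "~14 misses in flight" figure is my own assumption about memory-level parallelism, not something stated in the post):
[CODE]
// Per-frame memory budget at 60 GB/s and 60 FPS, per the numbers above.
#include <cstdio>

int main() {
    const double bandwidth_gb_s = 60.0;          // assumed DRAM bandwidth
    const double frame_s        = 1.0 / 60.0;    // 60 FPS target

    // Ceiling: bandwidth alone caps per-frame traffic at ~1 GB.
    const double ceiling_mb = bandwidth_gb_s * 1e9 * frame_s / 1e6;

    // Floor: fully serialized 64-byte misses at ~60 ns each.
    const double serial_mb = (frame_s / 60e-9) * 64.0 / 1e6;

    std::printf("bandwidth ceiling:         %.0f MB/frame\n", ceiling_mb);        // ~1000 MB
    std::printf("fully serialized misses:   %.0f MB/frame\n", serial_mb);         // ~18 MB
    std::printf("with ~14 misses in flight: %.0f MB/frame\n", serial_mb * 14.0);  // ~250 MB
}
[/CODE]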
Really looking forward to seeing benchmarks and prices.
Delays may happen any time the OS scheduler kicks in one of the numerous background threads. Let's say, for example, you have a workload of 5 pipelined stages, each consisting of 4 work units, and a master thread divides these among 4 worker threads. If each stage has to be completed by all threads and then synced up, then a single delay will stall the entire pipeline. This is why thread pools scale extremely well with independent work chunks and not well otherwise. While requesting affinity is possible, I seriously doubt most games are doing that. That makes no sense; that's not how performance scaling in games works. My wording was incorrect, I meant to say "on which thread", not which (physical) core, which misled you.
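A minimal model of that pipeline, written by me purely to illustrate the point (C++20 for std::barrier and std::jthread): 5 stages, 4 workers, a barrier between stages, and one artificial 2 ms "preemption" so you can see the whole pipeline pay for its slowest worker.
[CODE]
// 5 pipelined stages split across 4 workers; every stage waits on its slowest worker.
#include <barrier>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int workers = 4, stages = 5;
    std::barrier<> sync(workers);                 // all workers must arrive before the next stage

    auto worker = [&](int id) {
        for (int s = 0; s < stages; ++s) {
            // ... do this worker's quarter of stage s ...
            if (id == 0 && s == 2)                // pretend worker 0 gets preempted here
                std::this_thread::sleep_for(std::chrono::milliseconds(2));
            sync.arrive_and_wait();               // everyone stalls until the last worker arrives
        }
    };

    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::jthread> pool;
    for (int i = 0; i < workers; ++i) pool.emplace_back(worker, i);
    pool.clear();                                 // jthread destructors join all workers
    const double ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
    std::printf("pipeline took %.1f ms (one 2 ms hiccup delayed every thread)\n", ms);
}
[/CODE]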
The OS scheduler clearly decides where to run the threads. The current graphics APIs are fairly low overhead. Render threads are not normally spending the majority of their CPU time with API calls. Developers will easily see this with a simple profiling tool. This is just silly straw man argumentation. I said no such thing. :rolleyes:
I was talking about rendering threads wasting (CPU) time on non-rendering work. L3 (in current Intel and AMD CPUs) is a spillover cache for L2, meaning anything evicted from L2 will end up there. If a CPU has 100 MB of L3 cache, and the CPU reads ~100 MB of (unique) data from memory, then anything in L3 which hasn't been promoted into L2 will be overwritten after this data is read. It is a misconception that "important data" remains in cache. The prefetcher does not preserve/keep important cache lines in L3, and for anything to remain in cache it needs to be reused before it's evicted from L3 too. Every cache line the prefetcher loads into L2 (even one which is ultimately not used) will evict another cache line. Considering the huge amount of data the CPU is churning through, including all the data it ultimately prefetches unnecessarily, the hit rate of extra L3 cache falls off very rapidly. Other threads, even low-priority background threads, also affect this. While one or more threads are working on the rendering, other threads will process events, do game simulation, etc. (plus other background threads), all of which will traverse different data and code, which will "pollute" (or rather compete over) the L3.
What you describe is how the OS would switch between threads if resources are limited (no more cores/SMT threads available) and it needs to switch between threads to get the job done faster.
If the OS were really that slow to assign threads, everything you do on your computer would feel awfully slow; that would leave so much performance on the table. It's not affinity, it's knowing how many threads to launch to get some balance. That indeed makes no sense for game scaling, but it is how it works for things that aren't bottlenecked by a single thread, like 3D rendering. Agree! They are lower overhead, but that doesn't mean zero, and having lower overhead lets you do more with the same resources. That was unclear, but if a game indeed wastes too much time on useless things, that is indeed bad... You are right that L3 is a victim cache and will contain previously used data (which may or may not have been fetched by the prefetcher). Having a larger L3 cache allows you to have a more aggressive prefetcher without too much penalty.
Indeed, the benefits of cache are not linear. But tripling it will give a significant performance gain when the code frequently reuses the same data, and that is what games usually do. Your concern is also totally valid, though, because many other applications just load new data all the time and barely reuse it, making the extra L3 cache almost useless.
But at the processor's time scale (not a human time frame), in the best case where the 6 cores read from memory at a combined 60 GB/s, it would take about 1.5 ms to flush the cache with fresh data. 60 GB/s is about what DDR4-3800 gives you in a theoretical benchmark. With fast DDR5 like DDR5-6000, that would be close to 90 GB/s, meaning the cache would be flushed every ~1 ms.
This is indeed close to the time frame a game takes to render a frame, and it's a very theoretical scenario. Also, a read from main memory takes 45-100 ns, whereas an L3 hit only takes about 10 ns, and the bandwidth is also about 10 times higher. You don't need a very high hit ratio within a 16.6 ms frame to get a significant gain, and every cache hit leaves the memory controller free to perform other memory accesses.
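Writing that arithmetic out with the same assumed figures (96 MB of total L3, ~60 GB/s for DDR4-3800, ~90 GB/s for DDR5-6000):
[CODE]
// How long sustained DRAM traffic takes to turn over a 96 MB L3, per the numbers above.
#include <cstdio>

int main() {
    const double l3_mb = 96.0;                    // 5800X3D-style 32 MB on-die + 64 MB stacked
    const double speeds_gb_s[] = {60.0, 90.0};    // ~DDR4-3800 vs ~DDR5-6000, theoretical

    for (double gb_s : speeds_gb_s) {
        const double ms = (l3_mb * 1e6) / (gb_s * 1e9) * 1e3;
        std::printf("%5.0f GB/s -> L3 contents turned over every ~%.1f ms\n", gb_s, ms);
    }
    std::printf("per access: ~10 ns for an L3 hit vs ~45-100 ns from DRAM\n");
}
[/CODE]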
Even if you are running more tasks, background tasks, etc., you will still get cache hits at some point. A lot of that work will reuse the same data even within a thread, making the L3 worthwhile. And the thing is, the L3 doesn't cache "things", it caches memory lines, and memory lines hold both instructions and data. Those instructions can be reused as part of a loop many times. Smaller loops will happen entirely within the registers or L1, but some take longer (and/or have branches that make them longer to run) and benefit from having L2 and L3.
Again, the data proves this point: even though it clocks slightly lower than the 5800X, the 5800X3D beats it in almost every game.
There are two scenarios where the extra cache will not be beneficial in gaming (in a CPU-limited scenario):
1. A working set that is too large. In that case, due to memory latency, you're talking about a game with very low FPS that is memory-bandwidth limited.
2. A game with a much smaller dataset that fits, or almost fits, in the 32 MB L3 of the 5800X, making the extra cache useless.
In all reality, I will probably still wait for a next-gen 8950X3D just to get past the first-AM5-gen hiccups and get better ecosystem stability. My R7 2700/GTX 1070 is basically adequate for now. :twitch: I am used to the issues it has with my workflows and can hopefully get by with it until I can go big on a whole new system (actually need to replace the 1070 soon. :cry: Not sure it can keep going too much longer). Hopefully an RX 8950 XTX is available then too and I can do a matching system of 8950's... that would make my brain happy. :laugh:
We'll see... looking forward to RDNA3 in a few days and will see how things shape up for my video business in the new year. It is a side gig right now, but it's building.
I know you were specifically referencing "frametimes", but I thought the above was necessary for the less informed reader. As for frametime analysis:
[URL='https://www.techpowerup.com/review/amd-ryzen-5-7600x/23.html'][SIZE=4][U]7600X frametime analysis[/U][/SIZE][/URL]
[URL='https://www.techpowerup.com/review/amd-ryzen-7-7700x/23.html'][SIZE=4][U]7700X frametime analysis[/U][/SIZE][/URL]
I don't see anything of significance to axe 6-core gaming parts as "silly" in "2023". Same goes for the previous-gen 5600X/12600K... still fantastic for getting your feet wet with smooth gameplay. Moreover, not everyone is willing to fork out $400-$500 for a flagship gaming chip regardless of the performance feat... a more affordable 6-core X3D would have been nice, and a tempting green card to get on board an already expensive DDR5 AM5 platform. I get the feeling the 7600X3D will be something expected a little later, after AMD has exhausted its current elevated Zen 3 sales (5600~5800X3D).

Think of the amount of data vs. the amount of code that the CPU is churning through in a given timeframe. As you hopefully know, most games have vastly less code than memory allocated, and even with system libraries etc., data is still much larger than code. On top of this, any programmer will know most algorithms execute code paths over and over, so the chances of a given cache line getting a hit are at least one order of magnitude higher if the cache line is code (vs. data), if not two. On top of this, the chance of a core getting a data cache hit from another core is very tiny, even much smaller than a data cache hit from its own evicted data cache line. This is why programmers who know low-level optimization know that L3 cache sensitivity often has to do with the instruction cache, which in turn is very closely tied to the computational density of the code (in other words, how bloated the code is). This is why so few applications get a significant boost from loads of extra L3; they are simply too efficient, which is actually a good thing, if you can grasp it.
And as for your idea of running entire datasets in L3, this is not realistic in real-world scenarios on desktop computers. There are loads of background threads which will "pollute" your L3, and even just a web browser with some idling tabs will do a lot. For your idea to "work" you need something like an embedded system or a controlled environment. Hypothetically, if we had a CPU with L3 split into instruction and data caches, now that would be a bit more interesting.
We are in the process of watching AMD stupidly snatch defeat from the jaws of victory. :kookoo:
But some will, and they are going to get their high margins, which is the only thing they care about.
They do not want to sell 6-core CPUs, because they are using 8-core chiplets. And I doubt many of those super tiny chiplets are defective, probably only some of the ones on the very edge of the wafer.
A 7600X3D would hinder the sales of the 8-core X3D model. They would sell millions of them, but their margins would be super low, and that goes against everything corporations believe in.
Every now and then each company will have a product with unbelievable value, but it will never become a common thing, because the big heads will not allow it.
You're essentially suggesting anyone purchasing these chips has no intention of gaming? You can't ignore MT specialists being as human as you and I - there's a nice chunk of these workstation-stroke-gaming pundits who will definitely buy into these chips. I myself would fancy something like a 7900X3D for gaming, transcoding and the occasional video render... although since these non-gaming tasks are irregular, I'd save my money and grab a 7700X3D (not that I'm looking to buy into AM5 unless DDR5/mobo prices shrink).
I saw that the X3D is good in Flight Simulator.
I have an A320 motherboard because I came across one that was so hilarious that I had to buy it. Canada Computers had the Biostar A320M on clearance for only CA$40! My first build back in 1988 had a Biostar motherboard, so I have a soft spot in my heart for them. This board looked so pathetic that I just couldn't say no to it. It was actually kinda cute.
It doesn't even have an M.2 port! The thing is though, it cost CA$40, and IT WORKS! You can't go wrong there.

Well, those big heads need to be given a shake, because that's how you gain an advantageous market position. It's the difference between strategy and tactics. Just look at what nVidia did with the GTX 1080 Ti. Was it a phenomenal value? Absolutely! Is it one of the major reasons why nVidia is seen as the market leader? Absolutely! The long-term benefits typically outweigh the short-term losses. This isn't some new revelation either, so those "big heads" must be pretty empty if they can't comprehend this.

I do everything and anything on my PC too. The thing is, the most hardware-hardcore tasks that I do are gaming tasks. For everything else, hell, even my old FX-8350 would probably be fast enough. It's not like Windows or Firefox have greatly increased their hardware requirements in the last ten years.

Yep. From what I understand, the R7-5800X3D is currently the fastest CPU for FS2020. God only knows what kind of performance a Zen 4 X3D CPU would bring. At the same time, I don't think that any performance advantage over the 5800X3D would be of significant value, because it already runs FS2020 perfectly and you can't get better than perfect.

So you opened his eyes to what is possible! See, I love reading things like this, and while I'm glad that you did what you did (because his life may never be the same now, which is a good thing), it also serves to underscore just how oblivious a lot of workstation users are when it comes to gaming on a PC. I believe that most of the 12 and 16-core parts get bought by businesses for their own workstations (assuming that they don't need anything like Threadripper or EPYC), and gaming is far from their desired use. In fact, I'd be willing to bet that Zen 4's IGP makes those CPUs even more attractive to businesses, because how much GPU power does an office app use?

I agree. The R5-5600 is a tremendous value, but only in the short term. In the long term, the R7-5800X3D will leave it in the dust. It's like back when people were scooping up the R3-3100X faster than AMD could make them. They were an unbelievable value at the time, but, like most quad-cores, their legs proved to be pretty short over the years.

Well, here's the catch... Technically, ALL x86 CPUs are all-purpose products. There's nothing that the R9-7950X can do that my FX-8350 can't. It's just a matter of how fast, and so for certain tasks, there are CPUs that are better than others. However, that doesn't change the fact that they can all do everything that any other x86 CPU can do. That does meet my definition of all-purpose.
Q: Can an R7-5800X3D run blender?
A: Yep.
Q: Can it run ARM applications?
A: Yep.
Q: Can it run Adobe Premiere?
A: Yep.
Q: Can it run a virtual machine?
A: Yep.
Q: Can it run MS-Office or LibreOffice?
A: Yep.
Q: Can it run a multimedia platform?
A: Yep.
Q: Can it run a local server?
A: Yep.
Q: Can it run 7zip?
A: Yep.
Q: Can it run DOOM?
A: Tell me one thing that CAN'T run DOOM!
And finally...
Q: CAN IT RUN CRYSIS?
A: Yep.
Ok, so yeah, it's an all-purpose CPU! That's why I think AMD made a serious screwup by not having a hexacore 3D CPU. The reason that Intel has that much pull with MS is the fact that most Windows PCs still have an Intel CPU at their heart. If AMD managed to conquer the gamer market with an R5-7600X3D, then they too would have considerable pull with Microsoft, far more than they have now anyway.
I do however agree that an R5-5600X3D would've made more sense than the R7-5800X3D. The thing is, I gave them a pass because it was their first "kick at the can", so to speak. I won't give them the same pass with AM5 because the 5800X3D showed that they clearly got it right. It's not just that they're being greedy, it's that they're being stupid. I really don't think that the 3D cache has enough of a positive impact on productivity for productivity users to be willing to pay extra for it, so these chips will just gather dust like the RTX 4080 and RX 7900 XT. That's not good for anybody.
The trouble with a 6-core X3D, 5K or 7K series, is that the X3D bit commands a $150 price premium - would you have bought a $400 6-core? I think enough people were complaining about the low core count/MT performance of the 5800X3D.