
Die-shot Suggests "Phoenix 2" is AMD's First Hybrid Processor

Joined
Jun 14, 2020
Messages
4,137 (2.45/day)
System Name Mean machine
Processor AMD 6900HS
Memory 2x16 GB 4800C40
Video Card(s) AMD Radeon 6700S
What's different is that a Zen 4c core will behave exactly like a Zen 4 core when at the same clock speed (except maybe for cache, I'm not sure). So if the Zen 4 cores are loaded, a third Zen 4 core couldn't boost enough to outperform a Zen 4c core. So there shouldn't be any difference in performance. Now Windows does have to schedule to the Zen 4 cores first, but that's trivial. Remember that because of simultaneous multi-threading, this processor presents 12 logical cores to Windows, and Windows has to choose just one thread on each core until it gets to 7 threads. I don't hear concern about how that's working.
If the Zen 4c core performed as well as the full-fat core, then there wouldn't be any full-fat cores. Obviously that is not the case: Zen 4c will be slower, so it will have the same "issues" E-cores do.
 
Joined
Mar 17, 2017
Messages
97 (0.03/day)
Location
Europe
Processor Ryzen 9 9950X
Motherboard X670 chipset
Cooling Arctic Liquid Freezer III 240
Memory 64 GiB
Video Card(s) RX 6700XT
Storage WD Black SN750, Seagate FireCuda 530, Samsung SSD 850 Pro, WD Blue HDD, Seagate IronWolf HDD
Display(s) Samsung (4K, FreeSync)
Case Phanteks NEO Air
Power Supply EVGA 750 B5
Mouse Eternico wireless mouse
Keyboard HyperX Alloy Origins Core Aqua with Corsair Onyx Black keycaps
Software Linux + KVM
And how is this different in practice? If a workload decides to load the C core instead of the full fat core, the end result is the same
A factor that makes a practical difference arises when the number of tasks running on the CPU is higher than the number of P-cores (irrespective of whether it is an AMD or an Intel P-core): will another task scheduled on an already half-used P-core (which has HT/SMT) run faster or slower than it would on an unused E-core, taking into account that running a 2nd thread on a P-core will slow down the 1st thread on that P-core by about 40%? Instead of time, the question can also be reformulated in terms of power usage. Given the numbers and performance ratios, the usual core allocation order (when optimizing for time and not for power usage) is the following: 1 thread per P-core, then 1 thread per E-core, then the 2nd threads on P-cores, then the 2nd threads on E-cores (if the E-cores have HT/SMT). If optimizing for power usage instead of time, the allocation order is different. The allocation order can also be different if the application (or the OS) knows that certain threads running on the CPU are heavily sharing data via L1D / L1I / L2 caches.
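The trade-off in that question can be put into rough numbers. The following is a minimal sketch, assuming that both SMT threads end up running at about 60% of solo speed (the "about 40%" figure above applied to both threads) and using a made-up E-core speed ratio. The names `marginal_gain_smt`, `best_placement` and `e_core_speed` are illustrative only, not measured data or any scheduler's API.

```python
def marginal_gain_smt(slowdown=0.40):
    """Extra throughput from putting a 2nd thread on a busy P-core.

    Assumption: both threads then run at (1 - slowdown) of the speed a
    single thread would get alone on that P-core.
    """
    both_threads = 2 * (1.0 - slowdown)
    return both_threads - 1.0  # the core delivered 1.0 with one thread


def best_placement(e_core_speed, slowdown=0.40):
    """Pick where the next thread adds more total throughput.

    e_core_speed is the hypothetical speed of an idle E-core relative to
    a solo P-core thread (1.0 would mean the same speed).
    """
    if e_core_speed > marginal_gain_smt(slowdown):
        return "E-core"
    return "P-core SMT sibling"


if __name__ == "__main__":
    print("SMT adds", round(marginal_gain_smt(), 2), "of a P-core")
    print("Next thread goes to:", best_placement(e_core_speed=0.5))
```

Under these assumptions an idle E-core wins whenever it delivers more than roughly a fifth of a P-core's single-thread speed, which is consistent with the "1 thread per P-core, then 1 thread per E-core" ordering when optimizing for time.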
 
Joined
Nov 26, 2021
Messages
1,738 (1.50/day)
Location
Mississauga, Canada
Processor Ryzen 7 5700X
Motherboard ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling Noctua NH-C14S (two fans)
Memory 2x16GB DDR4 3200
Video Card(s) Reference Vega 64
Storage Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s) Nixeus NX-EDG27, and Samsung S23A700
Case Fractal Design R5
Power Supply Seasonic PRIME TITANIUM 850W
Mouse Logitech
VR HMD Oculus Rift
Software Windows 11 Pro, and Ubuntu 20.04
A factor that makes a practical difference arises when the number of tasks running on the CPU is higher than the number of P-cores (irrespective of whether it is an AMD or an Intel P-core): will another task scheduled on an already half-used P-core (which has HT/SMT) run faster or slower than it would on an unused E-core, taking into account that running a 2nd thread on a P-core will slow down the 1st thread on that P-core by about 40%? Instead of time, the question can also be reformulated in terms of power usage. Given the numbers and performance ratios, the usual core allocation order (when optimizing for time and not for power usage) is the following: 1 thread per P-core, then 1 thread per E-core, then the 2nd threads on P-cores, then the 2nd threads on E-cores (if the E-cores have HT/SMT). If optimizing for power usage instead of time, the allocation order is different. The allocation order can also be different if the application (or the OS) knows that certain threads running on the CPU are heavily sharing data via L1D / L1I / L2 caches.
Optimizing for power is more complex, especially with Intel's E-cores, which are less power-efficient than P-cores for many tasks.

[Attached image: 1694443361019.png]
 
Joined
Feb 18, 2005
Messages
5,931 (0.81/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) Dell S3221QS(A) (32" 38x21 60Hz) + 2x AOC Q32E2N (32" 25x14 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G604
Keyboard Razer Pro Type Ultra
Software Windows 10 Professional x64
If the Zen 4c core performed as well as the full-fat core, then there wouldn't be any full-fat cores. Obviously that is not the case: Zen 4c will be slower, so it will have the same "issues" E-cores do.
It will almost certainly be slower in some scenarios. What those scenarios are we don't know yet, and they may be irrelevant to most users.
 
Joined
Mar 17, 2017
Messages
97 (0.03/day)
Location
Europe
Processor Ryzen 9 9950X
Motherboard X670 chipset
Cooling Arctic Liquid Freezer III 240
Memory 64 GiB
Video Card(s) RX 6700XT
Storage WD Black SN750, Seagate FireCuda 530, Samsung SSD 850 Pro, WD Blue HDD, Seagate IronWolf HDD
Display(s) Samsung (4K, FreeSync)
Case Phanteks NEO Air
Power Supply EVGA 750 B5
Mouse Eternico wireless mouse
Keyboard HyperX Alloy Origins Core Aqua with Corsair Onyx Black keycaps
Software Linux + KVM
It is considered to be complex today (year 2023). A major issue is that most operating systems aren't designed to measure total task power. If they were able to measure it, optimizing for total task consumption would be very simple from the end user's perspective (i.e., just a few key presses or mouse clicks to turn such an optimization target on or off). Sometime in the future, it will be simple.
 
Joined
Jun 1, 2021
Messages
311 (0.23/day)
A factor that makes a practical difference arises when the number of tasks running on the CPU is higher than the number of P-cores (irrespective of whether it is an AMD or an Intel P-core): will another task scheduled on an already half-used P-core (which has HT/SMT) run faster or slower than it would on an unused E-core, taking into account that running a 2nd thread on a P-core will slow down the 1st thread on that P-core by about 40%? Instead of time, the question can also be reformulated in terms of power usage. Given the numbers and performance ratios, the usual core allocation order (when optimizing for time and not for power usage) is the following: 1 thread per P-core, then 1 thread per E-core, then the 2nd threads on P-cores, then the 2nd threads on E-cores (if the E-cores have HT/SMT). If optimizing for power usage instead of time, the allocation order is different. The allocation order can also be different if the application (or the OS) knows that certain threads running on the CPU are heavily sharing data via L1D / L1I / L2 caches.
Most of what you say is already applicable to systems with only one core type. Note, too, that a lot of the things you take as assumptions aren't necessarily true.

You talked about a P-core that is 'half-used', by which you seem to mean one thread running on one of the two hardware threads. Will scheduling a new task on the other hardware thread end up harming the performance of the first? Maybe.

There is actually no way of knowing without analyzing exactly what each of those threads is doing. For all we know, the first thread might be stalled because it's accessing some peripheral (say, an SSD) with latency on the order of microseconds (note that at GHz clock speeds a cycle takes a nanosecond or less); if it's not using the core's resources anyway, then it likely won't matter. The same goes if, say, the first thread doesn't use FP/vector code and the second one uses it very heavily. And so on: threads can end up stalling for whatever reason (OoO execution tries to hide that, but it's not perfect), so one thread basically has all the core's resources for some time.
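To give a sense of scale for the microseconds-versus-nanoseconds point, here is a small back-of-the-envelope sketch. The 100 microsecond latency and the 4 GHz clock are hypothetical round figures chosen for illustration, not measurements of any particular SSD or CPU.

```python
def stalled_cycles(latency_ns, clock_ghz):
    """Core clock cycles that elapse while a thread waits on a device.

    One GHz is one cycle per nanosecond, so cycles = nanoseconds * GHz.
    """
    return latency_ns * clock_ghz


if __name__ == "__main__":
    # Hypothetical: a 100 microsecond (100,000 ns) read on a 4 GHz core.
    cycles = stalled_cycles(latency_ns=100_000, clock_ghz=4)
    print(f"{cycles:,} cycles pass while the first thread waits")
```

During a wait of that length, the sibling hardware thread effectively has the core to itself.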

Also, it's not as simple as your solution seems to suggest: you are just locking threads to core types as they are spawned. One good reason, to start with, is that any modern OS will have more than a hundred threads running at the same time...

Well, there are a lot of reasons, but this is already too long.
 
Joined
Mar 17, 2017
Messages
97 (0.03/day)
Location
Europe
Processor Ryzen 9 9950X
Motherboard X670 chipset
Cooling Arctic Liquid Freezer III 240
Memory 64 GiB
Video Card(s) RX 6700XT
Storage WD Black SN750, Seagate FireCuda 530, Samsung SSD 850 Pro, WD Blue HDD, Seagate IronWolf HDD
Display(s) Samsung (4K, FreeSync)
Case Phanteks NEO Air
Power Supply EVGA 750 B5
Mouse Eternico wireless mouse
Keyboard HyperX Alloy Origins Core Aqua with Corsair Onyx Black keycaps
Software Linux + KVM
Most of what you say is already applicable to systems with only one core type. Note, too, that a lot of the things you take as assumptions aren't necessarily true.

You talked about a P-core that is 'half-used', by which you seem to mean one thread running on one of the two hardware threads. Will scheduling a new task on the other hardware thread end up harming the performance of the first? Maybe.

There is actually no way of knowing without analyzing exactly what each of those threads is doing. For all we know, the first thread might be stalled because it's accessing some peripheral (say, an SSD) with latency on the order of microseconds (note that at GHz clock speeds a cycle takes a nanosecond or less); if it's not using the core's resources anyway, then it likely won't matter. The same goes if, say, the first thread doesn't use FP/vector code and the second one uses it very heavily. And so on: threads can end up stalling for whatever reason (OoO execution tries to hide that, but it's not perfect), so one thread basically has all the core's resources for some time.

Also, it's not as simple as your solution seems to suggest: you are just locking threads to core types as they are spawned. One good reason, to start with, is that any modern OS will have more than a hundred threads running at the same time...

Well, there are a lot of reasons, but this is already too long.
The above arguments seem obvious from my perspective. So, I agree.
 
Joined
Mar 12, 2009
Messages
1,146 (0.20/day)
Location
SCOTLAND!
System Name Machine XX
Processor Ryzen 7600
Motherboard MSI X670E GAMING PLUS
Cooling 120mm heatsink
Memory 32GB DDR5 6000 CL30
Video Card(s) RX5700XT 8Gb
Storage 280GB Optane 900p
Display(s) 19" + 23" + 17"
Case ATX
Audio Device(s) Soundblaster Z
Power Supply 800W
Software Windows 11
Notice the same clock speed. This isn't going to have the same clock speed at all, if AMD could do a core that was half the size and had roughly the same clock speed, they would just do that...

If the 2 Zen 4 cores are loaded and the program needs more cores, then the Zen 4c cores will be the bottleneck, just like Gracemont is for Golden Cove. Or there might be cases where the OS schedules tasks to Zen 4c (which will boost to the highest clock) instead of the Zen 4 cores.

The physical implementation is different anyhow too, so who knows the effect of stuff like the different memory cell that they are using for Zen 4c.

My understanding is that Windows normally wants to send the most resource-hungry process to the core with the highest clock speed. With the 7950X3D that was a problem because the non-V-Cache cores ran faster than the ones with the cache. That's a totally different scenario from this one, where the Zen 4 cores will always clock faster than the Zen 4c cores and so will always be preferred.
 
Joined
Aug 12, 2022
Messages
253 (0.28/day)
If the Zen 4c core performed as well as the full-fat core, then there wouldn't be any full-fat cores. Obviously that is not the case: Zen 4c will be slower, so it will have the same "issues" E-cores do.
Zen 4 is a balanced core; it is designed to get the best of power efficiency, density, and clock speed from 15W notebooks to 170W desktops. Zen 4c has the same logic but rearranged for power efficiency and density. The result is that at a given lower clock speed, it needs less power than Zen 4 at that same clock speed while performing the same. But it consumes more power at higher clock speeds, so it can't clock as high. A single Zen 4 core can consume 15W, so in a 15W laptop processor, two Zen 4 cores cannot reach their maximum clock speed simultaneously. So as the number of cores in use goes up, the clock speed goes down to stay within the power and heat limits of the laptop. At some point, the clock speed will go down to a speed where Zen 4 and Zen 4c are equally efficient. Below that point, Zen 4c will actually be able to clock higher in the same power limit, or use less power so that Zen 4 can keep up.

So in a 15W laptop processor, it's quite possible and I think more likely than not that two Zen 4 plus four Zen 4c will be faster in most tasks than six Zen 4 cores.
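The crossover argument above can be illustrated with a toy model. This is only a sketch: it assumes per-core power follows `static + k * f^3`, and every coefficient and clock ceiling below is invented so that the two curves cross, as the post describes. None of these numbers are AMD data, and the model ignores the GPU, the uncore and voltage floors.

```python
from dataclasses import dataclass


@dataclass
class CoreModel:
    name: str
    static_w: float   # hypothetical fixed power per core, in watts
    dyn_coeff: float  # hypothetical watts per GHz cubed
    fmax_ghz: float   # hypothetical clock ceiling, in GHz

    def max_clock(self, budget_w):
        """Highest clock (GHz) that fits in budget_w under P = static + k*f^3."""
        if budget_w <= self.static_w:
            return 0.0
        f = ((budget_w - self.static_w) / self.dyn_coeff) ** (1.0 / 3.0)
        return min(f, self.fmax_ghz)


# Invented coefficients: the dense core is cheaper at low clocks but its
# power rises faster with frequency and it tops out earlier.
ZEN4 = CoreModel("Zen 4", static_w=0.8, dyn_coeff=0.12, fmax_ghz=5.0)
ZEN4C = CoreModel("Zen 4c", static_w=0.4, dyn_coeff=0.16, fmax_ghz=3.5)

if __name__ == "__main__":
    for budget in (1.5, 2.5, 10.0):  # watts available per core
        a, b = ZEN4.max_clock(budget), ZEN4C.max_clock(budget)
        print(f"{budget:>4} W/core: Zen 4 {a:.2f} GHz, Zen 4c {b:.2f} GHz")
```

With these made-up curves the dense core reaches the higher clock only when the per-core budget is small, which is the situation the post expects once all six cores of a 15W part are busy.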
 
Joined
Apr 12, 2013
Messages
7,613 (1.77/day)
If the 2 Zen 4 cores are loaded and the program needs more cores, then the Zen 4c cores will be the bottleneck, just like Gracemont is for Golden Cove. Or there might be cases where the OS schedules tasks to Zen 4c (which will boost to the highest clock) instead of the Zen 4 cores.
That's only because of the alleged low clock speeds of Zen 4c on servers; we don't know how they'll clock on desktops, if they're even released there.
The physical implementation is different anyhow too, so who knows the effect of stuff like the different memory cell that they are using for Zen 4c.
Why does it matter how they implement the cores? They could throw in a fake 4D effect for all I care ~ AMD Ryzen Z1 APU Features Zen 4c Cores

The only thing that matters is the performance & with a shared L3 it looks like there could be minimal difference there!
 
Joined
Jun 1, 2021
Messages
311 (0.23/day)
Why does it matter how they implement the cores? They could throw in a fake 4D effect for all I care ~ AMD Ryzen Z1 APU Features Zen 4c Cores

The only thing that matters is the performance & with a shared L3 it looks like there could be minimal difference there!
Because physical implementation is part of the performance. There are a lot of details in physical implementation, like, say, register duplication, which in many cases is much faster than going without, as you would be shortening the critical paths, routing, etc.

There is very likely a reason why they have chosen to use the 6T pseudo-dual port memory cell only for Zen 4c and not for the normal cores. Those details are likely going to bring clocks down considerably.

About the shared L3, it can also depend on whether they are doing a single CCX or not. I remember there was talk of an APU with 4 Zen 5 cores and 8 Zen 5c cores (or was it Zen 4 variants for both?) that had each of the core types in its own CCX. The new memory cells might also have higher latency; that doesn't contradict anything AMD said about Zen 4c afaik, since that isn't viewed as part of the architecture and they only said the architecture remains the same.

Either way, the result is likely that it isn't going to clock anywhere near as high as normal Zen 4, and it's also going to have different V/F curves.
 
Joined
Apr 12, 2013
Messages
7,613 (1.77/day)
And AMD can simply artificially limit the zen4 clocks if they really need to, though that would defeat the purpose of going this route in essence. Also for the current performance targets, for Z1 APU, the clocks are perfectly reasonable.

What you're saying is also not unheard of, in fact *dozer probably had some of these higher density variants on 28nm(?) IIRC which were supposedly designed to save space. It was around that timeline, though I'm not 100% sure if it were the same products.
 
Joined
Mar 6, 2018
Messages
137 (0.05/day)
And AMD can simply artificially limit the zen4 clocks if they really need to, though that would defeat the purpose of going this route in essence. Also for the current performance targets, for Z1 APU, the clocks are perfectly reasonable.

What you're saying is also not unheard of, in fact *dozer probably had some of these higher density variants on 28nm(?) IIRC which were supposedly designed to save space. It was around that timeline, though I'm not 100% sure if it were the same products.
Saving die space and thus current leakage.
 
Joined
Oct 28, 2012
Messages
1,247 (0.28/day)
Processor AMD Ryzen 3700x
Motherboard asus ROG Strix B-350I Gaming
Cooling Deepcool LS520 SE
Memory crucial ballistix 32Gb DDR4
Video Card(s) RTX 3070 FE
Storage WD sn550 1To/WD ssd sata 1To /WD black sn750 1To/Seagate 2To/WD book 4 To back-up
Display(s) LG GL850
Case Dan A4 H2O
Audio Device(s) sennheiser HD58X
Power Supply Corsair SF600
Mouse MX master 3
Keyboard Master Key Mx
Software win 11 pro
No, it isn't.

Moose muffins. There absolutely is a difference. Just because you don't understand the difference doesn't mean there isn't one. However, it is VERY complicated and I'm not going to take the time to explain it.
I just feel like this whole debate will ultimately depend on whether or not AMD decides to limit the clock of Zen 4c vs. classic Zen 4.
 
Joined
Jul 5, 2013
Messages
28,779 (6.81/day)
I just feel like this whole debate will ultimately depend on whether or not AMD decides to limit the clock of Zen 4c vs. classic Zen 4.
It seems they have already done that. The thing is, the cores are seemingly electrically and functionally the same, just clock limited. With Intel's Big/Little, the P-Cores are functionally different from the E-Cores. The E-Cores are an enhanced Atom generation CPU core, whereas the P-Cores are the new hotness. Windows has to have a different set of runtimes for one core vs. the other on Intel, whereas Windows does not need to do anything different for this new Ryzen CPU as the cores are functionally the same, just at different speeds, something far easier to manage.
 
Joined
Aug 12, 2022
Messages
253 (0.28/day)
Moose muffins. There absolutely is a difference. Just because you don't understand the difference doesn't mean there isn't one. However, it is VERY complicated and I'm not going to take the time to explain it.
The instruction set architecture, or ISA, is the language that the processor and software use to talk to one another. Software either uses a subset of the ISA that's common to all current x86 processors or it checks what instructions are available and uses what it can. Windows will move software around between P cores and E cores while the software is running, so if the software starts on a P core and starts using AVX-512, then gets moved to an E core that doesn't have it, it'll probably crash. Intel disabled all instructions in Alder Lake that weren't available to both Golden Cove (P core) and Gracemont (E core) to avoid this issue. So they have an identical ISA.
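The "checks what instructions are available and uses what it can" pattern can be sketched as follows. This is an illustrative example, assuming a Linux system where `/proc/cpuinfo` lists feature flags such as `avx2` and `avx512f`; the helper names `cpu_flags` and `pick_code_path` are hypothetical. Real software usually queries CPUID directly.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def pick_code_path(flags):
    """Choose the widest vector code path the reported flags allow."""
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "baseline"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux: fall back to the baseline path
    print("Selected code path:", pick_code_path(cpu_flags(text)))
```

Note that a check like this typically runs once at startup, which is exactly why cores with differing instruction support are a problem: if the thread later migrates to a core without the selected feature, the choice is no longer valid.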

What's different is their microarchitecture, which is the inner workings of the CPU core. Zen 4 and Zen 4c on the other hand have nearly identical microarchitectures. They're differentiated by tracing, layout, and cache, with AMD claiming that the end result is identical performance when at the same clock speed. But a Golden Cove core is a lot faster than a Gracemont core when running at the same clock speed.
 
Joined
Jun 1, 2021
Messages
311 (0.23/day)
Moose muffins. There absolutely is a difference. Just because you don't understand the difference doesn't mean there isn't one. However, it is VERY complicated and I'm not going to take the time to explain it.
No, there isn't. The ISA/feature level is exactly the same, it runs the same code.

If you are talking about performance, then yes, for sure, there is a huge difference.

Are you talking about specific MSRs? Those do seem to differ, especially for the performance counters, but that isn't meaningful in this discussion. The difference we are talking about is what each core can or cannot execute, at different performance levels. Sure, a Golden Cove is going to be much faster than a Gracemont, but so is Zen 4 vs. Zen 4c, which is where the scheduling complication arises.


Note that people have been implementing the same core in different ways for quite some time. There are plenty of A53 SoCs that have two clusters running at different clocks.
 
Joined
Jan 3, 2021
Messages
3,737 (2.52/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
I just feel like this whole debate will ultimately depend on whether or not AMD decides to limit the clock of Zen 4c vs. classic Zen 4.
It seems they have already done that. The thing is, the cores are seemingly electrically and functionally the same
Sure, AMD did limit the clock of the 4C core, but they did it at the design stage, not only after the processor was powered on. Larger transistors = faster transistors, at least on average, because the necessary size of each transistor depends on its role in the circuit. If its role is to drive some signal to many other transistors and/or over longer wires and/or at a higher speed, it needs to be larger in order to overcome the capacitances in the circuit. I'm posting a link to an article by David Kanter here again, it's not an easy read but it is very informative.
High-speed designs like the server processor tend to use more custom circuit design and larger transistors that have greater drive strength and reduced variability. In modern FinFET-based designs, this translates into more transistors with 2 fins, 3 fins, or even more. In contrast, lower-speed logic like an explicitly parallel GPU or ASICs often employ the densest transistors that use just a single fin, sacrificing clock speed to improve density. Similar to high-speed logic, ultra-low leakage transistors are often larger as well.

There is very likely a reason why they have chosen to only use the 6T pseudo-dual port memory cell for Zen 4c and not the normal cores.
They did that for the L2 cache IIRC, right?
 
Joined
Jun 14, 2020
Messages
4,137 (2.45/day)
System Name Mean machine
Processor AMD 6900HS
Memory 2x16 GB 4800C40
Video Card(s) AMD Radeon 6700S
It seems they have already done that. The thing is, the cores are seemingly electrically and functionally the same, just clock limited. With Intel's Big/Little, the P-Cores are functionally different from the E-Cores. The E-Cores are an enhanced Atom generation CPU core, whereas the P-Cores are the new hotness. Windows has to have a different set of runtimes for one core vs. the other on Intel, whereas Windows does not need to do anything different for this new Ryzen CPU as the cores are functionally the same, just at different speeds, something far easier to manage.
So if a thread is sent to the 4c core, performance will suffer, just like with ecores. Yes?
 
Joined
Jan 3, 2021
Messages
3,737 (2.52/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
So if a thread is sent to the 4c core, performance will suffer, just like with ecores. Yes?
Yes, and in a way, it's worse: performance will suffer even more if two threads are sent to a 4c core when it's not absolutely necessary. That is, when you have another free 4c core but you choose to keep it idle to save power.
 
Joined
Nov 26, 2021
Messages
1,738 (1.50/day)
Location
Mississauga, Canada
Processor Ryzen 7 5700X
Motherboard ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling Noctua NH-C14S (two fans)
Memory 2x16GB DDR4 3200
Video Card(s) Reference Vega 64
Storage Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s) Nixeus NX-EDG27, and Samsung S23A700
Case Fractal Design R5
Power Supply Seasonic PRIME TITANIUM 850W
Mouse Logitech
VR HMD Oculus Rift
Software Windows 11 Pro, and Ubuntu 20.04
I think there's much ado about nothing here. Regular processors, even ones with the same cores such as the 12600k or AMD's entire Ryzen portfolio before this SKU, already have different maximum clock speeds. In this case, it looks like the Zen 4 and Zen 4c cores share the same L3. If that's the case, then there will be no difference in IPC. However, the Zen 4c cores will clock lower than the Zen 4 cores which is a consequence of their physical design. In a power constrained scenario, that's unlikely to matter as these cores will only have threads scheduled onto them if the Zen 4 cores are occupied. In that case, the entire SOC will be running below peak clocks. Remember that Windows allocates threads in this manner:
  1. Cores get threads allocated in order of their speed with thread 1 going to core 0, thread 2 going to core 1, and so on
  2. Once the number of threads in the active task reaches the number of cores, then simultaneous multithreading kicks in and again threads are allocated in order of core speed
As can be seen from the above, for a hypothetical process that can utilize 12 threads, it'll get scheduled onto the Zen 4 cores first (2 threads) and then 4 threads will be scheduled onto the Zen 4c cores. The remaining 6 threads will also be scheduled in a similar fashion. If it spawned only 2 threads, then they will be scheduled onto the Zen 4 cores.
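The two-step ordering described above can be written out as a short sketch. It models only the rule as stated in this post, not the actual Windows scheduler, and the relative speeds assigned to the cores are placeholder values used purely for sorting.

```python
def allocation_order(cores, smt_ways=2):
    """Return (core_name, smt_slot) pairs in the order threads get placed.

    cores is a list of (name, relative_speed). Every core receives one
    thread, fastest first, before any core receives a second thread.
    """
    by_speed = sorted(cores, key=lambda core: core[1], reverse=True)
    return [(name, slot) for slot in range(smt_ways) for name, _speed in by_speed]


# Hypothetical 2 + 4 layout; the speed values are placeholders.
PHOENIX2 = [("Zen4-0", 5.0), ("Zen4-1", 5.0)] + [
    (f"Zen4c-{i}", 3.5) for i in range(4)
]

if __name__ == "__main__":
    for n, (core, slot) in enumerate(allocation_order(PHOENIX2), start=1):
        print(f"thread {n:2d} -> {core} (SMT slot {slot})")
```

For a 12-thread process this places threads 1 and 2 on the Zen 4 cores, threads 3 to 6 on the Zen 4c cores, and the remaining six on the second hardware thread of each core in the same order.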
 
Joined
Aug 12, 2022
Messages
253 (0.28/day)
So if a thread is sent to the 4c core, performance will suffer, just like with ecores. Yes?
Theoretically, if Zen 4c is more power-efficient at low power, then in a low-power device it could be faster than Zen 4, especially if the two Zen 4 cores are already busy.
 