Wednesday, April 28th 2021
AMD Zen 5 "Strix Point" Processors Rumored To Feature big.LITTLE Core Design
AMD launched the 7 nm Zen 3 microarchitecture, which powers Ryzen 5000 processors, in late 2020. We expect AMD to follow this up with a 6 nm Zen 3+ later this year and a 5 nm Zen 4 in 2022. We are now beginning to receive the first rumors about the 3 nm Zen 5 architecture, which is expected to launch in 2024 in Ryzen 8000 series products. The architecture is reportedly known as "Strix Point" and will be manufactured on TSMC's 3 nm node with a big.LITTLE core design similar to the upcoming Intel Alder Lake and the Apple M1. The Strix Point lineup will consist exclusively of APUs and could feature up to 8 high-performance and 4 low-performance cores, which would be fewer than what Intel plans to offer with Alder Lake. AMD has allegedly already set graphics performance targets for the processors, which will reportedly also bring significant changes to the memory subsystem, but as these are rumors for a product three years away from launch, take them with a healthy dose of skepticism.
Sources:
MEOPC, Video Cardz
78 Comments on AMD Zen 5 "Strix Point" Processors Rumored To Feature big.LITTLE Core Design
They aren't doing one with an audio jack and one without, either.
Giving customers choice is really not Apple style, so what was your point again?
Targeting different cores based on the type of instruction isn't impossible; it has been done to some degree in recent SoCs. Apparently Samsung's chips are capable of tracking threads and executing the 64-bit ISA side of instructions on the "big" cores and the 32-bit ones on the "middle" and "small" cores. To what extent that actually happens in practice I don't know; my guess is not a whole lot, since the scheduling and context switching could be very expensive. That's not the purpose of the little cores anyway: they're not meant to offer the highest degree of energy efficiency. Their role is simply to offer lower power consumption in absolute terms, even if the overall efficiency is actually worse.
The problem with the big cores is that they leak power and become increasingly inefficient the lower the clock speed and utilization get. So it turns out that when the workload is very light, the smaller cores end up consuming less power even if the execution is much slower and less efficient, and that's very useful because burst workloads don't always matter.
If AMD goes that route, I wonder if there will still be 8, 12, or 16 big cores with smaller ones to accompany them, or if the number of big cores will be lowered.
Apple will switch to 4nm for their 2022 releases in Q4. Then Zen 4 can use 5nm sometime in 2022.
Intel 10 nm SuperFin arrives in Q3 this year, and its density is close to TSMC 7 nm.
I can't wait to see Intel 7 nm vs. AMD on TSMC 5 or 4 nm in 2023, though.
I will upgrade in 2023-2024 again, perfect timing. A big leap will happen in 2023-2024 for sure, DDR5 will have matured by then too, and PCIe 5.0 will be standard across the board. GPU prices and availability should have normalized as well. New rig incoming.
Will there be a need to run both x64 and Arm applications natively on the same system? Quite possibly, I say, as a part of the transition from x86 to Arm. The transition can take many years and never even be complete, so many apps will be available for one or the other architecture exclusively. Both will need to run efficiently, without translation, on the PC.
Will there be a way? It's fundamentally, in all caps, POSSIBLE. The x64 code would use the x64 version of system libraries, and those would avoid calling Arm libraries whenever possible. A call into Arm code would have to wake up a (possibly sleeping) Arm core, and apart from that, it would be more complicated and slower than a simple subroutine call. Arm system libraries would have to be able to process data with Arm or x64 byte order*, and that's hard. And so on. It's up to Microsoft to decide if it's worth the hassle.
As for the Raspberry ... my 4-core PC probably has more than 4 Arm cores hiding in the SSD and other peripherals, no need for another one.
* Or not? From Wikipedia: "Some instruction set architectures allow running software of either endianness on a bi-endian architecture. This includes ARM AArch64..."

Schedulers that learn and adapt to each use case. I suspect we already have them in our PCs, in some form, as scheduling is a very complex task even without heterogeneous cores (due to NUMA, hyperthreading, etc.)
en.wikipedia.org/wiki/Completely_Fair_Scheduler
I'm sure Linux has been upgraded since then, but that's what was taught in my college years, so it's the only scheduler I'm really familiar with. The Wikipedia link has a decent description: the leftmost node in the red-black tree is the task that gets to run next, i.e. the one with the least virtual runtime so far. Priorities change based on dynamic scheduling: that is, Linux is adding and subtracting from the priority number in an attempt to maximize responsiveness, throughput, and other statistics. It's pretty dumb all else considered, but these algorithms work quite well when all cores are similar.
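Just to make that idea concrete, here's a minimal C sketch of the CFS principle described above: pick whichever runnable task has accumulated the least virtual runtime. The struct, names, and plain array standing in for the kernel's red-black tree are all made up for illustration, not kernel code.

[CODE]
/* Toy illustration of the CFS idea: each runnable task accumulates
 * "virtual runtime"; the scheduler always runs the task that has run
 * the least so far (the "leftmost" node of the real rbtree). */
#include <stddef.h>

struct task {
    const char   *name;
    unsigned long vruntime;  /* weighted CPU time consumed so far, in ns */
};

/* Pick the runnable task with the smallest vruntime. */
struct task *pick_next_task(struct task *runqueue, size_t n)
{
    struct task *next = NULL;
    for (size_t i = 0; i < n; i++)
        if (!next || runqueue[i].vruntime < next->vruntime)
            next = &runqueue[i];
    return next;
}
[/CODE]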
Modern schedulers also account for "hot" cores, where the L1 / L2 / L3 is already primed with the data associated with a task (aka thread affinity), and for NUMA (the distance that data has to travel to get to RAM). There are also issues like "priority inversion": Task A is a high-priority task for some reason, Task Excel-Spreadsheet is low priority, but for some reason Task A is waiting on Task Excel-Spreadsheet, so the scheduler needs to detect this situation and temporarily increase Excel-Spreadsheet's priority so that Task A can resume quicker.
------------
I guess you can say that "schedulers" are adaptive like branch predictors and L1 caches. They follow a set of dumb rules that works in practice, allowing for basic levels of adaptation. But there's no AI here; it's just a really good set of dumb rules that's been tweaked over the past 40 years to get good results on modern processors.
Finding an optimal schedule is provably NP-complete: the only way to do it is to try all combinations of choices. Alas, if you did that, you'd spend more time scheduling than running the underlying programs! Schedulers need to run in less than ~10 microseconds to be effective (any slower, and you start taking up way more time than the underlying programs).
----------------
Honestly? I think the main solution is to just have a programmer flag. Just like with thread affinity / NUMA affinity, you can use heuristics for a "sane default" that won't really work in all cases. Any programmer who knows about modern big.LITTLE architecture can just say "allocate a little-thread" (a thread whose affinity is set to a little core) explicitly, because that programmer knows their thread works best on LITTLE for some reason.
That's how the problem is "solved" for NUMA and core affinity already. Might as well keep that solution. Then, have Windows developers go through all of the system processes, test individually which ones work better on LITTLE vs. big cores, and manually tweak the configuration of Windows until it's optimal.
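For what the "programmer flag" could look like today, here's a minimal Linux sketch that pins a background thread onto the cores the programmer believes are the little ones. Which CPU indices actually correspond to little cores is platform-specific, so CPUs 4-7 below are purely an assumption for illustration (build with -pthread).

[CODE]
/* Sketch: explicitly give a low-priority worker thread an affinity mask
 * covering only the (assumed) little cores, instead of hoping the
 * scheduler's heuristics figure it out. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *background_work(void *arg)
{
    (void)arg;
    /* housekeeping that doesn't need a big core */
    return NULL;
}

int main(void)
{
    pthread_t t;
    cpu_set_t little;

    CPU_ZERO(&little);
    for (int cpu = 4; cpu <= 7; cpu++)   /* assumed little-core IDs */
        CPU_SET(cpu, &little);

    pthread_create(&t, NULL, background_work, NULL);
    if (pthread_setaffinity_np(t, sizeof(little), &little) != 0)
        fprintf(stderr, "could not set affinity\n");
    pthread_join(t, NULL);
    return 0;
}
[/CODE]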
If you can't solve the problem in code, solve the problem with human effort. There may be thousands of Windows-processes, but you only have to do the categorization step once. Give a few good testers / developers 6 months on the problem, and you'll probably get adequate results that will improve over the next 2 years.
ARM CPUs exist in your SSD as closed systems; the OS isn't aware they exist, and that's why it works. Only the firmware and maybe the driver are aware of the ARM cores. All Zen CPUs actually have an ARM CPU in them now as part of their security platform, but again, these are closed systems that the user and OS are not aware of.
[INDENT]Zen added support for AMD's Secure Memory Encryption (SME) and AMD's Secure Encrypted Virtualization (SEV). Secure Memory Encryption is real-time memory encryption done per page table entry. Encryption occurs on a hardware AES engine and keys are managed by the onboard "Security" Processor (ARM Cortex-A5) at boot time to encrypt each page, allowing any DDR4 memory (including non-volatile varieties) to be encrypted. AMD SME also makes the contents of the memory more resistant to memory snooping and cold boot attacks[/INDENT]
As an example, the OS runs x86 with an ARM emulator that's compatible with a secondary ARM CPU.
We're gonna end up seeing more and more ARM in the mobile/laptop space. I suspect we'll see desktops emulating ARM (or supporting it in hardware) rather than ARM supporting x86.
IE: Jazelle (ARM's Java-bytecode emulator)? The x87 floating point coprocessor was relatively easy, but still kinda weird. There's also the coprocessors on Rasp. Pi (called PIO): www.raspberrypi.org/blog/what-is-pio/. Cell phones also have DSP chips, and modern x86 computers often have embedded iGPUs that are very similar to coprocessors.
They're all interesting and cool. But a giant pain in the ass. I don't think the typical mainstream programmer (or system engineer) would want to deal with this crap. Coprocessors with an alternative instruction set add a stupid amount of complexity to any project. It's absolutely doable, but... it's not really something you just do willy-nilly.
Jazelle is arguably a failure entirely. Rasp. Pi PIO is useful for GHz-level functionality, so there's a huge benefit to performance and flexibility. However, with only 128 bytes of code space and something like 32 bytes of SRAM, it's not exactly an easy coprocessor. (It's so small precisely so that it can achieve that level of capability.) iGPUs and DSPs are entirely different architectures that grossly improve performance. (Kinda like PIO: by changing the computer dramatically, the performance can be improved.)
----------
ARM and x86? They're both application-level instruction sets. In fact, ARM and x86 are so similar these days that it isn't very hard to emulate one on the other (see Rosetta). Both ARM and x86 are deeply speculative, branch-predicted, out-of-order, superscalar, pipelined cores with large register files plus a SIMD subset (some 512-bit like SVE / AVX-512, some 128-bit like SSE or NEON)... and that SIMD subset includes AES cryptography with pmull / pclmulqdq for GCM acceleration. As such, both ARM and x86 can emulate each other almost perfectly.
It's not like DSPs or iGPUs or Rasp. Pi's PIO (which are fundamentally different machine models). ARM and x86 have basically stolen each other's designs from the top down and are really, really similar these days. The only exceptions I can think of are ARM's rbit (bit reverse) and Intel's pdep / pext instructions. But pretty much every other instruction can be found in the other instruction set. (I guess ARM / Intel took different approaches to their SIMD byte-swapping routines... but we're reaching into the obscure to find differences.)
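As one small illustration of that convergence, both ISAs expose a carry-less multiply used for GHASH/GCM acceleration, and only the intrinsic names and headers differ. The wrapper name and compiler flags below (e.g. -mpclmul on x86, -march=armv8-a+crypto on AArch64) are assumptions for the sketch.

[CODE]
/* Same operation, two ISAs: a 64x64-bit carry-less multiply. */
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <wmmintrin.h>                  /* PCLMULQDQ intrinsics */
static inline __m128i clmul64(uint64_t a, uint64_t b)
{
    return _mm_clmulepi64_si128(_mm_set_epi64x(0, (long long)a),
                                _mm_set_epi64x(0, (long long)b), 0x00);
}
#elif defined(__aarch64__)
#include <arm_neon.h>                   /* PMULL (crypto extension) */
static inline poly128_t clmul64(uint64_t a, uint64_t b)
{
    return vmull_p64((poly64_t)a, (poly64_t)b);
}
#endif
[/CODE]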
And the hardware would be even more of a bloated mess, and die space is precious real estate. Maybe the front end of the CPU could be shared between the ISAs, but performance would be garbage and you'd have a ton of wasted silicon. If you didn't share the front end, then it's like two CPUs in one, and you'd need some kind of hardware mediator that knows what the different cores are doing; again, performance would be garbage and there'd be even more wasted silicon.
If ARM ever takes off on Windows, the way to do it would be essentially what Apple is doing with Rosetta. At the end of the day, all CPUs do the same thing: simple maths, and loads and stores of those values. The key would be to build the front end of your ARM CPU and your translation software together so they can efficiently translate the x86 instructions into ARM instructions, as once the instructions are broken down they are all doing the same things. To that end, the hardware could be customized to run programs written for a different ISA, but you'd never do a complete, front-to-back implementation of two different cores in one CPU.
--------------------------------------------------------------------------------------------------------------
On topic, there's a possibility for AMD if they borrow some "console architecture": a smaller RISC-style core (a modified Zen core) paired with a regular x86 Zen core, with specific I/O requests on the address bus, a crossbar on the control bus, and an enhanced Infinity Fabric serving as a wide data bus.
That programmer flag ... yes, I had a similar idea, but more like a numerical parameter that tells how much a program benefits from running on a faster core. After some thinking, I don't think that many developers and testers would be able and willing to determine that for system processes (or applications, for that matter). One would need to measure performance and power consumption, and do it under various CPU load conditions. Differences would be subtle, not great, and a system process can hardly be tested in isolation. So, instead of manual flagging (or in addition to it), the scheduler would have to consider some power-related data. The CPU cores, in turn, would have to provide that data by means of some kind of performance counters and energy meters.
A hell of a scheduler, right? How is scheduling done on Android, where cores of different sizes are the most common case?
After a brief 5-minute search (sometimes you gotta just know the right keywords), it seems that "Energy Aware Scheduler" is the current state-of-the-art for Linux (and therefore Android). community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/energy-aware-scheduling-in-linux. It seems to do what you say: measuring power consumption and running metrics to "assume" future power consumption of tasks. From there, it chooses cores which will minimize task-energy usage.
At least, state of the art for 2019. So I'll assume that's what's going on for now unless someone else tells me of a more recent Linux scheduler.
The details of the EAS are discussed here: lore.kernel.org/lkml/20181016101513.26919-1-quentin.perret@arm.com/
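As a rough, purely illustrative model of what an energy-aware placement step does (this is not the kernel's actual EAS energy model; the cost table, names, and utilization units are made up), see the C sketch below: estimate the energy of placing the task on each candidate CPU and pick the cheapest.

[CODE]
/* Toy energy-aware placement: little cores have lower cost but lower
 * capacity; big cores the reverse. Pick the CPU that minimizes the
 * estimated energy of adding the task's utilization there. */
#include <stddef.h>

struct cpu_info {
    unsigned int util;       /* current utilization, 0..capacity */
    unsigned int capacity;   /* max capacity; little < big        */
    unsigned int cost;       /* power cost at full capacity        */
};

static unsigned long estimate_energy(const struct cpu_info *c,
                                     unsigned int task_util)
{
    unsigned long busy = c->util + task_util;
    if (busy > c->capacity)
        return ~0UL;                    /* would overload this CPU */
    return (unsigned long)c->cost * busy / c->capacity;
}

size_t pick_energy_efficient_cpu(const struct cpu_info *cpus, size_t n,
                                 unsigned int task_util)
{
    size_t best = 0;
    unsigned long best_e = ~0UL;
    for (size_t i = 0; i < n; i++) {
        unsigned long e = estimate_energy(&cpus[i], task_util);
        if (e < best_e) { best_e = e; best = i; }
    }
    return best;
}
[/CODE]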
AMD also caught NVIDIA with the HD 5000 series, then proceeded to "rebrandeon" the lineup as the HD 6000s, only to get BTFO'd by Fermi 2.0 and had to rush the 6900 series to market (which never quite caught up to NVIDIA).
Or, going WAY back, they got used to releasing processors on Intel's chipsets and sockets, and figuring this would last forever, were left with their pants down when Intel refused to license Socket 370.
AMD is no stranger to falling asleep at the wheel and half-assing things. They haven't had as many chances as Intel has, but it does happen.
Warhol has been cancelled due to semiconductor shortages, and primarily in favor of Zen 4. There will be a small refresh but nothing else (Ryzen 5000 XTX). Zen 4 will be a monstrous architecture. A core count bump is very unlikely with Raphael. Zen 4 will release around Q3/Q4 2022. AMD is not going to release anything major anytime soon due to Intel's Alder Lake being a meme.
It would be good enough for HTPC (home) servers or internet/office PCs, or as a fallback if your GPU dies, or for diagnostics; anything would be better than nothing.
When will they be first to create their own new technology? Never!