Monday, October 5th 2015
AMD Zen Features Double the Per-core Number-Crunching Machinery of its Predecessor
AMD "Zen" CPU micro-architecture has a design focus on significantly increasing per-core performance, particularly per-core number-crunching performance, according to a 3DCenter.org report. It sees a near doubling of the number of decoder, ALU, and floating-point units per-core, compared to its predecessor. In essence, the a Zen core is AMD's idea of "what if a Steamroller module of two cores was just one big core, and supported SMT instead."
In the micro-architectures following "Bulldozer," which debuted with the company's first FX-series socket AM3+ processors, and running up to "Excavator," which will debut with the company's "Carrizo" APUs, AMD's approach to CPU cores involved modules, each packing two physical cores with a combination of dedicated and shared resources. It was intended to take Intel's Core 2 idea of combining two cores into an indivisible unit a step further. AMD's approach was less than stellar, and was hit by implementation problems: software would load the cores of a multi-module processor sequentially, which is less than optimal compared to loading one core per module first, and only then loading the remaining cores across modules. AMD's workaround tricked software (particularly OS schedulers) into treating a "module" as a "core" with two "threads" (e.g., an eight-core FX-8350 is seen by software as a 4-core processor with 8 threads).
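As a rough illustration of that workaround (a hypothetical sketch, not actual Windows or Linux scheduler code), the effect is to spread threads one per module before doubling up on any module's shared resources:

```c
/* Hypothetical sketch of module-aware thread placement; not actual Windows
 * or Linux scheduler code. Assumes 4 modules x 2 cores (an FX-8350-like chip). */
#include <stdio.h>

#define MODULES          4
#define CORES_PER_MODULE 2

int main(void)
{
    /* Naive placement: fill cores 0,1,2,3... in order, so the first four
     * threads pile onto only two modules and contend for shared resources. */
    printf("naive order:        ");
    for (int t = 0; t < MODULES * CORES_PER_MODULE; t++)
        printf("core%d ", t);
    printf("\n");

    /* Module-aware placement: one thread per module first, then the second
     * core of each module. Presenting a module as "1 core / 2 threads"
     * nudges the OS scheduler into exactly this order. */
    printf("module-aware order: ");
    for (int t = 0; t < MODULES * CORES_PER_MODULE; t++) {
        int module  = t % MODULES;
        int sibling = t / MODULES;
        printf("core%d ", module * CORES_PER_MODULE + sibling);
    }
    printf("\n");
    return 0;
}
```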
In AMD's latest approach with "Zen," the company did away with the barriers that separated the two cores within a module. It's one big monolithic core, with four decoders (the parts that tell the core what to do), four ALUs ("Bulldozer" had two per core), and four 128-bit wide floating-point units grouped into two 256-bit FMACs. This approach nearly doubles the per-core number-crunching muscle. AMD also implemented an Intel-like SMT technology, which works much like HyperThreading.
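As a rough sanity check of that doubling claim (a sketch only: the Bulldozer figure describes a whole module's shared FPU, and none of this is confirmed silicon data):

```c
/* Back-of-envelope peak single-precision FP math per clock. The unit counts
 * follow the report above; the comparison point (a whole Bulldozer module's
 * shared FPU) is an assumption made for illustration. */
#include <stdio.h>

int main(void)
{
    /* Zen core as reported: two 256-bit FMACs (four 128-bit FP units). */
    int zen_fmacs = 2;
    int zen_lanes = 256 / 32;                   /* SP lanes per FMAC   */
    int zen_flops = zen_fmacs * zen_lanes * 2;  /* FMA counts as 2 ops */

    /* Bulldozer module: two 128-bit FMACs shared between its two cores. */
    int bd_fmacs = 2;
    int bd_lanes = 128 / 32;
    int bd_flops = bd_fmacs * bd_lanes * 2;

    printf("Zen core:         %2d SP FLOPs/cycle peak\n", zen_flops);  /* 32 */
    printf("Bulldozer module: %2d SP FLOPs/cycle peak\n", bd_flops);   /* 16 */
    return 0;
}
```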
Source:
3DCenter.org
85 Comments on AMD Zen Features Double the Per-core Number-Crunching Machinery of its Predecessor
So I won't say that ARM is crap compared to x86, because it depends on what you're doing. If you're using a browser, reading/writing email, or playing a simple game like Angry Birds, an ARM CPU is more than enough. However, if you're doing video encoding, physics processing, or really any floating-point math application, you're better off with something that can do a little more in a little less time, but that only helps you if you have the power to spare.
I just thought that a more balanced perspective on the matter was required, because neither architecture is bad; it's just that they were designed with different things in mind, under different philosophies.
ARM is mobile-platform specific as of now, so the requirements, and therefore the capacity, vary. Intel's x86 mobile processors, used in the ASUS ZenFone series I'm using, are great performers, but at the cost of higher power consumption. I had to implement tonnes of tweaks to get a steady 10 hours of continuous usage from 100% to 0%.
That itself says a lot.
Server:
In serverland, you're mostly limited by whether or not you can scale horizontally (as far as CPUs go). If you can scale horizontally, the ARM chips are competitive with x86 in terms of overall/total cost, because what you trade away in per-CPU performance, you regain by being able to fit more machines in the same space (a high-end ARM SoC fits in about the same sort of space a RasPi takes, while Atom, for all its low-power-ness, still needs about twice as much space).
The result is that some companies, like Linode, are using ARM chips for their low-power use cases, and others, like Google and Facebook, are considering ARM alongside POWER.
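A toy back-of-the-envelope of that density trade-off, with entirely made-up numbers:

```c
/* Toy rack-density math with entirely made-up numbers, only to illustrate
 * the horizontal-scaling trade-off described above; nothing here is a real
 * product spec or benchmark. */
#include <stdio.h>

int main(void)
{
    double x86_perf_per_node = 100.0;   /* arbitrary performance units */
    double x86_nodes_per_u   = 1.0;     /* nodes per rack unit         */
    double arm_perf_per_node = 30.0;
    double arm_nodes_per_u   = 4.0;

    printf("x86: %.0f perf per U\n", x86_perf_per_node * x86_nodes_per_u);
    printf("ARM: %.0f perf per U\n", arm_perf_per_node * arm_nodes_per_u);

    /* If the workload scales horizontally, weaker-but-denser nodes can land
     * in the same ballpark per rack unit, which is the whole argument. */
    return 0;
}
```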
Desktop:
ARM hasn't had a desktop chip since the original Acorn RISC machines. Still, as far as basic browser/productivity/media use goes, something like the Shield microconsole or a RasPi 2 running a better OS than Android does OK. Not amazing (mostly because of limited RAM), but OK.
Laptop:
ARM Chromebooks do about as well as low-end x86 Chromebooks, but very few use high-end SoCs like the Tegra K1/X1, so they lose out to their x86 brethren. Linux users also have fun with them, but in the end, nothing really beats a proper high-end Windows laptop (like an XPS 13, for example) running Linux.
Phones & Tablets:
In the phone and tablet arena, x86 has mostly been hampered by overall platform power and complexity, not performance. If you look at the landscape, one company stands above all the others: Qualcomm. Qualcomm has such a position because of its ability to pack the CPU, GPU, DSPs, ISP, modem, wifi, bluetooth, and GPS all into the same die, then strap the RAM on top of the same package. This makes the board design really, really simple, since all you have to do is wire up the sensors (camera, accelerometer, gyro, barometer, temperature), radio frontends, PMIC, screen, and digitizer, and you're done. On x86, as of right now, you have to put on the CPU/RAM package, then wire up the various wireless interfaces (cellular, GPS, and wifi/BT) to the SoC. With three extra, fairly hefty chips to put on the board, things get expensive and idle battery usage rises by a fair chunk. This is why the ZenFone 2 is the only phone using x86, and it shows compared to Qualcomm. Other companies (MediaTek) also have more integrated offerings, similar to Qualcomm's.
As for the CISC/RISC argument... that argument sailed a loooong time ago, around when IBM (POWER), Intel (Pentium Pro/II) and ARM (ARM8/ARMv4 I think, because of their use of speculative processing... arguably even ARM7T/ARMv4T because of its pipelining) went out-of-order/speculative/superscalar, because they all decode the instructions into micro-ops, which are no longer run directly. The myth kinda lived on for a while though, because Intel was, well, kinda crap at platform design compared to IBM: most of their CPUs, up until the P4 and Nehalem jumps really, were memory-starved by the FSB, and had higher power consumption than the POWER cores, which are RISC cores. That all changed when Apple replaced the hot-running PowerPC 970/G5 (POWER4) with cool-running Core chips, so much so that nobody cares about RISC vs CISC anymore outside of people writing raw assembly, and even then, only for ease-of-use arguments, not performance.
EDIT: On the subject of pipelining: longer pipelines (i.e., more stages; iirc SNB-SKL is 11-14 depending on instruction, Bulldozer is 31) let you achieve higher clock speeds, but the longer the pipeline, the worse the penalty of a pipeline stall (from a branch mispredict causing a flush, for example). The problem with a long pipeline lies in the stall penalty climbing steeply with depth, as Intel learnt with NetBurst, and AMD through Bulldozer (though the pipeline length isn't as dominating an issue there as it was with NetBurst...).
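A simplified way to see why that penalty matters (a sketch that assumes a 1-instruction-per-cycle baseline and a full flush on every mispredict; the mispredict rates are made up):

```c
/* Simplified stall-cost model: effective CPI = 1 + (mispredict rate x flush
 * penalty), treating the flush penalty as roughly the pipeline depth. The
 * depths are the ballpark figures mentioned above; the mispredict rates are
 * assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    int    depths[] = { 14, 31 };              /* ~SNB-class vs ~Bulldozer    */
    double rates[]  = { 0.01, 0.02, 0.05 };    /* mispredicts per instruction */

    for (int d = 0; d < 2; d++)
        for (int r = 0; r < 3; r++)
            printf("depth %2d, mispredict %.0f%%: effective CPI %.2f\n",
                   depths[d], rates[r] * 100.0,
                   1.0 + rates[r] * depths[d]);
    return 0;
}
```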
What I can tell you is that while RISC designs can have complex instruction sets, that's not as true for ARM-based CPUs as it may be for others like SPARC. ARM was intended to be low power and cheap, not fast and performant. That's why you don't often see ARM CPUs in clusters the way you do SPARCs. So while your argument about RISC in general is true, it doesn't hold for ARM as a RISC. Not all RISCs are made equally, and I can tell you that a modern ARM core is much simpler than a modern x86 core.
Edit: Also a side note: load/store architectures tend to require more instructions to do the same thing. So, performance aside, this results in larger application binaries, since you must do all memory operations explicitly, outside of the non-memory instructions. Even on the 68K, you could run operations on variables in memory without explicitly loading them into CPU registers before acting on them. That is more indicative of the RISC/CISC debate, as opposed to the ARM/x86 comparison, which is quite a bit different.
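A tiny example of that difference, with the would-be instruction sequences sketched in comments (illustrative only, not real compiler output):

```c
/* Small illustration of the load/store point. On x86 a compiler can often
 * fold the memory access into the arithmetic instruction, while a load/store
 * ISA such as ARM needs separate load, modify, and store instructions. The
 * assembly in the comments is illustrative pseudo-output. */
#include <stdio.h>

static int counter;

static void bump(void)
{
    counter += 5;
    /* x86 (one instruction, memory operand):  add dword ptr [counter], 5  */
    /* ARM-style load/store (three):           ldr r0, [counter]           */
    /*                                         add r0, r0, #5              */
    /*                                         str r0, [counter]           */
}

int main(void)
{
    bump();
    printf("counter = %d\n", counter);
    return 0;
}
```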
2-4 low power ARM
4-8 high power x86
???? GPU clusters
So I'm assuming that it would run similarly to how I have my OS on an SSD and games on a large HDD. The ARM processors would handle the basic system functions (filesystem, networking, sound?), the x86 processor would handle processes that need grunt and have no GPU acceleration, and would handle the draw calls? Or would that be the ARM?
I'm way in over my head. I feel like I was on the right track and then flew right off of it.
I trust AMD will price ZEN in a fair manner.
As for straight performance, it's largely gated by power and die size these days, and in those arenas, Intel holds the crown across the board. In fact, current x86 performance is so good that in terms of performance/W, Intel is ahead of all the various ARM cores, but at the cost of having higher total power consumption and heat. It is for that reason that you don't see ARM in most scale-out server deployments.
As a result, there's simply no point in having a heterogeneous architecture that mixes x86 and ARM, and AMD knows that, as do Intel, Samsung, Qualcomm, etc.
I wanted to point this out because, while a lot of the differences between CISC and RISC CPUs have evaporated, some things, like the load/store bit, still tend to hold true. IIRC, RISC CPUs also tend to be much more rigid about the number of operands any given instruction can take, whereas some x86 instructions are essentially multi-arity. These small things tend to result in a smaller, simpler pipeline on ARM and other RISC CPUs compared to their x86 counterparts.