Monday, October 5th 2015

AMD Zen Features Double the Per-core Number Crunching Machinery to Predecessor

AMD's "Zen" CPU micro-architecture has a design focus on significantly increasing per-core performance, particularly per-core number-crunching performance, according to a 3DCenter.org report. It sees a near doubling of the number of decoder, ALU, and floating-point units per core, compared to its predecessor. In essence, a Zen core is AMD's idea of "what if a Steamroller module of two cores was just one big core, and supported SMT instead."

In the micro-architectures following "Bulldozer," which debuted with the company's first FX-series socket AM3+ processors, and running up to "Excavator," which will debut with the company's "Carrizo" APUs, AMD's approach to CPU cores involved modules, which packed two physical cores, with a combination of dedicated and shared resources between them. It was intended to take Intel's Core 2 idea of combining two cores into an indivisible unit further.
AMD's approach was less than stellar, and was hit by implementation problems: software would load the cores of a multi-module processor sequentially, which was less optimal than loading one core per module first, and then loading additional cores across modules. AMD's workaround tricked software (particularly OS schedulers) into thinking that a "module" was a "core" with two "threads" (e.g., an eight-core FX-8350 would be seen by software as a 4-core processor with 8 threads).

In AMD's latest approach with "Zen," the company did away with the barriers that separated two cores within a module. It's one big monolithic core, with four decoders (parts which tell the core what to do), four ALUs ("Bulldozer" had two per core), and four 128-bit wide floating-point units, grouped into two 256-bit FMACs. This approach nearly doubles the per-core number-crunching muscle. AMD also implemented an Intel-like SMT technology, which works very similarly to HyperThreading.
Source: 3DCenter.org

85 Comments on AMD Zen Features Double the Per-core Number Crunching Machinery to Predecessor

#76
Aquinus
Resident Wat-man
TheGuruStudYou're high if you think ARM can do general processing with the power that x86 has.

Go ahead and run a real app on one core of x86 and that pitiful arm chip. Guess what is going to happen? Synthetic benchmarks are more useless than ever.
Depends on the task. If most of the application's instructions are simple integer math operations for data and addresses, an ARM CPU will do pretty well, because both x86 and ARM architectures will execute instructions like this in a single cycle. The difference appears when you start considering the more complex instructions offered by CISC instruction-set CPUs like x86. Extensions like SSEx were introduced to take what would normally need several clock cycles and reduce it to a handful, if not a single cycle. However, that comes at a cost: it requires more circuitry and transistors to have the extra logic to do these more complex instructions quickly. The result is higher manufacturing costs and higher power consumption but, on the other hand, you can get significantly improved performance depending on the application.

So I won't say that ARM is crap compared to x86, because it depends on what you're doing. If you're using a browser, reading/writing email, or playing a simple game like Angry Birds, an ARM CPU is more than enough. However, if you're doing video encoding, physics processing, or really any floating-point math application, you're better off with something that can do a little more in a little less time, but that only helps you if you have the power to spare.

I just thought that a more balanced perspective on the matter was required, because neither architecture is bad; they were just designed with different things in mind, under different philosophies.
Posted on Reply
#77
Titus Joseph
To add to that, x86 is designed for larger platforms like servers, desktops, and laptops.

ARM is mobile-platform specific as of now, so the requirements, and therefore the capacity, vary. Intel's x86 mobile processors, used in the ASUS ZenFone series I'm using, are great performers, but at the cost of higher power consumption. I had to implement tonnes of tweaks to get a steady 10 hours of continuous usage from 100% to 0%.

That itself says a lot.
Posted on Reply
#78
ZeDestructor
TheGuruStudYou're high if you think ARM can do general processing with the power that x86 has.

Go ahead and run a real app on one core of x86 and that pitiful arm chip. Guess what is going to happen? Synthetic benchmarks are more useless than ever.
AquinusDepends on the task. If most of the application's instructions are simple integer math operations for data and addresses, an ARM CPU will do pretty well, because both x86 and ARM architectures will execute instructions like this in a single cycle. The difference appears when you start considering the more complex instructions offered by CISC instruction-set CPUs like x86. Extensions like SSEx were introduced to take what would normally need several clock cycles and reduce it to a handful, if not a single cycle. However, that comes at a cost: it requires more circuitry and transistors to have the extra logic to do these more complex instructions quickly. The result is higher manufacturing costs and higher power consumption but, on the other hand, you can get significantly improved performance depending on the application.

So I won't say that ARM is crap compared to x86, because it depends on what you're doing. If you're using a browser, reading/writing email, or playing a simple game like Angry Birds, an ARM CPU is more than enough. However, if you're doing video encoding, physics processing, or really any floating-point math application, you're better off with something that can do a little more in a little less time, but that only helps you if you have the power to spare.

I just thought that a more balanced perspective on the matter was required, because neither architecture is bad; they were just designed with different things in mind, under different philosophies.
Titus JosephTo add to that, x86 is designed for larger platforms like servers, desktops, and laptops.

ARM is mobile-platform specific as of now, so the requirements, and therefore the capacity, vary. Intel's x86 mobile processors, used in the ASUS ZenFone series I'm using, are great performers, but at the cost of higher power consumption. I had to implement tonnes of tweaks to get a steady 10 hours of continuous usage from 100% to 0%.

That itself says a lot.
Not exactly... comparing ARM to x86 is hard because of how easy it is to build a custom ARM SoC vs an x86 SoC, so really, you need to compare them in each segment they're in.

Server:

In serverland, you're mostly limited by whether or not you can scale horizontally (as far as CPUs go). If you can scale horizontally, the ARM chips are competitive with x86 in terms of overall/total cost, because what you trade away in per-CPU performance, you regain in being able to fit more machines in the same space (a high-end ARM SoC fits in about the same sort of space a RasPi takes, while Atom, for all its low-power-ness, still needs about twice as much space).

The result is that some companies, like Linode, are using ARM chips for their low-power use cases, and others, like Google and Facebook, are considering ARM alongside POWER.

Desktop:

ARM hasn't had a desktop chip since the original Acorn RISC machines. Still, as far as basic browser/productivity/media use goes, something like the Shield microconsole or a RasPi 2 running a better OS than Android does OK. Not amazing (mostly because of limited RAM), but OK.

Laptop:

ARM Chromebooks do about as well as low-end x86 chromebooks, but very few use high-end SoCs like Tegra K1/X1, so they lose out to their x86 brethren. Linux users also have fun with it, but in the end, nothing really beats a proper high-end Windows laptop (Like an XPS13 for example) running Linux.

Phones & Tablets:

In the phone and tablet arena, x86 has mostly been hampered by overall platform power and complexity, not performance. If you look at the landscape, one company stands above all the others: Qualcomm. Qualcomm has such a position because of its ability to pack the CPU, GPU, DSPs, ISP, modem, wifi, bluetooth, and GPS all into the same die, then strap the RAM on top of the same package. This makes the board design really, really simple, since all you have to do is wire up the sensors (camera, accelerometer, gyro, barometer, temperature), radio frontends, PMIC, screen, and digitizer, and you're done. On x86, as of right now, you have to put on the CPU/RAM package, then wire up the various wireless interfaces (cellular, GPS, and wifi/BT) to the SoC. With three extra, fairly hefty chips to put on the board, things get expensive, and idle battery usage rises a fair chunk. This is why the ZenFone 2 is the only phone using x86, and it shows compared to Qcomm. Other companies (MediaTek) also have more integrated stuff, similar to Qcomm.

As for the CISC/RISC argument... that argument sailed a loooong time ago, around when IBM (POWER), Intel (Pentium Pro/II) and ARM (ARM8/ARMv4 I think, because of their use of speculative processing... arguably even ARM7T/ARMv4T because of its pipelining) went out-of-order/speculative/superscalar, because they all decode the instructions into micro-ops that are no longer run directly. The myth kinda lived on for a while though, because Intel was, well, kinda crap at platform design compared to IBM - most of their CPUs, up until the P4 and Nehalem jumps really, were memory-starved by the FSB, and had higher power consumption than the POWER cores, which are RISC cores. That all changed when Apple replaced the hot-running PowerPC 970/G5 (POWER4) with cool-running Core chips, so much so that nobody cares about RISC vs CISC anymore outside of people writing raw assembly, and even then, only for ease-of-use arguments, not performance.

EDIT: On the subject of pipelining: longer pipelines (i.e. more stages; iirc SNB-SKL is 11-14 depending on instruction, Bulldozer is 31) let you achieve higher clockspeeds, but the longer the pipeline, the worse the penalty of a pipeline stall (from a branch mispredict causing a flush, for example). The problem with a long pipeline lies in the stall penalty going up somewhat exponentially, as Intel learnt with NetBurst, and AMD through Bulldozer (though the pipeline length isn't as dominating an issue there as it was with NetBurst...).
Posted on Reply
#79
Aquinus
Resident Wat-man
ZeDestructorAs for the CISC/RISC argument... that argument sailed a loooong time ago, around when IBM (POWER), Intel (Pentium Pro/II) and ARM (ARM8/ARMv4 I think, because of their use of speculative processing... arguably even ARM7T/ARMv4T because of its pipelining) went out-of-order/speculative/superscalar, because they all decode the instructions into micro-ops that are no longer run directly. The myth kinda lived on for a while though, because Intel was, well, kinda crap at platform design compared to IBM - most of their CPUs, up until the P4 and Nehalem jumps really, were memory-starved by the FSB, and had higher power consumption than the POWER cores, which are RISC cores. That all changed when Apple replaced the hot-running PowerPC 970/G5 (POWER4) with cool-running Core chips, so much so that nobody cares about RISC vs CISC anymore outside of people writing raw assembly, and even then, only for ease-of-use arguments, not performance.
You misunderstand what I'm saying. RISC CPUs tend to want instructions that all execute quickly, and to use those instructions for everything; that means you're not going to have instructions that are bulky or relatively slow compared to others. There is an expectation that the faster instructions will be used to do the same kind of operation. The problem is that things like SSE exist to accelerate these kinds of workloads, where a heavier-weight instruction that does multiple things at once very well can save clock cycles.

What I can tell you is that while RISC can have complex instruction sets, that's not as true for ARM-based CPUs as it may be for others like SPARC. ARM was intended to be low power and cheap, not fast and performant. That's why you don't often see ARM CPUs in clusters as you do SPARCs. So while your argument about RISC in general is true, it doesn't hold for ARM as a RISC. Not all RISCs are made equally, and I can tell you that a modern ARM core is much simpler than a modern x86 core.

Edit: Also a side note: load/store architectures tend to require more instructions to do the same thing. So, performance aside, this will result in larger applications by size, since you must do all memory operations explicitly, outside of the non-memory instructions, whereas even on the 68K you could run operations on variables in memory without explicitly loading them into CPU registers first. That is more indicative of the RISC/CISC debate, as opposed to the ARM/x86 one, which is quite a bit different.
Posted on Reply
#80
FordGT90Concept
"I go fast!1!11!1!"
CISC is much faster at specialized workloads (like decryption, encoding, and decoding) than RISC; however, RISC can reasonably compensate for that shortcoming through cheap parallelism, though doing so greatly increases the complexity of compilers/code. At the end of the day, RISC lands somewhere between CISC and the heavily parallelized workloads modern GPUs are champions of, which begs the question: why not make ARM add-in cards for x86 machines? Let the CISC of the x86 handle specialized workloads, hand off simple but heavy logic workloads to ARM, and hand off limited-logic workloads to GPUs?
Posted on Reply
#81
profoundWHALE
FordGT90ConceptCISC is much faster at specialized workloads (like decryption, encoding, and decoding) than RISC; however, RISC can reasonably compensate for that shortcoming through cheap parallelism, though doing so greatly increases the complexity of compilers/code. At the end of the day, RISC lands somewhere between CISC and the heavily parallelized workloads modern GPUs are champions of, which begs the question: why not make ARM add-in cards for x86 machines? Let the CISC of the x86 handle specialized workloads, hand off simple but heavy logic workloads to ARM, and hand off limited-logic workloads to GPUs?
So basically, take the big.LITTLE design of ARM, but instead, bigCISC.LITTLEarm + GPU?

2-4 low power ARM
4-8 high power x86
???? GPU clusters

So I'm assuming that it would run similar to how I have my OS on an SSD and Games on a large HDD. The ARM processors would handle the basic system functions (filesystem, networking, sound?), the x86 processor would handle processes that need both grunt, and have no GPU acceleration, and would handle the draw calls? Or would that be the ARM?

I'm way in over my head. I feel like I was on the right track and flew off of them.
Posted on Reply
#82
Super XP
I am not against AMD charging a bit more for the Zen desktop CPUs if and when they match and outperform the competition. But looking at AMD's past pricing, the only processors they actually overcharged for were the Quad-FX-compatible CPUs. Those were a complete ripoff.

I trust AMD will price ZEN in a fair manner.
Posted on Reply
#83
ZeDestructor
profoundWHALESo basically, take the big.LITTLE design of ARM, but instead, bigCISC.LITTLEarm + GPU?

2-4 low power ARM
4-8 high power x86
???? GPU clusters

So I'm assuming that it would run similar to how I have my OS on an SSD and Games on a large HDD. The ARM processors would handle the basic system functions (filesystem, networking, sound?), the x86 processor would handle processes that need both grunt, and have no GPU acceleration, and would handle the draw calls? Or would that be the ARM?

I'm way in over my head. I feel like I was on the right track and flew off of them.
The difference of CISC vs RISC is largely academic in these days of superscalar processing with internal microcode, and big, expensive instruction decode stages: each instruction is turned into an internal micro-op, not run directly, thus making the actual execution style identical for both types. For high-performance chips at least.

As for straight performance, it's largely gated by power and die size these days, and in those arenas, Intel holds the crown across the board. In fact, current x86 performance is so good that in terms of performance/W, Intel is ahead of all the various ARM cores, but at the cost of having higher total power consumption and heat. It is for that reason that you don't see ARM in most scale-out server deployments.

As a result, there's simply no point in having a heterogenous architecture that mixes x86 and ARM, and AMD knows that, as do Intel, Samsung, Qualcomm etc.
Posted on Reply
#84
Aquinus
Resident Wat-man
ZeDestructorThe difference of CISC vs RISC is largely academic in these days of superscalar processing with internal microcode, and big, expensive instruction decode stages: each instruction is turned into an internal micro-op, not run directly, thus making the actual execution style identical for both types. For high-performance chips at least.

As for straight performance, it's largely gated by power and die size these days, and in those arenas, Intel holds the crown across the board. In fact, current x86 performance is so good that in terms of performance/W, Intel is ahead of all the various ARM cores, but at the cost of having higher total power consumption and heat. It is for that reason that you don't see ARM in most scale-out server deployments.

As a result, there's simply no point in having a heterogenous architecture that mixes x86 and ARM, and AMD knows that, as do Intel, Samsung, Qualcomm etc.
Actually, one of the differences that still exists to this day is that most RISC CPUs don't combine regular instructions and memory operations (load/store). For example, in x86 you may have an instruction that takes two operands (say ADD), but the last operand could be either a register or a memory location. This basically means that the output of the instruction can get stored directly into memory. RISC CPUs aren't like this; in fact, you have to explicitly say "load this memory location into register n" or "store register n into this memory location". There are advantages to both methods. Separate load/store instructions allow for a simpler pipeline, because no instruction ever has to do a memory operation within the same instruction. RISC CPUs also tend to have a lot of general-purpose registers, which makes this even more feasible. Depending on the application, keeping variables in registers until a full computation is done means less pressure on the memory controller and cache, as well as faster turnaround time, since CPU registers are the fastest storage in a CPU.

I wanted to point this out because, while a lot of the differences between CISC and RISC CPUs have evaporated, some things, like the load/store bit, still tend to hold true. IIRC, RISC CPUs also tend to be much more rigid in the number of operands that can be provided to any given instruction, whereas some x86 instructions are essentially multi-arity. These small things tend to result in a shorter pipeline on ARM and other RISC CPUs compared to their x86 counterparts.
Posted on Reply
#85
ZeDestructor
AquinusActually, one of the differences that still exists to this day is that most RISC CPUs don't combine regular instructions and memory operations (load/store). For example, in x86 you may have an instruction that takes two operands (say ADD), but the last operand could be either a register or a memory location. This basically means that the output of the instruction can get stored directly into memory. RISC CPUs aren't like this; in fact, you have to explicitly say "load this memory location into register n" or "store register n into this memory location". There are advantages to both methods. Separate load/store instructions allow for a simpler pipeline, because no instruction ever has to do a memory operation within the same instruction. RISC CPUs also tend to have a lot of general-purpose registers, which makes this even more feasible. Depending on the application, keeping variables in registers until a full computation is done means less pressure on the memory controller and cache, as well as faster turnaround time, since CPU registers are the fastest storage in a CPU.

I wanted to point this out because, while a lot of the differences between CISC and RISC CPUs have evaporated, some things, like the load/store bit, still tend to hold true. IIRC, RISC CPUs also tend to be much more rigid in the number of operands that can be provided to any given instruction, whereas some x86 instructions are essentially multi-arity. These small things tend to result in a shorter pipeline on ARM and other RISC CPUs compared to their x86 counterparts.
As I understand it, those are more architectural preferences by designers than actual traits of RISC vs CISC, which is why I tend to ignore them in favour of directly comparing the number of instructions available on each and the internal implementations. I mean, sure, a tiny ARM Cortex-M4 is essentially a direct implementation of the ISA, but a high-performance POWER8 design is much closer to a modern OoO CISC design like x86, and it shows when you compare dies and power consumption to performance...
Posted on Reply