Tuesday, April 10th 2012

Trinity (Piledriver) Integer/FP Performance Higher Than Bulldozer, Clock-for-Clock

Apr 10th, 2012 00:11 Discuss (115 Comments)

AMD's upcoming "Trinity" family of desktop and mobile accelerated processing units (APUs) will use up to four x86-64 cores based on the company's newest CPU architecture, codenamed "Piledriver". AMD conservatively estimated performance/clock improvements over current-generation "Bulldozer" architecture, with Piledriver. Citavia put next-generation A10-5800K, and A8-4500M "Trinity" desktop and notebook APUs, and pitted them against several currently-launched processors, from both AMD and Intel.

It found integer and floating-point performance increases clock-for-clock, against Bulldozer-based FX-8150. The benchmark is not multi-threaded, and hence gives us a fair idea of the per core performance. On a rather disturbing note, the performance-per-GHz figures of Piledriver are trailing far behind K12 architecture (Llano, A8-3850), let alone competitive architectures from Intel.

Source: Expreview

Add your own comment

115 Comments on Trinity (Piledriver) Integer/FP Performance Higher Than Bulldozer, Clock-for-Clock

#76

eidairaman1

The Exiled Airman

OneMoark I am unsubbing from this thread until a mod hands out bans to the flamers here

u were one of them...

#77

TheoneandonlyMrK

I actually thought it had not gone too Bad, mayhap because i posted once before this:)

i cant wait for the ins and outs of whats going on to be known in a few / year ,what will the ps3 and nextbox have etc.

im looking forward to a promising Vishera, not an awe inspireing one but id imagine with the new clock mesh tech these Apu's will OC a fair bit, given a reasonable Vreg which is only going to be on the Fm2 Platform not soo much laptops but in that form i can well imagine a modern day console type gameing experience on most pc games, maybe better/ deff better then an xbox360 game , and i can see it running well with an OC.

and with L3 ,more modules/cores and a (important)later possible stepping vishera could end up doing well, I only back Amd at any point due to the fact some are OT given i get 60-80Fps in any game on ultra(addmitedly with hybrid physx for the nv favoured games) with my main rig , an intel system may well do much better but My experience isnt as bad as some of you are making out i dont notice any wait times and these chips are going to perform better then my main rig does at min or should

#78

sergionography

theoneandonlymrkI actually thought it had not gone too Bad, mayhap because i posted once before this:)

i cant wait for the ins and outs of whats going on to be known in a few / year ,what will the ps3 and nextbox have etc.

im looking forward to a promising Vishera, not an awe inspireing one but id imagine with the new clock mesh tech these Apu's will OC a fair bit, given a reasonable Vreg which is only going to be on the Fm2 Platform not soo much laptops but in that form i can well imagine a modern day console type gameing experience on most pc games, maybe better/ deff better then an xbox360 game , and i can see it running well with an OC.

and with L3 ,more modules/cores and a (important)later possible stepping vishera could end up doing well, I only back Amd at any point due to the fact some are OT given i get 60-80Fps in any game on ultra(addmitedly with hybrid physx for the nv favoured games) with my main rig , an intel system may well do much better but My experience isnt as bad as some of you are making out i dont notice any wait times and these chips are going to perform better then my main rig does at min or should

yes thats very true about the clocks, just imagine if a quad core piledriver can do 100watt tdp at 4.2ghz with half of the chip being a gpu clocked at 800mhz, just imagine how far can the quad core piledriver cores go without the gpu, or atleast how efficient they would be
from leaks it seems piledriver is 20% than bulldozer clock-clock so it almost finaly matches phenom II ipc, but offcourse clocks much higher.
once again the phenom/phenom II story going on, with piledriver being what bulldozer was meant to be(and some... or hopefully atleast)

#79

Dent1

sergionographyfrom leaks it seems piledriver is 20% than bulldozer clock-clock so it almost finaly matches phenom II ipc, but offcourse clocks much higher.

OK stop spreading false information - I know you didn't do it deliberately but all this false information needs to be nipped in the bud.

Phenom II is not 20% faster than Bulldozer clock for clock. It's ridiculous to think that. Obviously its application dependant, but on a good day when an application favours Phenom II's architecture we are talking about maybe 5% or less or within margin for error. Overall the Bulldozer is faster.

#80

Aquinus

Resident Wat-man

Dent1OK stop spreading false information - I know you didn't do it deliberately but all this false information needs to be nipped in the bud.

Phenom II is not 20% faster than Bulldozer clock for clock. It's ridiculous to think that. Obviously its application dependant, but on a good day when an application favours Phenom II's architecture we are talking about maybe 5% or less or within margin for error. Overall the Bulldozer is faster.

He said IPC, buddy. Just because BD has a lower IPC it doesn't mean that it doesn't run faster. Once BD's IPC is down to where it was with the Phenom II, there will be a lot more performance because bulldozer clocks that much higher than the Phenom IIs did.

Also all of those tasks that do better on the P2 are single threaded tasks and unoptimized floating point applications and even in both of these cases, the performance is acceptable.

#81

Dent1

AquinusHe said IPC, buddy. Just because BD has a lower IPC it doesn't mean that it doesn't run faster. Once BD's IPC is down to where it was with the Phenom II, there will be a lot more performance because bulldozer clocks that much higher than the Phenom IIs did.

Also all of those tasks that do better on the P2 are single threaded tasks and unoptimized floating point applications and even in both of these cases, the performance is acceptable.

Yes, but those single threaded tasks don't do 20% better as sergionography implied.

#82

eidairaman1

The Exiled Airman

how about we wait for trinity to be in the hands of the reviewers and users here. Same with AM3+ Piledriver

#83

nt300

btarunrOn a rather disturbing note, the performance-per-GHz figures of Piledriver are trailing far behind K12 architecture (Llano, A8-3850), let alone competitive architectures from Intel.

Each and every design is different. Piledriver/Bulldozer is design for higher clock speed. Llano and K12 is not. Just like the Athlon 64 of past it needed less clock speed to beat out Pentium 4 that needed at least an extra 1000MHz to stay competative.

#84

Vulpesveritas

Quite true, both Llano and Trinity are performing @ about 2400 integer score/ GHZ, which is 20% lower than the i5 SB score.
So depending on pricing and the clocks for the lower end chips, Trinity may be fully competitive thanks to it's higher clocks and superior iGPU, especially with IB i3/pentium not coming out till Q3/Q4 and trinity coming out late Q1, early Q2.
And if the earlier rumor of power efficiency is true as well, with it being 15% more power efficient than llano, and given BD OC'd rather decently, the unlocked parts I feel based on what information we currently have available to us at the moment show a great budget part.

Although it is still merely speculation until Wiz gets to do a review.

#85

Aquinus

Resident Wat-man

Dent1Yes, but those single threaded tasks don't do 20% better as sergionography implied.

At the same clock speed, I bet they did. Can we get a review of a Phenom II and BD at the same clock, HTT speeds, and memory speeds? It would answer this question very quickly.

nt300This statement complete rubbish.

Stop trolling and only post if you have something useful to contribute.

#86

Steevo

nt300This statement complete rubbish.

Apparently you missed the chart, or lack the understanding to read it.

They directly compare the FP performance per clock, and the A8 series is raping the A10 and bulldozer if the chart is real.

In other words, AMD may have done nothing but tweaked the core a bit to conserve energy and increase the speed. All joking aside this new architecture is the P4 from AMD.

I keep thinking and saying their only saving grace will be GCN added to a quad core and software enhancement to offload the work to the much faster GPU, however I think they lack the manpower and drive to do it. So I am expecting mediocrity from their next chip after this too. Once they push for it, or pull back to a tweaked for efficency design they have a chance to gain the performance edge.

AMD, please, make software to support your hardware, Intel did it for years, programs would see a Intel chip and optimize performance, you can do it too. I would rather have two cores dedicated to serving data to a on bard GPU with enough stream processors and cache to run full tilt than have 8 cores total. Or do it in hardware, surely 10% die area is worth a exponential increase in performance.

#87

Aquinus

Resident Wat-man

Steevosurely 10% die area is worth a exponential increase in performance.

Bulldozer modules scales almost linearly, hyper-threading does not. I keep telling people that single-threaded applications aren't the future, they're the past. I think people are pandering about things that won't matter in the future since nothing is optimized for BD. (Applications that use FMA3 on BD actually have sizable floating point speed improvements.)

SteevoAMD, please, make software to support your hardware, Intel did it for years, programs would see a Intel chip and optimize performance, you can do it too. I would rather have two cores dedicated to serving data to a on bard GPU with enough stream processors and cache to run full tilt than have 8 cores total.

Clearly you don't know how the development model works. Why would you prepare software for cutting edge hardware that the majority of people don't have. People use technologies when it benefits them, and it benefits software companies when people have hardware that can run their software. That means requiring something like FMA3 puts people who don't have SB or BD at a loss which only hurts the consumer and the software developer.

#88

xenocide

AquinusBulldozer modules scales almost linearly, hyper-threading does not.

That's because HTing isn't intended to serve the same function. It's just there so the CPU can use previously unused resources to get some work done instead of idling. Bulldozer modules do scale well, but the problem is shit scaling linearly is still just shit. Plus it's not as though having 1,000 Cores is better than just 4 good ones for most people.

AquinusI keep telling people that single-threaded applications aren't the future, they're the past. I think people are pandering about things that won't matter in the future since nothing is optimized for BD. (Applications that use FMA3 on BD actually have sizable floating point speed improvements.)

I don't think anyone want's Single-Threaded applications, they are more like an unfortunate reality. This isn't like the Athlon X2 era when people were saying it didn't matter because only a handful of people even had Multi-Core CPU's, at this point just about everything comes with at LEAST a Dual-Core. The issue is that Bulldozer CPU's only get an edge when there are more than 4 Threads and you're using a BD CPU with more than 4 cores. Even then, having such low per-core performance usually results in the Intel CPU's winning out.

#89

Vulpesveritas

SteevoApparently you missed the chart, or lack the understanding to read it.

They directly compare the FP performance per clock, and the A8 series is raping the A10 and bulldozer if the chart is real.

In other words, AMD may have done nothing but tweaked the core a bit to conserve energy and increase the speed. All joking aside this new architecture is the P4 from AMD.

I keep thinking and saying their only saving grace will be GCN added to a quad core and software enhancement to offload the work to the much faster GPU, however I think they lack the manpower and drive to do it. So I am expecting mediocrity from their next chip after this too. Once they push for it, or pull back to a tweaked for efficency design they have a chance to gain the performance edge.

AMD, please, make software to support your hardware, Intel did it for years, programs would see a Intel chip and optimize performance, you can do it too. I would rather have two cores dedicated to serving data to a on bard GPU with enough stream processors and cache to run full tilt than have 8 cores total. Or do it in hardware, surely 10% die area is worth a exponential increase in performance.

Apparently you don't see the point of BD/PD architecture. The very idea behind it is going to sacrifice FP performance by sharing the FP unit between two cores. Given that it is integer performance that matters to what the architecture is being geared for, thus is what is being improved. Note that the integer performance is the same as Llano but clocks 33% higher. Giving the quad core unlocked trinity only a 10% lower integer performance than the i5-2500k while having a superior iGPU. Meaning that it -should- outperform a SB i3, and we won't see IB i3's until Q3/Q4 most likely, and therefore Trinity should hold a very good value spot for up to 6 months, and quite possibly remain competitive with the IB i3's thanks to it's unlocked variants and it's superior iGPU.

Why do I keep mentioning the iGPU? Because AMD's long term plan is HSA, and dumping floating point math onto the iGPU. And HSA functions -should- be available next year, with 22nm steamroller + GCN (expected to be possibly a 7750-equivalent) on die.

Oh, and earlier possible leaks show Trinity to be more power efficient than Llano as well.

Also @ genocide, 20% lower performance / clock and 10% lower performance / watt = shit? I see it as being less efficient and powerful, but it's not like it is only half as powerful. (and I'm saying that based on the slide btw. Given the A10 is a 100w part that is really a 95w CPU + GPU + 5w bridge chip and all. Bulldozer was something of a fail, Piledriver isn't looking to be quite as bad.

I have to agree on the appearance of a Phenom I / II again here.

#90

Steevo

AquinusBulldozer modules scales almost linearly, hyper-threading does not. I keep telling people that single-threaded applications aren't the future, they're the past. I think people are pandering about things that won't matter in the future since nothing is optimized for BD. (Applications that use FMA3 on BD actually have sizable floating point speed improvements.)

Clearly you don't know how the development model works. Why would you prepare software for cutting edge hardware that the majority of people don't have. People use technologies when it benefits them, and it benefits software companies when people have hardware that can run their software. That means requiring something like FMA3 puts people who don't have SB or BD at a loss which only hurts the consumer and the software developer.

At a linear rate of what? Adding more cores to processors doesn't directly improve performance as many threaded items are/have dependancies, so core 0 may be still processing a thread that core 1 needs the result of to start work. IPC is still extremely important, thinking otherwise is naive.

So first you say the way is multi-core, and now you say they shouldn't prepare for multi-core systems? This makes no sense, if we are moving to a multi-core standard (we are) we need to have hardware/software resources to support it, and if developers aren't going to do it, AMD needs to.

#91

Vulpesveritas

SteevoAt a linear rate of what? Adding more cores to processors doesn't directly improve performance as many threaded items are/have dependancies, so core 0 may be still processing a thread that core 1 needs the result of to start work. IPC is still extremely important, thinking otherwise is naive.

So first you say the way is multi-core, and now you say they shouldn't prepare for multi-core systems? This makes no sense, if we are moving to a multi-core standard (we are) we need to have hardware/software resources to support it, and if developers aren't going to do it, AMD needs to.

So given that it has 80-90% of the single core performance while scaling better, i would say there is an advantage. Technically it's single thread performance / watt that is important than pure IPC. Although they -usually- tend to go hand in hand vs clocks.

#92

Dent1

AquinusAt the same clock speed, I bet they did. Can we get a review of a Phenom II and BD at the same clock, HTT speeds, and memory speeds? It would answer this question very quickly.

I've seen many comparison reviews of the two CPUs in question, at the same clock speed, and none was anything close to 20% average increase IPC over the Bulldozer.

I'd be happy to read a review which shows that claim. If anyone has external reading material feel free to post it.

#93

sergionography

Dent1Yes, but those single threaded tasks don't do 20% better as sergionography implied.

they actualy do dude, just go and look at an fx4100 review and compare it to a phenom II 980BE, you are looking an an fx8150 which clocks up to 4.2 and has 8 cores thats why it does better than a typical quad core phenom II, but comparing a quad core bulldozer to a quad core phenom II it fails miserably
only in situations were new instructions sets are supported does bulldozer hold ground,p but in typical use its way behind clock-clock, and yes by 20% if not more
phenom II does 3ipc while bulldozer does 4ipc shared between 2 cores, and because it has such a long pipeline each cycle takes a longer time(which isnt bad because its kinda designed that way so the resources can feed the second core in the module while the first one is munching on data)but things didnt go so well and the latency is worse than expected

www.legitreviews.com/article/1766/17/

heres some of conclusion from legitreview, i wasnt talking out of my ass just so you know

"When it comes to performance we were shocked to see the AMD A8-3850 'Llano' processor and the Socket FM1 platform performing better than the AMD FX-4100 'Bulldozer' processor and the Socket AM3+ platform. We quickly found out that the FX-4100 was priced this low as it needed to be. The performance of the FX-4100 wasn't awful, but we didn't expect to see the AMD A6-3650 running at 2.6GHz to beat the AMD FX-4100 running at 3.6GHz in benchmarks like POV-Ray and Cinebench! "

#94

Steevo

VulpesveritasSo given that it has 80-90% of the single core performance while scaling better, i would say there is an advantage. Technically it's single thread performance / watt that is important than pure IPC. Although they -usually- tend to go hand in hand vs clocks.

Try between 50-90% better.

www.google.com/url?sa=t&rct=j&q=multicore%20performance%20analysis%20gain&source=web&cd=7&ved=0CFsQFjAG&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.84.921%26rep%3Drep1%26type%3Dpdf&ei=BXiIT7ryEIPEgQfuu_TBCQ&usg=AFQjCNHF-2kq4OaqEGHMZBN3z0kXh3zM8A&cad=rja

Intel's own research shows a lack of increase when needing to tie up additional resources to schedule and track data between cores, and result dependency. I agree the initial result of two to four cores is a significant increase as we can offload other threads from the core running our primary worker, or assign different processes to different cores, however the overhead cost starts degrading the performance with more cores.

I was merely asking for a hardware thread handler, and if like Nvidias "hot clocks" it can run at 2 or 4 times the core speed it could easily dispatch and track resources, even handling the offload of work to the GPU cores for faster processing. I understand the unified memory and number of threads/different type of work makes it difficult, but compared to making mediocre processors blazingly fast, what downside is there? If it added 25W of heat but was only used on enthusiast grade processors I would still buy it, as would many.

#95

Vulpesveritas

Umm BD is not 1/10th the processing power of SB, try 70-80% of the power.
That said, I'm fairly sure. Many of us would buy higher clocked chips up to 250w, and it would still sell among us. Given it's still able to be air cooled and all and most mid tower cases supporting 120mm tower coolers. Just leave a cooler out of the box, or give an option for a 155-165mm tall cooler with push-pull fans and a good design and we're set.

#96

sergionography

SteevoTry between 50-90% better.

^this

#97

Aquinus

Resident Wat-man

xenocideThat's because HTing isn't intended to serve the same function. It's just there so the CPU can use previously unused resources to get some work done instead of idling. Bulldozer modules do scale well, but the problem is shit scaling linearly is still just shit. Plus it's not as though having 1,000 Cores is better than just 4 good ones for most people.

I feel like all of my posts in this thread were just posted back to me...

SteevoI was merely asking for a hardware thread handler, and if like Nvidias "hot clocks" it can run at 2 or 4 times the core speed it could easily dispatch and track resources, even handling the offload of work to the GPU cores for faster processing. I understand the unified memory and number of threads/different type of work makes it difficult, but compared to making mediocre processors blazingly fast, what downside is there? If it added 25W of heat but was only used on enthusiast grade processors I would still buy it, as would many.

I think you're confusing how GPUs and CPUs work.

nVidia can do what they do because GPUs dispatch large workloads and runs a calculation on every shader that has data. CPUs don't work like this because you're not bulk processing the same instruction across a ton of data. You have different instructions being run, therefore what you're describing for a CPU is essentially a pipeline, which CPUs already have, but "dispatching" anything will result in less performance in single-threaded instances.

Do you know the basic 4 operations that almost any general purpose CPU does? Not to over-simplify how long a pipeline is, but basically you: LOAD, DECODE, EXECUTE, AND STORE, in that order. At this level, there is no parallelism, is's very step by step in the sense that you can't decode an instruction before you load it, you can't execute an instruction until it has been decoded, and you can't store the result after the instruction has been executed.

#98

Steevo

Yes I am aware, as I am in the process of getting my degree in computer science. C++, Networking, and other classes.

A single thread on a CPU might run the four, but if we have a hardware scheduler that reads ahead and prefetches data "branching" and then performs the decode at twice the rate, programs shaders to do the work, and then they execute it and store it in the contiguous memory pool what difference does it make if the CPU transistors do it, or if the same instruction is run 5,000 times in the program, the GPU transistors do it.

Pretty simple actually, GPU's already do 90% of this work to keep up with demand. The hardest part would be resource tracking, but again, if they solve it and the performance increase is only 25% better on average, they win.

#99

Aquinus

Resident Wat-man

SteevoA single thread on a CPU might run the four, but if we have a hardware scheduler that reads ahead and prefetches data "branching" and then performs the decode at twice the rate, programs shaders to do the work, and then they execute it and store it in the contiguous memory pool what difference does it make if the CPU transistors do it, or if the same instruction is run 5,000 times in the program, the GPU transistors do it.

Except you can't process a regular application through a pipeline like a GPU has because GPU data is all the same where a computer program has multiple different instructions per clock cycle. A GPU is given a large set of data and told to do a single task to all of it, so it does it the same way. A CPU is instruction after instruction, there isn't a whole lot that represents what the GPU can do.

A shader is small because it has a limited number of instructions it can perform and has no control mechanism, no write back. There is no concept of threads in a GPU, it is an array of one or more sets of data that will have the same operation performed on the entire set. A shader is also SIMD, not MIMD as you're describing.

Where a CPU can carry out instructions like "move 10 bytes from memory location A to memory location B," A GPU does something more like "multiply every item in the array by 1.43."

SteevoPretty simple actually, GPU's already do 90% of this work to keep up with demand. The hardest part would be resource tracking, but again, if they solve it and the performance increase is only 25% better on average, they win.

If it is so simple, why hasn't anyone else figured it out, I'm sill convinced that you don't quite know what you're talking about.

SteevoYes I am aware, as I am in the process of getting my degree in computer science. C++, Networking, and other classes.

I do have a bachelors degree in computer science not to mention I'm employed as a systems admin and a developer.

#100

sergionography

AquinusExcept you can't process a regular application through a pipeline like a GPU has because GPU data is all the same where a computer program has multiple different instructions per clock cycle. A GPU is given a large set of data and told to do a single task to all of it, so it does it the same way. A CPU is instruction after instruction, there isn't a whole lot that represents what the GPU can do.

A shader is small because it has a limited number of instructions it can perform and has no control mechanism, no write back. There is no concept of threads in a GPU, it is an array of one or more sets of data that will have the same operation performed on the entire set. A shader is also SIMD, not MIMD as you're describing.

Where a CPU can carry out instructions like "move 10 bytes from memory location A to memory location B," A GPU does something more like "multiply every item in the array by 1.43."

If it is so simple, why hasn't anyone else figured it out, I'm sill convinced that you don't quite know what you're talking about.

I do have a bachelors degree in computer science not to mention I'm employed as a systems admin and a developer.

true, unless u can get the cpu to sort things out and let the gpu do what its best. meaning the cpu can fetch and decode, then execution will be determined whether more efficient on the cpu or gpu
i think thats the approach amd is taking with apu's in the future(HSA)

Add your own comment

Trinity (Piledriver) Integer/FP Performance Higher Than Bulldozer, Clock-for-Clock

115 Comments on Trinity (Piledriver) Integer/FP Performance Higher Than Bulldozer, Clock-for-Clock

Latest GPU Drivers

New Forum Posts

Popular Reviews

TPU on YouTube

Controversial News Posts

Trinity (Piledriver) Integer/FP Performance Higher Than Bulldozer, Clock-for-Clock

Related News

115 Comments on Trinity (Piledriver) Integer/FP Performance Higher Than Bulldozer, Clock-for-Clock

Latest GPU Drivers

New Forum Posts

Popular Reviews

TPU on YouTube

Controversial News Posts