# Trinity (Piledriver) Integer/FP Performance Higher Than Bulldozer, Clock-for-Clock



## btarunr (Apr 10, 2012)

AMD's upcoming "Trinity" family of desktop and mobile accelerated processing units (APUs) will use up to four x86-64 cores based on the company's newest CPU architecture, codenamed "Piledriver". AMD has conservatively estimated Piledriver's performance-per-clock improvements over the current-generation "Bulldozer" architecture. Citavia took the next-generation A10-5800K and A8-4500M "Trinity" desktop and notebook APUs and pitted them against several currently launched processors from both AMD and Intel. 

It found integer and floating-point performance increases clock-for-clock against the Bulldozer-based FX-8150. The benchmark is not multi-threaded, and hence gives a fair idea of per-core performance. On a rather disturbing note, the performance-per-GHz figures of Piledriver trail far behind the K10 "Stars" architecture (Llano, A8-3850), let alone competitive architectures from Intel. 







----------



## xenocide (Apr 10, 2012)

That table is odd. Why are several CPUs listed twice when others are only listed once?


----------



## Oberon (Apr 10, 2012)

> On a rather disturbing note, the performance-per-GHz figures of Piledriver trail far behind the K10 "Stars" architecture (Llano, A8-3850), let alone competitive architectures from Intel.



*sigh* More "RAWR, NEED MOAR IPC" nonsense.

Question: If I produced a CPU with half the throughput per clock of BD but clocked it five times higher, would anyone really complain about IPC? Didn't think so. Unless you're doing architectural comparisons between broadly similar architectures, nobody _actually_ cares about IPC.


----------



## slyfox2151 (Apr 10, 2012)

Oberon said:


> *sigh* More "RAWR, NEED MOAR IPC" nonsense.
> 
> Question: If I produced a CPU with half the throughput per clock of BD but clocked it five times higher, would anyone really complain about IPC? Didn't think so. Unless you're doing architectural comparisons between broadly similar architectures, nobody _actually_ cares about IPC.



Except it doesn't clock 5x as high... AMD ships at roughly the same clocks as Intel... so yes, IPC does matter.


----------



## BeepBeep2 (Apr 10, 2012)

xenocide said:


> That table is odd. Why are several CPUs listed twice when others are only listed once?


Because BOINC is rather inconsistent...
Basically, the chart shows FP went up 0-5% and integer went up about 10% at the same clock.

Trinity has no L3 cache, so add about 5% on top when Vishera comes around: roughly 5-10% FP and 15% integer at the same clock.

Now, if they can get stock clock speeds for Vishera desktop products up 200-400 MHz over current BD (think 3.8-4 GHz with a 4.6 GHz turbo), maybe they have a product of worth, with ~10-20% better single-thread performance than Bulldozer.


----------



## NC37 (Apr 10, 2012)

Certainly has my interest for mobile if the price is right, but I'm leaning more and more towards an Intel Ivy Bridge build on the desktop side.


----------



## Oberon (Apr 10, 2012)

slyfox2151 said:


> Except it doesn't clock 5x as high... AMD ships at roughly the same clocks as Intel... so yes, IPC does matter.



In the face of EVERY OTHER difference? No, not really.


----------



## xenocide (Apr 10, 2012)

Oberon said:


> nobody _actually_ cares about IPC.



Except for all those people who make a living doing stuff like encoding and rendering, and all those people that play games.


----------



## Vulpesveritas (Apr 10, 2012)

xenocide said:


> Except for all those people who make a living doing stuff like encoding and rendering, and all those people that play games.



Well, no, not really. If it had half the IPC and 5x the clocks at the same wattage, then it is still 2.5x faster per watt, which is really what people who do things like encoding, rendering, and gaming mostly care about: what is the fastest chip I can get at a reasonable cost within a reasonable TDP limit?
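
The 2.5x figure above can be sanity-checked with a trivial sketch. The 0.5x IPC, 5x clock, and same-wattage numbers are the post's hypotheticals, not measurements of any real chip, and throughput is modelled naively as IPC times clock:

```python
# Hypothetical ratios from the post above; throughput is modelled as
# IPC x clock, which ignores memory effects and other real-world limits.
def relative_throughput(ipc_ratio, clock_ratio):
    return ipc_ratio * clock_ratio

def relative_perf_per_watt(ipc_ratio, clock_ratio, power_ratio):
    return relative_throughput(ipc_ratio, clock_ratio) / power_ratio

# Half the IPC, five times the clock, same wattage:
print(relative_perf_per_watt(0.5, 5.0, 1.0))  # -> 2.5
```

The catch, of course, is the `power_ratio = 1.0` assumption: in practice power rises superlinearly with clock, which is exactly the Prescott objection raised later in the thread.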

That said, curse you, lack of L3 cache. Still looking to be decent though; you basically get the equivalent of what, a 4.3-4.5 GHz Bulldozer, so roughly a 3 GHz Sandy Bridge or thereabouts. Though I must say I was expecting more IPC. Ah well.

How much longer till we see 3rd party reviews again?


----------



## Zubasa (Apr 10, 2012)

Oberon said:


> In the face of EVERY OTHER difference? No, not really.


Like what difference? The fact that Bulldozer has crap performance/watt as well?
If they managed to get enough performance/watt and clock speed I wouldn't care too much about IPC, but the fact is they could not.


----------



## Xajel (Apr 10, 2012)

Oberon said:


> *sigh* More "RAWR, NEED MOAR IPC" nonsense.
> 
> Question: If I produced a CPU with half the throughput per clock of BD but clocked it five times higher, would anyone really complain about IPC? Didn't think so. Unless you're doing architectural comparisons between broadly similar architectures, nobody _actually_ cares about IPC.



Yes, and if you clocked it five times higher, TDP and temperatures would go through the roof... just take a look back at Prescott!


----------



## Vulpesveritas (Apr 10, 2012)

Zubasa said:


> Like what difference? The fact that Bulldozer has crap performance/watt as well?
> If they managed to get enough performance/watt and clock speed I wouldn't care too much about IPC, but the fact is they could not.



Which says nothing about Piledriver; if the leaks we have seen are real, it is more power efficient than Llano. 
Which is the hope that remains.


----------



## Zubasa (Apr 10, 2012)

Vulpesveritas said:


> Which says nothing about Piledriver; if the leaks we have seen are real, it is more power efficient than Llano.
> Which is the hope that remains.


That is a lot of "ifs"; I wouldn't hold my breath given that Piledriver is Bulldozer-based.
I would love to grab a Trinity notebook if it turns out to be as good as you hope it will be.


----------



## OneMoar (Apr 10, 2012)

Oberon said:


> In the face of EVERY OTHER difference? No, not really.



You mean the fact that Bulldozer runs hotter, is slower, uses more power, and costs more than the now three-year-old Deneb core? The fact that AMD has done nothing but hype and bullshit people lately, only to fail to deliver when it matters? And that they continue to attempt to bluff their way out of a hole they dug because of bad management?
If so, RIGHT ON!
I used to be an AMD guy, then I read a very interesting post by an ex-employee.
AMD needs to get their shit in gear or, put simply, they won't be around much longer. At AMD's current rate of screw-ups, I estimate they could be gone in as little as 3 or 4 years.
ARM is gaining ground and will gain even more of a presence with the release of Windows 8.
Intel's new chips are getting into the sub-50-watt range and are still kicking the shit out of AMD's top-end parts, and their iGPUs get faster with every revision; the AMD 7-series was a JOKE. Yeah... they're in trouble.
And Trinity isn't looking like it's gonna pull AMD's ass out of the fire...
AMD already has limited financial resources, and they are bleeding out FAST.


----------



## eidairaman1 (Apr 10, 2012)

This is about Trinity, not Bulldozer. Keep it on track, people.


----------



## seronx (Apr 10, 2012)

I like how everyone lies...


----------



## Vulpesveritas (Apr 10, 2012)

Zubasa said:


> That is a lot of "ifs"; I wouldn't hold my breath given that Piledriver is Bulldozer-based.
> I would love to grab a Trinity notebook if it turns out to be as good as you hope it will be.



Correct me if I'm wrong, but isn't there supposed to be a 3.6 GHz 65 W A10 SKU,
with that integrated 7660 GPU at 800 MHz?
Lol. Wonder how those 35 W parts will turn out in the end.


----------



## Over_Lord (Apr 10, 2012)

Barely anything. I just want the GPU. Trinity, bring it on!


----------



## eidairaman1 (Apr 10, 2012)

Instead of trying to pass judgement on something, why don't we wait till it comes out instead of bashing it, ya know? Reviews here are legit.


----------



## OneMoar (Apr 10, 2012)

eidairaman1 said:


> Instead of trying to pass judgement on something, why don't we wait till it comes out instead of bashing it, ya know? Reviews here are legit.



that is a little late 
personally at this stage I think we(the consumer) need to make it VERY CLEAR to AMD that son is disappoint and they need to step it up or step out 
AMD kept a fairly tight lid on bulldozer probably because they knew they where not gonna be able to keep up with there hype and it backfired ... 
when a manufacturer doesn't talk to you about there product's in development  it means one of two things 
either they are making some really great and don't want to let the competition in on it 
OR more often then not they are having trouble and are hoping you wont notice
the problem here is the majority of AMD's consumer base happen to know what the fuck they are talking about and wont have the wool pulled over there eyes very easily


----------



## Melvis (Apr 10, 2012)

All I want is an 8-core or 10-core or whatever it may be to perform around a 2600K or a little more, and I'd be very happy to upgrade to that; if not, then I too might move to Intel =/


----------



## seronx (Apr 10, 2012)

Melvis said:


> All I want is an 8-core or 10-core or whatever it may be to perform around a 2600K or a little more, and I'd be very happy to upgrade to that; if not, then I too might move to Intel =/


:shadedshu , It's like you don't even know how FX eight cores even score...


----------



## OneMoar (Apr 10, 2012)

Intel's logic:
great per-core performance > let's add some more of them.
AMD's logic: MEH, let's just throw some fake cores at the problem and hope it helps.
Kk, enough AMD bashing for one night, off to bed with me.


----------



## Vulpesveritas (Apr 10, 2012)

seronx said:


> :shadedshu , It's like you don't even know how FX eight cores even score...


Maybe he is looking for a rendering PC that won't break the bank and prefers AMD?


----------



## Thefumigator (Apr 10, 2012)

OneMoar said:


> Intel's logic:
> great per-core performance > let's add some more of them.
> AMD's logic: MEH, let's just throw some fake cores at the problem and hope it helps.
> Kk, enough AMD bashing for one night, off to bed with me.



Intel's logic is "let's punish resellers and OEMs with various threats so they stop selling AMD CPUs until we have a competitive core; then, as the design will be superior, it will sell itself; then let's improve the cores steadily and flawlessly, and then let's add some more of them. If they put us on trial for illegal practices, we just pay the fine and move on."

AMD's logic is "OK, we lost several billion dollars along the way since 2003 because of Intel's illegal practices; we could have done things better if those millions had been available earlier. So instead of paying extra (unavailable) cash for a bunch of talented engineers, we hired a bunch of guys who barely knew what happened with NetBurst, they made Bulldozer, and it took them several years to develop. Of course we got the money from Intel after the trial, but it came late, when the damage was already done. Now we have to play catch-up, so Intel's strategy worked like a charm." :shadedshu "Of course, buying ATI is a long-term investment that is saving our asses today, and if we hadn't sold Spansion we would have had some serious extra revenue. Driving the company in the wrong direction also cost us millions. And that's not to mention the world economic crisis, which also has its side effects."


----------



## HumanSmoke (Apr 10, 2012)

eidairaman1 said:


> Instead of trying to pass judgement on something, why don't we wait till it comes out instead of bashing it, ya know? Reviews here are legit.


Yup. Hate all that passing-judgement-before-the-launch BS...


eidairaman1 said:


> Unbearably tired and yawning at the talk of how great Kepler is. Reminds me of how effective the POTUS's talks are at spurring the economy, which are lies.



/amazing what 90 seconds of search brings up

Personally, I won't believe a word about Trinity/Piledriver core performance until I hear it from John Fruehe... oh wait


eidairaman1 said:


> your post here is offtopic


Happy now?


----------



## eidairaman1 (Apr 10, 2012)

your post here is offtopic


----------



## Tatty_One (Apr 10, 2012)

OneMoar said:


> that is a little late
> personally at this stage I think we(the consumer) need to make it VERY CLEAR to AMD that son is disappoint and they need to step it up or step out
> AMD kept a fairly tight lid on bulldozer probably because they knew they where not gonna be able to keep up with there hype and it backfired ...
> when a manufacturer doesn't talk to you about there product's in development  it means one of two things
> ...



I agree with much of what you have said here. Surprisingly, however, AMD's market share has actually increased since Bulldozer was released (Q4 2011), more so in the server market (although full-year figures show a slight decrease in Intel's market share with a slight increase in AMD's, across all form factors). Albeit not by much; interestingly, though, Intel's dominance was reduced by the same amount as AMD's gain, suggesting at least that AMD is managing to appeal to some OEMs, if only a minority.


----------



## Dent1 (Apr 10, 2012)

IMO this article is fail. 

Trying to compare a budget laptop quadcore Trinity APU to a high end desktop octocore Bulldozer is utter fail.

Until I see a desktop Piledriver, ungimped and with full cache, I'm not passing any judgement.

Edit:

Also, it's great that a low-end Trinity is spanking Bulldozer, which is promising for the full-blown desktop Piledriver.


----------



## xenocide (Apr 10, 2012)

Dent1 said:


> IMO this article is fail.
> 
> Trying to compare a budget laptop quadcore Trinity APU to a high end desktop octocore Bulldozer is utter fail.
> 
> ...



It's more or less to allow a frame of reference. Trinity APUs are going to use the same Piledriver cores as the FX line, just without the L3 cache and with minor tweaks, as I understand it. So, assuming you can get a nice 5% bump from the cache and tweaks: if the PD cores in APUs are 10-20% faster than BD cores, then the PD cores in desktops should be 15-25% faster than the BD cores in desktops.

It's not so much comparing laptop APUs to desktop CPUs as it is comparing what's under the hood and what it could mean for future desktop CPUs from AMD.
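
A back-of-envelope sketch of how that estimate compounds. The ~5% L3/tweak bump is the post's assumption; the sketch stacks it multiplicatively on the 10-20% core uplift (the post simply adds the percentages, which is nearly the same for gains this small):

```python
def stacked_gain(core_gain, tweak_gain=0.05):
    """Combine two fractional speed-ups multiplicatively."""
    return (1 + core_gain) * (1 + tweak_gain) - 1

for g in (0.10, 0.20):
    print(f"{g:.0%} core uplift -> {stacked_gain(g):.1%} total")
# 10% core uplift -> 15.5% total
# 20% core uplift -> 26.0% total
```

So multiplicative stacking gives roughly 15.5-26%, in line with the 15-25% quoted above.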


----------



## eidairaman1 (Apr 10, 2012)

xenocide said:


> It's more or less to allow a frame of reference. Trinity APUs are going to use the same Piledriver cores as the FX line, just without the L3 cache and with minor tweaks, as I understand it. So, assuming you can get a nice 5% bump from the cache and tweaks: if the PD cores in APUs are 10-20% faster than BD cores, then the PD cores in desktops should be 15-25% faster than the BD cores in desktops.
> 
> It's not so much comparing laptop APUs to desktop CPUs as it is comparing what's under the hood and what it could mean for future desktop CPUs from AMD.




Your post makes sense! 

15-25% is definitely a better-than-nothing gain, considering the BD-FX patching for Win 7 didn't do anything, BD-FX being a server chip under the hood. Piledriver and Trinity may very well be aimed squarely at the desktop user, and not at a workstation/server environment like BD-FX and the Opteron 3200/4200/6200 are (despite the 3200 apparently being faster than BD-FX).


----------



## joyman (Apr 10, 2012)

The improvements look OK; I just hope the price also moves in the right direction, so I can finally replace my 960T with some nice Piledriver chip.


----------



## Hustler (Apr 10, 2012)

xenocide said:


> then the PD Cores in desktops should be 15-25% faster than the BD cores in desktops.



So what...that will just bring them up to Phenom II levels of performance (maybe slightly faster)...still an utter fail.


----------



## ensabrenoir (Apr 10, 2012)

Anything BD-related = flame war... those words should not be used together. Hoping PD is actually better... a pipe dream at most, because it'll make Intel release something even more awesome. I figure they have several aces they haven't played because... well, you know, no reason to, no competition. Anyway, carry on with the otherworldly performance-increase fantasies.


----------



## xenocide (Apr 10, 2012)

Hustler said:


> So what...that will just bring them up to Phenom II levels of performance (maybe slightly faster)...still an utter fail.



Phenom II-level per-thread performance, with substantially better overclocking, a greatly improved IMC, and support for tons of new instruction sets makes for a pretty solid chip, especially if the price stays low and they can get the power consumption under control. It means the AMD FX line is no longer 13 junk CPUs plus the FX-8120, which is a steal.


----------



## Fourstaff (Apr 10, 2012)

Still waiting for AMD to improve per-thread performance and power consumption; without these two, Intel will almost certainly get my custom instead.


----------



## Kärlekstrollet (Apr 10, 2012)

A high IPC isn't always that great; what matters is a *good* IPC that's energy efficient and scales well with overclocking.
In the end, if Trinity outperforms Llano and draws less power, mission accomplished, AMD.


----------



## Melvis (Apr 10, 2012)

seronx said:


> :shadedshu , It's like you don't even know how FX eight cores even score...



Ummm, I think I do, and the 50-odd benchmarks around the web show it losing to the 2600K 95% of the time, so I don't know what you're trying to say here, sorry?

And if you ask whether I have ever played with Bulldozer, the answer is yes, an 8120, so I know how they work and perform.

Bring it up to par with the 2600K in most benchmarks/games etc. and I'd be happy. That's all I'm asking for; is that too much?


----------



## zenlaserman (Apr 10, 2012)

seronx said:


> I like how everyone lies...



I like how you constantly delude yourself.

What I don't like is how you try to spread it like a wet gremlin.


----------



## Atom_Anti (Apr 10, 2012)

Guys, good balanced CPU / GPU performance is more important than just a very fast CPU.


----------



## Fourstaff (Apr 10, 2012)

Atom_Anti said:


> Guys, good balanced CPU / GPU performance is more important than just a very fast CPU.



Which is why you get a fast CPU coupled with a fast GPU, rather than caring about APUs.


----------



## wiak (Apr 10, 2012)

Well, the reason Trinity and Llano use more power is because they have real graphics that can be compared to a mid-range Radeon HD.


----------



## xenocide (Apr 10, 2012)

wiak said:


> Well, the reason Trinity and Llano use more power is because they have real graphics that can be compared to a mid-range Radeon HD.



Llano didn't have any "mid-range" iGPUs. If I recall, the best iGPU was on par with about an HD 6450, or under best circumstances a 5570. That's entry-level at best. The Trinity APUs have a 7660D, which means it should be substantially better, but still nothing compared to what most people classify as "mid-range".


----------



## Aquinus (Apr 10, 2012)

xenocide said:


> Llano didn't have any "mid-range" iGPUs. If I recall, the best iGPU was on par with about an HD 6450, or under best circumstances a 5570. That's entry-level at best. The Trinity APUs have a 7660D, which means it should be substantially better, but still nothing compared to what most people classify as "mid-range".



It's better than an Intel HD 3000 iGPU.


----------



## Fourstaff (Apr 10, 2012)

Aquinus said:


> It's better than an Intel HD 3000 iGPU.



That would be a pointless argument, since we expected Llano to be able to run newer games at decent quality, whereas if we are getting Intel's chip we know that we must get a discrete card. I think 3rd-gen APUs will be powerful enough to be considered an alternative to discrete, but as of now APUs are only powerful enough to run Counter-Strike and FarmVille, with Trinity powerful enough to run everything on low.


----------



## Aquinus (Apr 10, 2012)

Fourstaff said:


> That would be a pointless argument, since we expected Llano to be able to run newer games at decent quality, whereas if we are getting Intel's chip we know that we must get a discrete card. I think 3rd-gen APUs will be powerful enough to be considered an alternative to discrete, but as of now APUs are only powerful enough to run Counter-Strike and FarmVille, with Trinity powerful enough to run everything on low.



...but that is what you pay for when you get a CPU + GPU for 100 USD. For how much you're paying, it's a good bargain, not a high-end solution.


----------



## alwayssts (Apr 10, 2012)

OneMoar said:


> ...more often then not they are having trouble and are hoping you wont notice the problem here is the majority of AMD's consumer base happen to know what the fuck they are talking about and wont have the wool pulled over there eyes very easily



First I was like: Holy fuck, punctuation rape Batman.

Then I was like: Man makes a very valid point.


----------



## Hustler (Apr 10, 2012)

xenocide said:


> Phenom II-level per-thread performance, with substantially better overclocking



Well, seeing as Bulldozer has to run at about 4.2 GHz just to match my Phenom II at 3.8 GHz, the fact that the average Bulldozer only goes up to about 4.5 GHz (with non-exotic cooling methods) will still only make it about 10% faster, whilst at the same time using loads more power for very little return.

Even with these new refined PD cores @ 4.5 GHz, I would only be looking at about a 25% overall improvement over my Phenom II, whereas a 2500K @ 4.5 GHz would be nearer 40-50% faster than my Phenom II.
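
That comparison can be restated as clock times per-clock throughput. The 4.2, 3.8, and 4.5 GHz figures are the post's own; normalising the Phenom II's per-clock throughput to 1.0 is an assumption for illustration:

```python
def perf(clock_ghz, per_clock):
    """Simple model: performance = clock x per-clock throughput."""
    return clock_ghz * per_clock

phenom_ii = perf(3.8, 1.00)       # the poster's 3.8 GHz Phenom II baseline
bd_per_clock = 3.8 / 4.2          # BD needs ~4.2 GHz to match PII at 3.8 GHz
bulldozer_oc = perf(4.5, bd_per_clock)  # typical non-exotic overclock
print(f"BD @ 4.5 GHz vs PII @ 3.8 GHz: {bulldozer_oc / phenom_ii - 1:+.0%}")
# -> +7%, in the same ballpark as the ~10% in the post
```

The simple model lands a few points under the poster's ~10%, close enough for forum math; the gap is down to rounding in the "about 4.2 GHz" figure.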


----------



## Vulpesveritas (Apr 10, 2012)

Hustler said:


> Well, seeing as Bulldozer has to run at about 4.2 GHz just to match my Phenom II at 3.8 GHz, the fact that the average Bulldozer only goes up to about 4.5 GHz (with non-exotic cooling methods) will still only make it about 10% faster, whilst at the same time using loads more power for very little return.
> 
> Even with these new refined PD cores @ 4.5 GHz, I would only be looking at about a 25% overall improvement over my Phenom II, whereas a 2500K @ 4.5 GHz would be nearer 40-50% faster than my Phenom II.


And we don't know how high they can go yet. And on the bright side, this may mean AMD sees the BD architecture as having more IPC headroom if it does match PII, seeing as the K10 "Stars" core only got what, a 5% IPC boost at 32nm?


----------



## faramir (Apr 10, 2012)

HumanSmoke said:


> Personally, I won't believe a word about Trinity/Piledriver core performance until I hear it from John Fruehe... oh wait



Joke Fruehe


----------



## Patriot (Apr 10, 2012)

xenocide said:


> That table is odd. Why are several CPUs listed twice when others are only listed once?



So all are listed at turbo... maybe they did two runs?

See if you get anything more from the source... either way, performance is nice for an APU... especially without L3...

http://translate.google.com/transla...O8H&rls=org.mozilla:en-US:official&prmd=imvns


----------



## faramir (Apr 10, 2012)

Dent1 said:


> IMO this article is fail.
> 
> Trying to compare a budget laptop quadcore Trinity APU to a high end desktop octocore Bulldozer is utter fail.



Why? Both mobile and desktop Trinity seem to perform the same as far as IPC is concerned, as is to be expected. This means the desktop part, running at a higher frequency, will obviously be faster, but both mobile and desktop parts enjoy the same ~5% IPC improvement over desktop Bulldozer. The test itself is not multi-threaded, and the article refers to IPC improvements only.

Could it be that you "are fail"? Reading comprehension problems?


----------



## Dent1 (Apr 10, 2012)

Hustler said:


> So what...that will just bring them up to Phenom II levels of performance (maybe slightly faster)...still an utter fail.



Your attempt at being slick is utter fail, because Phenom II isn't 25% faster than Bulldozer. 



faramir said:


> Why? Both mobile and desktop Trinity seem to perform the same as far as IPC is concerned, as is to be expected. This means the desktop part, running at a higher frequency, will obviously be faster, but both mobile and desktop parts enjoy the same ~5% IPC improvement over desktop Bulldozer. The test itself is not multi-threaded, and the article refers to IPC improvements only.
> 
> Could it be that you "are fail"? Reading comprehension problems?



Nope, the person with the reading comprehension fail is you, because I was referring to the unreleased high-end desktop Vishera, not the low-end desktop Trinity.


----------



## Super XP (Apr 10, 2012)

Complete nonsense. Nobody really knows how Piledriver is going to perform. All we can do is speculate at this time. I did hear that Piledriver will perform better clock-for-clock versus Bulldozer. All we can do now is wait.


----------



## Kreij (Apr 10, 2012)

Keep it on topic gents.


----------



## Aquinus (Apr 11, 2012)

I'm hoping to see some reasonable gains from putting GCN on an APU. Pairing Piledriver with fewer cores makes more room for the iGPU. You don't need a powerhouse to play games as a casual gamer. Also think about what AMD is doing. Bulldozer's integer performance is pretty awesome, and the floating point performance of a GPU is also pretty awesome. AMD has a plan, just think about the phrase, "the future is fusion." Llano was step 1.


----------



## TheoneandonlyMrK (Apr 11, 2012)

Aquinus said:


> Also think about what AMD is doing. Bulldozer's integer performance is pretty awesome, and the floating point performance of a GPU is also pretty awesome. AMD has a plan, just think about the phrase, "the future is fusion." Llano was step 1



+1, that's what I've been saying. Maths co-processors revolutionised CPUs back in the day (memory fades), and every CPU that couldn't keep up isn't around anymore. The GPU, integrated on-die further than it is now (it's still very, very usable), could enhance the CPU beyond present CPUs entirely.

It'll be some time yet, though.


----------



## ensabrenoir (Apr 11, 2012)

Aquinus said:


> ...AMD has a plan...


YES, I'VE SEEN THIS. It stars Michael J. Fox...

But seriously, AMD needs to get it out in time... no... a lot sooner, so Intel's releases won't make their advances seem... so last generation.


----------



## xenocide (Apr 11, 2012)

Aquinus said:


> ...but that is what you pay for when you get a CPU + GPU for 100 USD. For how much you're paying, it's a good bargain, not a high-end solution.



Well, only the highest-end Trinity APUs will come with the 7660D iGPU, and those will be around $140-160 if current Llano pricing sticks. I expect the 7660D to be about as good as an HD 5670, but no better than an HD 6670. It will be decent, and still better than HD 4000, but not a clean replacement for a discrete GPU just yet.



Aquinus said:


> I'm hoping to see some reasonable gains from putting GCN on an APU. Pairing Piledriver with fewer cores makes more room for the iGPU. You don't need a powerhouse to play games as a casual gamer. Also think about what AMD is doing. Bulldozer's integer performance is pretty awesome, and the floating point performance of a GPU is also pretty awesome. AMD has a plan, just think about the phrase, "the future is fusion." Llano was step 1.



BD's integer performance is alright, but I wouldn't say it's awesome. The FX-8150 on that chart sits around 2050 int/GHz, compared to the A8-3870, which sits at about 2475 int/GHz. Obviously the 8150 is better thanks to more cores and better overall performance, but the per-core performance of BD is still lacking, and that's the problem: since so many day-to-day tasks only use a handful of threads, having a ton of cores when you only need 2-4 with good per-core performance isn't optimal.
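
For reference, a per-GHz figure like those quoted above is just a raw single-thread score divided by clock. The raw scores below are back-computed from the post's ~2050 and ~2475 int/GHz numbers and the chips' stock clocks (3.6 GHz FX-8150, 3.0 GHz A8-3870), so treat them as illustrative rather than the chart's actual data:

```python
def score_per_ghz(raw_score, clock_ghz):
    """Normalise a raw benchmark score by clock speed."""
    return raw_score / clock_ghz

# Raw scores are illustrative, back-derived from the per-GHz figures above.
fx_8150 = score_per_ghz(2050 * 3.6, 3.6)  # -> ~2050 int/GHz
a8_3870 = score_per_ghz(2475 * 3.0, 3.0)  # -> ~2475 int/GHz
print(f"Llano per-core advantage: {a8_3870 / fx_8150 - 1:.0%}")  # ~ +21%
```

That ~21% per-clock deficit against Llano is the same gap the article's "trailing far behind" remark refers to.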


----------



## Vulpesveritas (Apr 11, 2012)

xenocide said:


> Well, only the highest-end Trinity APUs will come with the 7660D iGPU, and those will be around $140-160 if current Llano pricing sticks. I expect the 7660D to be about as good as an HD 5670, but no better than an HD 6670. It will be decent, and still better than HD 4000, but not a clean replacement for a discrete GPU just yet.


Well, if it has about the performance of a 6670, it will be a fairly clean replacement for an entry-level GPU, because right above that you have the 7750. XD


----------



## xenocide (Apr 11, 2012)

Vulpesveritas said:


> Well, if it has about the performance of a 6670, it will be a fairly clean replacement for an entry-level GPU, because right above that you have the 7750. XD



And that is the ideal situation. AMD has nothing to gain by selling CPUs/APUs whose iGPUs can compete with the discrete cards they are trying to sell. The best they can do from a business perspective is to make the iGPUs on their APUs end right where their discrete solutions begin.


----------



## Vulpesveritas (Apr 11, 2012)

xenocide said:


> And that is the ideal situation. AMD has nothing to gain by selling CPUs/APUs whose iGPUs can compete with the discrete cards they are trying to sell. The best they can do from a business perspective is to make the iGPUs on their APUs end right where their discrete solutions begin.


Exactly, and with a reasonable quad-core packaged with it, with an unlocked multiplier, it may turn out to be a decent budget gaming solution.
And the 65 W version will be a great chip for the general public, especially when GPU acceleration becomes more common.


----------



## Kantastic (Apr 11, 2012)

Dent1 said:


> Your attempt at being slick is utter fail, because Phenom II isn't 25% faster than Bulldozer.
> 
> 
> 
> Nope, the person with the reading comprehension fail is you, because I was referring to the unreleased high-end desktop Vishera, not the low-end desktop Trinity.



Stop talking out of your ass about unreleased products.


----------



## Aquinus (Apr 11, 2012)

Kantastic said:


> Stop talking out of your ass about unreleased products.



I don't recommend trolling. Might want to stay on topic and only post if you have something to contribute.


----------



## sergionography (Apr 11, 2012)

OK, 25% sounds all good, but it still barely catches up with Phenom II.
And I'm talking mobile here: an AMD Phenom II N660 from 2010 is faster than any dual-core Llano from 2011 by a mile, and it runs at 3 GHz in 35 W; that's just amazing.
Idk what AMD was thinking, but they sure should do some models with more CPU and less GPU for cases where discrete cards are to be used.
And idk what the deal is with AMD, but it almost seems like 32nm didn't bring any improvements other than a smaller die size; AMD's 45nm seems far superior if they can do 3 GHz in 35 W on a mobile chip.


----------



## Kantastic (Apr 11, 2012)

Aquinus said:


> I don't recommend trolling. Might want to stay on topic and only post if you have something to contribute.



Okay, I hear the unreleased, unbenched, never-before-seen by my own eyes Vishera will be 58.2323% faster than BD.


----------



## btarunr (Apr 11, 2012)

Dent1 said:


> Trying to compare a budget laptop quadcore Trinity APU to a high end desktop octocore Bulldozer is utter fail.



Yet "high end desktop octocore Bulldozer" fails in these tests vs. "budget laptop quadcore Trinity APU".



Dent1 said:


> Until I see a desktop Piledriver, ungimped and with full cache, I'm not passing any judgement.



You missed the part where we were talking about single-core performance. So Trinity's single-core performance won't differ from Vishera's. In fact, if K10 Thuban vs. Llano single-threaded math performance is any indicator (Llano is equal to or faster than K10 clock-for-clock in math tests), Trinity's single-core performance can safely be taken as a marker of the Piledriver architecture's performance. 



Dent1 said:


> Also, its great that a low end Trinity is spanking the Bulldozer, which is  promising for the full blown desktop Piledriver.



Again, this is a single-thread integer/FP performance test. Also, I wouldn't classify -1% to +8% as "spanking". 

The ~1.5 year old Core i5-2500K is "spanking" the yet-unreleased A10-5800K. 



Super XP said:


> Complete nonesense. Nobody really knows how Piledriver is going to perform.



If you read the article and not just the title, we really do know how Piledriver is going to perform (-1% to +8% that of Bulldozer, at single-threaded tests).


----------



## Dent1 (Apr 11, 2012)

Kantastic said:


> Stop talking out of your ass about unreleased products.



Elaborate.



btarunr said:


> Yet "high end desktop octocore Bulldozer" fails in these tests vs. "budget laptop quadcore Trinity APU".



Exactly, which was my original point: a 4-core Trinity APU based on Piledriver isn't bad if it's beating the desktop 8-core Bulldozer. The high-end desktop Vishera Piledriver can only improve things further.


----------



## Inceptor (Apr 11, 2012)

As has been said already, all you can take away from this little bit of unconfirmed information is that a Piledriver core is, apparently, on average, performing a few percent better than a Llano core, and by extension better than a Bulldozer Zambezi core with access to L3 cache.
Extending it further, we can possibly add a few percent more once L3 cache is included and higher clocks are applied, in a Vishera CPU. That's all that was ever going to happen here, boys and girls; that's enough to push _overall performance_ into the overclocked i5-2500K range or stock i7-2600 range. 
Maybe add another 2-5% increase if Vishera uses quad-channel memory.

As for the usual trolling and troll-ish comments:
I understand that you all want to feel important and sound very expert and knowledgeable by repeating incendiary comments you've read elsewhere, either written by people you think are intelligent or because you thought it sounded _cool_ and made you sound _cool_. But what's the point of the useless 'fail' textual noise? You're referencing a 'fail' based on the work and decisions of many people who were fired and no longer work for AMD. Quit it with all the false hurt and overblown outrage at a CPU that you wouldn't have bought even if it had originally matched Intel.

As Aquinus said, the real question is how future iterations of this core will perform once the integer cores and iGPU are fully integrated with each other. AMD chose, for better or worse, a long-term plan that may or may not be fruitful for them, but it looks very promising, even if it didn't turn out promising for the careers of many of the people responsible.


----------



## Atom_Anti (Apr 11, 2012)

Fourstaff said:


> That would be a pointless argument, since that we expected Llano to be able to run newer games at decent quality, whereas if we are getting Intel's chip we know that we must get a discrete. I think 3rd gen APUs will be powerful enough to be considered an alternative to discrete, but as of now APUs are only powerful enough to run counter strike and farmville, with Trinity powerful enough to run everything on low.



Well, I'm not sure what you are talking about, but I enjoyed playing GTA4 and NFS: The Run at high settings on my AMD Llano laptop. With a Core i5-520M I could not even play GTA San Andreas at low settings... Yeah, only CPU power is king.


----------



## xenocide (Apr 11, 2012)

Atom_Anti said:


> Well, I'm nut sure what you are talking about, but I enjoyed to play with GTA4 and NFS RUN in high settings with my AMD Llano laptop. With core I5 520 I could not even play GTA San Andreas in low settings.... Yeah only CPU power is the king.



Considering the i5-520M came out over a year and a half earlier than any Llano APU, that's not overly impressive. That i5 was still a lot more powerful on the CPU side than your Llano APU; the saving grace is the GPU. Nobody is (or at least should be) arguing that the iGPU on Llano/Trinity isn't nice, and it makes gaming possible on entry-level CPUs without the extra cost of a discrete GPU. The discussion (and News Post) is more about the performance of the CPU side of the equation, which is very lacking and in real need of improvement.  

I will say this again: APUs are excellent for laptops, but they are just not that great for desktops. Once you remove the need for the iGPU, you're just left with an underperforming CPU.


----------



## ensabrenoir (Apr 11, 2012)

If AMD releases anything comparable to an i5 (and not just in naming), with graphics built in and at current price points, in the next few months, they would own the market. If it takes two years, it's irrelevant.


----------



## Atom_Anti (Apr 11, 2012)

xenocide said:


> I will say this again, APU's are excellent for Laptops, but they are just not that great for PC's.  Once you remove the need for the iGPU, you're just left with an underperforming CPU.



Yes, I agree AMD Llano and Trinity are not for desktops, but they are definitely for laptops. However, who cares about desktops any more when the world is changing to mobile/portable devices??
What I mean is, whatever the CPU performance is, it means nothing without GPU power. If you add a GPU next to an Intel chip, that results in a thicker and warmer case with louder cooling, and it costs more.


----------



## xenocide (Apr 11, 2012)

Atom_Anti said:


> Yes, I agree AMD Llano and Trinity is not for desktop, but definitely for laptops. However how cares desktops any more when the World is changing to mobile / protible devices??
> So what I mean whatever is the CPU performance that means nothing without GPU power. If you add GPU near Intel, that will results thicker and warmer case with louder cooling and cost more.



People who think desktop PCs are going the way of the dinosaur are a little delusional. Mobile/portable devices will always be heavily restricted by batteries and size. Until quantum computers become the norm, there will always be a need for desktops. Also, until GPGPU becomes the norm, CPUs will still be very necessary. Most users run programs that rely only on the CPU, so having a CPU with great performance is generally more beneficial. Something like an Ultrabook gives the illusion of super-high performance, even if it's lacking in certain categories, and people love that.


----------



## Aquinus (Apr 11, 2012)

xenocide said:


> Until Quantum Computers become the norm



Didn't you get the memo that quantum computers are a hoax?


----------



## OneMoar (Apr 11, 2012)

k I am unsubbing from this thread until a mod hands out bans to the flamers here


----------



## eidairaman1 (Apr 12, 2012)

OneMoar said:


> k I am unsubbing from this thread until a mod hands out bans to the flamers here



u were one of them...


----------



## TheoneandonlyMrK (Apr 12, 2012)

I actually thought it had not gone too bad, mayhap because I posted once before this.

I can't wait for the ins and outs of what's going on to be known in a year or so; what will the PS3 successor and next Xbox have, etc.

I'm looking forward to a promising Vishera, not an awe-inspiring one, but I'd imagine with the new clock mesh tech these APUs will OC a fair bit, given a reasonable Vreg, which is only going to be on the FM2 platform, not so much laptops. In that form I can well imagine a modern-day console-type gaming experience on most PC games, maybe better, definitely better than an Xbox 360 game, and I can see it running well with an OC.

And with L3, more modules/cores, and an (important) possible later stepping, Vishera could end up doing well. I only back AMD at any point because I get 60-80 FPS in any game on ultra (admittedly with hybrid PhysX for the NV-favoured games) with my main rig, so some of this is OT. An Intel system may well do much better, but my experience isn't as bad as some of you are making out; I don't notice any wait times, and these chips are going to perform better than my main rig does, at minimum, or should.


----------



## sergionography (Apr 12, 2012)

theoneandonlymrk said:


> I actually thought it had not gone too Bad, mayhap because i posted once before this
> 
> i cant wait for the ins and outs of whats going on to be known in a few / year ,what will the ps3 and nextbox have etc.
> 
> ...



Yes, that's very true about the clocks. Just imagine: if a quad-core Piledriver can do a 100 W TDP at 4.2 GHz with half of the chip being a GPU clocked at 800 MHz, imagine how far the quad Piledriver cores could go without the GPU, or at least how efficient they would be.
From leaks it seems Piledriver is 20% faster than Bulldozer clock-for-clock, so it almost finally matches Phenom II IPC, but of course clocks much higher.
Once again it's the Phenom/Phenom II story, with Piledriver being what Bulldozer was meant to be (and some more... or hopefully at least).


----------



## Dent1 (Apr 12, 2012)

sergionography said:


> from leaks it seems piledriver is 20% than bulldozer clock-clock so it *almost finaly *matches phenom II ipc, but offcourse clocks much higher.



OK, stop spreading false information. I know you didn't do it deliberately, but all this false information needs to be nipped in the bud.

Phenom II is not 20% faster than Bulldozer clock for clock. It's ridiculous to think that. Obviously it's application-dependent, but on a good day, when an application favours Phenom II's architecture, we are talking about maybe 5% or less, within margin of error. Overall, Bulldozer is faster.


----------



## Aquinus (Apr 12, 2012)

Dent1 said:


> OK stop spreading false information - I know you didn't do it deliberately but all this false information needs to be nipped in the bud.
> 
> Phenom II is not 20% faster than Bulldozer clock for clock. It's ridiculous to think that. Obviously its application dependant, but on a good day when an application favours Phenom II's architecture we are talking about maybe 5% or less or within margin for error. Overall the Bulldozer is faster.



He said IPC, buddy. Just because BD has a lower IPC doesn't mean it doesn't run faster. Once BD's IPC is back up to where it was with Phenom II, there will be a lot more performance, because Bulldozer clocks that much higher than the Phenom IIs did.

Also, all of those tasks that do better on the Phenom II are single-threaded tasks and unoptimized floating-point applications, and even in both of those cases the performance is acceptable.


----------



## Dent1 (Apr 12, 2012)

Aquinus said:


> He said IPC, buddy. Just because BD has a lower IPC it doesn't mean that it doesn't run faster. Once BD's IPC is down to where it was with the Phenom II, there will be a lot more performance because bulldozer clocks that much higher than the Phenom IIs did.
> 
> Also all of those tasks that do better on the P2 are single threaded tasks and unoptimized floating point applications and even in both of these cases, the performance is acceptable.



Yes, but those single threaded tasks don't do 20% better as sergionography implied.


----------



## eidairaman1 (Apr 12, 2012)

How about we wait for Trinity to be in the hands of the reviewers and users here? Same with AM3+ Piledriver.


----------



## nt300 (Apr 12, 2012)

btarunr said:


> On a rather disturbing note, the performance-per-GHz figures of Piledriver are trailing far behind K12 architecture (Llano, A8-3850), let alone competitive architectures from Intel.


Each and every design is different. Piledriver/Bulldozer is designed for higher clock speeds; Llano and K12 are not. Just like the Athlon 64 of the past, which needed less clock speed to beat out a Pentium 4 that needed at least an extra 1000 MHz to stay competitive.


----------



## Vulpesveritas (Apr 12, 2012)

Quite true; both Llano and Trinity are performing at about 2400 integer score/GHz, which is 20% lower than the i5 SB score.
So depending on pricing and the clocks of the lower-end chips, Trinity may be fully competitive thanks to its higher clocks and superior iGPU, especially with the IB i3/Pentium not coming out till Q3/Q4 and Trinity coming out late Q1/early Q2.
And if the earlier rumor of power efficiency is true as well, with it being 15% more power efficient than Llano, and given that BD overclocked rather decently, the unlocked parts, based on what information we currently have available, look like a great budget part.

Although it is still merely speculation until Wiz gets to do a review.
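The score/GHz figure above is just a benchmark score normalized by clock speed. A quick sketch of the arithmetic, with made-up scores and clocks chosen only to illustrate the ~2400/GHz and ~20% numbers quoted (not measured data):

```python
# Hedged sketch: normalizing an integer benchmark score per GHz.
# All scores and clocks below are hypothetical placeholders, not measurements.
def score_per_ghz(score: float, clock_ghz: float) -> float:
    """Benchmark score divided by clock, a rough proxy for per-clock throughput."""
    return score / clock_ghz

chips = {
    # hypothetical (score, clock in GHz) pairs for illustration only
    "APU example": (9100, 3.8),
    "i5 example": (9900, 3.3),
}

for name, (score, ghz) in chips.items():
    print(f"{name}: {score_per_ghz(score, ghz):.0f} points/GHz")

# Relative per-clock deficit of the first chip versus the second:
a = score_per_ghz(*chips["APU example"])
b = score_per_ghz(*chips["i5 example"])
print(f"deficit: {(1 - a / b) * 100:.0f}%")
```

With these invented inputs the first chip lands near 2400 points/GHz and about 20% behind the second, which is the shape of the comparison being made in the post.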


----------



## Aquinus (Apr 12, 2012)

Dent1 said:


> Yes, but those single threaded tasks don't do 20% better as sergionography implied.



At the same clock speed, I bet they did. Can we get a review of a Phenom II and BD at the same clock, HTT speeds, and memory speeds? It would answer this question very quickly.



nt300 said:


> This statement complete rubbish.



Stop trolling and only post if you have something useful to contribute.


----------



## Steevo (Apr 12, 2012)

nt300 said:


> This statement complete rubbish.



Apparently you missed the chart, or lack the understanding to read it.


They directly compare FP performance per clock, and the A8 series is beating the A10 and Bulldozer soundly, if the chart is real. 

In other words, AMD may have done nothing but tweak the core a bit to conserve energy and increase the speed. All joking aside, this new architecture is the P4 from AMD.


I keep thinking and saying their only saving grace will be GCN added to a quad core, plus software enhancement to offload the work to the much faster GPU; however, I think they lack the manpower and drive to do it. So I am expecting mediocrity from their next chip after this too. Once they push for it, or pull back to a design tweaked for efficiency, they have a chance to regain the performance edge. 


AMD, please, make software to support your hardware. Intel did it for years; programs would see an Intel chip and optimize performance. You can do it too. I would rather have two cores dedicated to serving data to an on-board GPU with enough stream processors and cache to run full tilt than have 8 cores total. Or do it in hardware; surely 10% die area is worth an exponential increase in performance.


----------



## Aquinus (Apr 12, 2012)

Steevo said:


> surely 10% die area is worth a exponential increase in performance.



Bulldozer modules scale almost linearly; hyper-threading does not. I keep telling people that single-threaded applications aren't the future, they're the past. I think people are worrying about things that won't matter in the future, since nothing is optimized for BD yet. (Applications that use FMA3 on BD actually see sizable floating-point speed improvements.)
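The scaling claim can be put in a toy throughput model. The per-thread yield factors below are rough rules of thumb that floated around at the time (a second BD-module core worth ~80% of a full core, an SMT sibling worth ~25%), not measurements:

```python
# Illustrative throughput model; the yield factors are rough rules of thumb,
# not measured data.
def module_throughput(threads: int, second_thread_yield: float) -> float:
    """Total throughput in 'core equivalents' when threads share resources in pairs."""
    pairs, leftover = divmod(threads, 2)
    return pairs * (1 + second_thread_yield) + leftover

# 8 threads on 4 BD modules vs. 8 threads on 4 hyper-threaded cores (assumed yields):
bd = module_throughput(8, 0.8)   # assume ~80% yield from the second core in a module
ht = module_throughput(8, 0.25)  # assume ~25% yield from an SMT sibling thread
print(bd, ht)  # roughly 7.2 vs 5.0 core-equivalents under these assumptions
```

The point of the sketch is only that a shared module with near-full second-core hardware scales closer to linearly than SMT does, under these assumed yields.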



Steevo said:


> AMD, please, make software to support your hardware, Intel did it for years, programs would see a Intel chip and optimize performance, you can do it too. I would rather have two cores dedicated to serving data to a on bard GPU with enough stream processors and cache to run full tilt than have 8 cores total.



Clearly you don't know how the development model works. Why would you prepare software for cutting-edge hardware that the majority of people don't have? People use technologies when it benefits them, and it benefits software companies when people have hardware that can run their software. That means requiring something like FMA3 puts people who don't have SB or BD at a loss, which only hurts the consumer and the software developer.


----------



## xenocide (Apr 12, 2012)

Aquinus said:


> Bulldozer modules scales almost linearly, hyper-threading does not.



That's because HTing isn't intended to serve the same function.  It's just there so the CPU can use previously unused resources to get some work done instead of idling.  Bulldozer modules do scale well, but the problem is shit scaling linearly is still just shit.  Plus it's not as though having 1,000 Cores is better than just 4 good ones for most people.



Aquinus said:


> I keep telling people that single-threaded applications aren't the future, they're the past. I think people are pandering about things that won't matter in the future since nothing is optimized for BD. (Applications that use FMA3 on BD actually have sizable floating point speed improvements.)



I don't think anyone wants single-threaded applications; they are more of an unfortunate reality. This isn't like the Athlon X2 era, when people said it didn't matter because only a handful of people even had multi-core CPUs; at this point just about everything comes with at LEAST a dual core. The issue is that Bulldozer CPUs only get an edge when there are more than 4 threads and you're using a BD CPU with more than 4 cores. Even then, having such low per-core performance usually results in the Intel CPUs winning out.


----------



## Vulpesveritas (Apr 12, 2012)

Steevo said:


> Apparently you missed the chart, or lack the understanding to read it.
> 
> 
> They directly compare the FP performance per clock, and the A8 series is raping the A10 and bulldozer if the chart is real.
> ...


Apparently you don't see the point of the BD/PD architecture. The very idea behind it sacrifices FP performance by sharing the FP unit between two cores. It is integer performance that matters for what the architecture is geared toward, and thus that is what is being improved. Note that the integer performance per clock is the same as Llano's, but it clocks 33% higher, giving the quad-core unlocked Trinity only a 10% lower integer performance than the i5-2500K while having a superior iGPU. Meaning that it -should- outperform a SB i3, and we won't see IB i3s until Q3/Q4 most likely; therefore Trinity should hold a very good value spot for up to 6 months, and quite possibly remain competitive with the IB i3s thanks to its unlocked variants and its superior iGPU.

Why do I keep mentioning the iGPU? Because AMD's long-term plan is HSA, and dumping floating-point math onto the iGPU. And HSA functions -should- be available next year, with 22nm Steamroller + GCN (expected to be possibly a 7750-equivalent) on die.  

Oh, and earlier possible leaks show Trinity to be more power efficient than Llano as well.



Also @xenocide: 20% lower performance/clock and 10% lower performance/watt = shit? I see it as being less efficient and less powerful, but it's not as if it is only half as powerful (and I'm saying that based on the slide, btw, given the A10 is a 100 W part that is really a 95 W CPU + GPU plus a 5 W bridge chip and all). Bulldozer was something of a fail; Piledriver isn't looking to be quite as bad. 

I have to agree on the appearance of a Phenom I / II situation again here.


----------



## Steevo (Apr 12, 2012)

Aquinus said:


> Bulldozer modules scales almost linearly, hyper-threading does not. I keep telling people that single-threaded applications aren't the future, they're the past. I think people are pandering about things that won't matter in the future since nothing is optimized for BD. (Applications that use FMA3 on BD actually have sizable floating point speed improvements.)
> 
> 
> 
> Clearly you don't know how the development model works. Why would you prepare software for cutting edge hardware that the majority of people don't have. People use technologies when it benefits them, and it benefits software companies when people have hardware that can run their software. That means requiring something like FMA3 puts people who don't have SB or BD at a loss which only hurts the consumer and the software developer.



At a linear rate of what? Adding more cores to processors doesn't directly improve performance, as many threaded items have dependencies; core 0 may still be processing a thread whose result core 1 needs before it can start work. IPC is still extremely important; thinking otherwise is naive. 

So first you say the way forward is multi-core, and now you say they shouldn't prepare for multi-core systems? This makes no sense. If we are moving to a multi-core standard (we are), we need hardware/software resources to support it, and if developers aren't going to do it, AMD needs to.
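The dependency point can be made concrete with a small work/span sketch: if one core needs another core's result, no number of extra cores can shorten that chain, and the best possible speedup is total work divided by the critical path. The task costs and edges below are invented for illustration:

```python
from functools import lru_cache

# Toy work/span model of a task DAG; costs and dependencies are invented.
costs = {"A": 4, "B": 3, "C": 2, "D": 1}
deps = {"A": [], "B": ["A"], "C": [], "D": ["B", "C"]}  # B waits on A; D waits on B and C

@lru_cache(maxsize=None)
def finish_time(task: str) -> int:
    """Earliest finish with unlimited cores: own cost plus the longest dependency chain."""
    return costs[task] + max((finish_time(d) for d in deps[task]), default=0)

work = sum(costs.values())                 # total time on a single core
span = max(finish_time(t) for t in costs)  # critical path A -> B -> D
print(work, span, work / span)             # max speedup is work/span, however many cores
```

Here the chain A→B→D takes 8 of the 10 total units, so even infinite cores cap the speedup at 1.25x, which is exactly the "core 1 waits on core 0" problem described above.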


----------



## Vulpesveritas (Apr 13, 2012)

Steevo said:


> At a linear rate of what? Adding more cores to processors doesn't directly improve performance as many threaded items are/have dependancies, so core 0 may be still processing a thread that core 1 needs the result of to start work. IPC is still extremely important, thinking otherwise is naive.
> 
> So first you say the way is multi-core, and now you say they shouldn't prepare for multi-core systems? This makes no sense, if we are moving to a multi-core standard (we are) we need to have hardware/software resources to support it, and if developers aren't going to do it, AMD needs to.



So given that it has 80-90% of the single-core performance while scaling better, I would say there is an advantage. Technically it's single-thread performance per watt that is more important than pure IPC, although they -usually- tend to go hand in hand versus clocks.


----------



## Dent1 (Apr 13, 2012)

Aquinus said:


> At the same clock speed, I bet they did. Can we get a review of a Phenom II and BD at the same clock, HTT speeds, and memory speeds? It would answer this question very quickly.



I've seen many comparison reviews of the two CPUs in question at the same clock speed, and none showed anything close to a 20% average IPC increase over Bulldozer.

I'd be happy to read a review which shows that claim. If anyone has external reading material, feel free to post it.


----------



## sergionography (Apr 13, 2012)

Dent1 said:


> Yes, but those single threaded tasks don't do 20% better as sergionography implied.



They actually do, dude. Just go and look at an FX-4100 review and compare it to a Phenom II 980BE. You are looking at an FX-8150, which clocks up to 4.2 GHz and has 8 cores; that's why it does better than a typical quad-core Phenom II. But comparing a quad-core Bulldozer to a quad-core Phenom II, it fails miserably.
Only in situations where new instruction sets are supported does Bulldozer hold its ground; in typical use it's way behind clock-for-clock, and yes, by 20% if not more.
Phenom II does 3 IPC while a Bulldozer module does 4 IPC shared between 2 cores, and because it has such a long pipeline, each cycle takes a longer time (which isn't necessarily bad, because it's kinda designed that way so the resources can feed the second core in the module while the first one is munching on data), but things didn't go so well and the latency is worse than expected.


http://www.legitreviews.com/article/1766/17/

Here's some of the conclusion from LegitReviews; I wasn't talking out of my ass, just so you know:

"When it comes to performance we were shocked to see the AMD A8-3850 'Llano' processor and the Socket FM1 platform performing better than the AMD FX-4100 'Bulldozer' processor and the Socket AM3+ platform. We quickly found out that the FX-4100 was priced this low as it needed to be. The performance of the FX-4100 wasn't awful, but we didn't expect to see the AMD A6-3650 running at 2.6GHz to beat the AMD FX-4100 running at 3.6GHz in benchmarks like POV-Ray and Cinebench! "


----------



## Steevo (Apr 13, 2012)

Vulpesveritas said:


> So given that it has 80-90% of the single core performance while scaling better, i would say there is an advantage.  Technically it's single thread performance / watt that is important than pure IPC.  Although they -usually- tend to go hand in hand vs clocks.



Try between 50-90% better. 

http://www.google.com/url?sa=t&rct=...sg=AFQjCNHF-2kq4OaqEGHMZBN3z0kXh3zM8A&cad=rja


Intel's own research shows a lack of increase when additional resources are needed to schedule and track data between cores, plus result dependency. I agree the initial move to two or four cores is a significant increase, as we can offload other threads from the core running our primary worker, or assign different processes to different cores; however, the overhead cost starts degrading the performance with more cores. 


I was merely asking for a hardware thread handler, and if, like Nvidia's "hot clocks," it can run at 2 or 4 times the core speed, it could easily dispatch and track resources, even handling the offload of work to the GPU cores for faster processing. I understand the unified memory and the number of threads/different types of work make it difficult, but compared to making mediocre processors blazingly fast, what downside is there? If it added 25 W of heat but was only used on enthusiast-grade processors, I would still buy it, as would many.


----------



## Vulpesveritas (Apr 13, 2012)

Umm, BD is not 1/10th the processing power of SB; try 70-80% of the power.
That said, I'm fairly sure many of us would buy higher-clocked chips up to 250 W, and they would still sell among us, given they could still be air cooled and most mid-tower cases support 120mm tower coolers. Just leave the cooler out of the box, or give an option for a 155-165mm tall cooler with push-pull fans and a good design, and we're set.


----------



## sergionography (Apr 16, 2012)

Steevo said:


> Try between 50-90% better.



^this


----------



## Aquinus (Apr 16, 2012)

xenocide said:


> That's because HTing isn't intended to serve the same function. It's just there so the CPU can use previously unused resources to get some work done instead of idling. Bulldozer modules do scale well, but the problem is shit scaling linearly is still just shit. Plus it's not as though having 1,000 Cores is better than just 4 good ones for most people.



I feel like all of my posts in this thread were just posted back to me...



Steevo said:


> I was merely asking for a hardware thread handler, and if like Nvidias "hot clocks" it can run at 2 or 4 times the core speed it could easily dispatch and track resources, even handling the offload of work to the GPU cores for faster processing. I understand the unified memory and number of threads/different type of work makes it difficult, but compared to making mediocre processors blazingly fast, what downside is there? If it added 25W of heat but was only used on enthusiast grade processors I would still buy it, as would many.



I think you're confusing how GPUs and CPUs work.

nVidia can do what they do because a GPU dispatches large workloads and runs a calculation on every shader that has data. CPUs don't work like this, because you're not bulk-processing the same instruction across a ton of data; you have different instructions being run. Therefore what you're describing for a CPU is essentially a pipeline, which CPUs already have, but "dispatching" anything will result in less performance in single-threaded instances.

Do you know the four basic operations that almost any general-purpose CPU performs? Not to over-simplify how long a pipeline is, but basically you LOAD, DECODE, EXECUTE, and STORE, in that order. At this level there is no parallelism; it's very step-by-step, in the sense that you can't decode an instruction before you load it, you can't execute an instruction until it has been decoded, and you can't store the result until the instruction has been executed.
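Those four stages can be sketched as a toy interpreter in which each stage consumes the previous stage's output, which is exactly the serial dependency being described (the 3-operand instruction encoding here is invented for illustration):

```python
# Toy LOAD/DECODE/EXECUTE/STORE loop; the instruction format is invented.
memory = {"x": 2, "y": 5, "z": 0}
program = [("ADD", "x", "y", "z")]  # z = x + y

def load(pc):                 # LOAD: fetch the next raw instruction
    return program[pc]

def decode(raw):              # DECODE: split into an operation and operand values
    op, a, b, dest = raw
    return op, memory[a], memory[b], dest

def execute(op, a, b):        # EXECUTE: perform the operation
    return a + b if op == "ADD" else None

def store(dest, value):       # STORE: write the result back to memory
    memory[dest] = value

for pc in range(len(program)):        # each stage needs the previous stage's output
    op, a, b, dest = decode(load(pc))
    store(dest, execute(op, a, b))

print(memory["z"])  # 7
```

Note that `store` cannot run before `execute` returns, and `decode` cannot run before `load` returns; that ordering is the "no parallelism at this level" point in the post.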


----------



## Steevo (Apr 16, 2012)

Yes I am aware, as I am in the process of getting my degree in computer science. C++, Networking, and other classes. 

A single thread on a CPU might run those four stages, but if we have a hardware scheduler that reads ahead, prefetches data ("branching"), performs the decode at twice the rate, programs shaders to do the work, and then has them execute it and store the result in the contiguous memory pool, what difference does it make whether the CPU transistors do it or, when the same instruction is run 5,000 times in the program, the GPU transistors do it? 

Pretty simple actually; GPUs already do 90% of this work to keep up with demand. The hardest part would be resource tracking, but again, if they solve it and the performance increase is only 25% better on average, they win.


----------



## Aquinus (Apr 17, 2012)

Steevo said:


> A single thread on a CPU might run the four, but if we have a hardware scheduler that reads ahead and prefetches data "branching" and then performs the decode at twice the rate, programs shaders to do the work, and then they execute it and store it in the contiguous memory pool what difference does it make if the CPU transistors do it, or if the same instruction is run 5,000 times in the program, the GPU transistors do it.



Except you can't process a regular application through a pipeline like a GPU has because GPU data is all the same where a computer program has multiple different instructions per clock cycle. A GPU is given a large set of data and told to do a single task to all of it, so it does it the same way. A CPU is instruction after instruction, there isn't a whole lot that represents what the GPU can do.

A shader is small because it has a limited number of instructions it can perform and has no control mechanism, no write back. There is no concept of threads in a GPU, it is an array of one or more sets of data that will have the same operation performed on the entire set. A shader is also SIMD, not MIMD as you're describing.

Where a CPU can carry out instructions like "move 10 bytes from memory location A to memory location B," A GPU does something more like "multiply every item in the array by 1.43."
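That contrast fits in a few lines. This is a sketch of the two programming models only, not of how either chip literally executes:

```python
# CPU-style: one specific instruction, e.g. "move 10 bytes from A to B".
mem_a = bytearray(b"0123456789abcdef")
mem_b = bytearray(16)
mem_b[0:10] = mem_a[0:10]

# GPU-style (SIMD): one operation applied uniformly to a whole data set,
# e.g. "multiply every item in the array by 1.43".
data = [1.0, 2.0, 3.0, 4.0]
scaled = [x * 1.43 for x in data]

print(bytes(mem_b[:10]), scaled)
```

The CPU line is a single bespoke step; the GPU line is the same operation mapped over every element, which is why a shader array with no per-element control flow can execute it so widely.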



Steevo said:


> Pretty simple actually, GPU's already do 90% of this work to keep up with demand. The hardest part would be resource tracking, but again, if they solve it and the performance increase is only 25% better on average, they win.



If it is so simple, why hasn't anyone else figured it out? I'm still convinced that you don't quite know what you're talking about.



Steevo said:


> Yes I am aware, as I am in the process of getting my degree in computer science. C++, Networking, and other classes.



I do have a bachelor's degree in computer science, not to mention I'm employed as a systems admin and a developer.


----------



## sergionography (Apr 28, 2012)

Aquinus said:


> Except you can't process a regular application through a pipeline like a GPU has because GPU data is all the same where a computer program has multiple different instructions per clock cycle. A GPU is given a large set of data and told to do a single task to all of it, so it does it the same way. A CPU is instruction after instruction, there isn't a whole lot that represents what the GPU can do.
> 
> A shader is small because it has a limited number of instructions it can perform and has no control mechanism, no write back. There is no concept of threads in a GPU, it is an array of one or more sets of data that will have the same operation performed on the entire set. A shader is also SIMD, not MIMD as you're describing.
> 
> ...




True, unless you can get the CPU to sort things out and let the GPU do what it's best at, meaning the CPU can fetch and decode, and then execution is assigned to wherever it is more efficient, the CPU or the GPU.
I think that's the approach AMD is taking with APUs in the future (HSA).


----------



## TheoneandonlyMrK (Apr 28, 2012)

Aquinus said:


> Originally Posted by Steevo
> A single thread on a CPU might run the four, but if we have a hardware scheduler that reads ahead and prefetches data "branching" and then performs the decode at twice the rate, programs shaders to do the work, and then they execute it and store it in the contiguous memory pool what difference does it make if the CPU transistors do it, or if the same instruction is run 5,000 times in the program, the GPU transistors do it.
> 
> Except you can't process a regular application through a pipeline like a GPU has because GPU data is all the same where a computer program has multiple different instructions per clock cycle. A GPU is given a large set of data and told to do a single task to all of it, so it does it the same way. A CPU is instruction after instruction, there isn't a whole lot that represents what the GPU can do.
> ...



Modern GPUs are MIMD; the 7970 and Fermi/Kepler (the 7970 and the 680) have write-back support in hardware. As has been said, the work's being put in to make GPUs more useful computationally; the hard bit, the thing they are doing well at, is maintaining graphics performance while they do it.

A stacked APU with four Steamroller modules of two cores (8 total), plus 8 gigs :pimp: :dreamin: of on-die memory and dual GCN GPU units, would be a truly killer app


----------



## Steevo (Apr 28, 2012)

sergionography said:


> true, unless u can get the cpu to sort things out and let the gpu do what its best. meaning the cpu can fetch and decode, then execution will be determined whether more efficient on the cpu or gpu
> i think thats the approach amd is taking with apu's in the future(HSA)



That is exactly what I was talking about: fetch and decode plus a hardware scheduler to push work to the GPU cores or CPU cores based on the type of work and the load. Even if the GPU is twice as slow at one instruction, it wins if the CPU cores are busy and it's not....


And if the GPU and CPU can read and write to shared cache, the GPU could execute instructions A, C, F, H, I, J, and M while the CPU runs B (which depends on A), then D and E, which are stored for F to run two iterations of; the checked results are then stored back for the CPU to run G... so on and so forth.
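What I'm describing is essentially topological scheduling of a dependency graph: each "cycle", every instruction whose inputs are ready gets issued to whichever unit (CPU core or GPU subunit) is free. A toy sketch in Python, with hypothetical instruction names just to illustrate the idea:

```python
# Toy model of the dispatcher idea: instructions with explicit
# dependencies, issued in waves once their inputs are ready.
deps = {               # instruction -> instructions it waits on
    "A": [], "B": ["A"], "C": [], "D": [], "E": [],
    "F": ["D", "E"], "G": ["F"],
}

done, order = set(), []
while len(done) < len(deps):
    # everything whose dependencies are satisfied can issue this "cycle"
    ready = [i for i in deps if i not in done and all(d in done for d in deps[i])]
    order.append(ready)   # these could go to CPU/GPU units in parallel
    done.update(ready)

print(order)  # [['A', 'C', 'D', 'E'], ['B', 'F'], ['G']]
```

A, C, D, and E have no dependencies, so they could all issue at once; B and F go next; G waits on F.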


----------



## sergionography (Apr 30, 2012)

Steevo said:


> That is exactly what I was talking about: fetch and decode plus a hardware scheduler to push work to the GPU cores or CPU cores based on the type of work and the load. Even if the GPU is twice as slow at one instruction, it wins if the CPU cores are busy and it's not....
> 
> 
> And if the GPU and CPU can read and write to shared cache, the GPU could execute instructions A, C, F, H, I, J, and M while the CPU runs B (which depends on A), then D and E, which are stored for F to run two iterations of; the checked results are then stored back for the CPU to run G... so on and so forth.



Yeah, and you are totally right. Even Intel is sort of taking this approach in Haswell, as far as I know; instead of doing it through GPU cores they are working on some sort of scheduler to make use of all the cores on the chip in hardware, instead of relying on software being optimized for multithreading.
So in essence both AMD and Intel are taking a new approach to fetching/decoding, letting the hardware decide the execution part so it can be as efficient as possible. That is the future of computing: trying to take advantage of every piece of silicon on the die. Even now I can't think of one piece of software that will use a Core 2 Quad or a Phenom X4 at 100% on all 4 cores. Making cores fatter and fatter can also only get so far, and even Intel seems to be slowing down on that approach now that process shrinks are getting near their limits and only getting more difficult.


----------



## Aquinus (Apr 30, 2012)

Steevo said:


> That is exactly what I was talking about: fetch and decode plus a hardware scheduler to push work to the GPU cores or CPU cores based on the type of work and the load. Even if the GPU is twice as slow at one instruction, it wins if the CPU cores are busy and it's not....
> 
> 
> And if the GPU and CPU can read and write to shared cache, the GPU could execute instructions A, C, F, H, I, J, and M while the CPU runs B (which depends on A), then D and E, which are stored for F to run two iterations of; the checked results are then stored back for the CPU to run G... so on and so forth.



Take a computer architecture course and you will understand why this is not feasible, not just with the hardware but with x86 as well. First of all, your logic fails when you consider the application. Your GPU *does not execute x86 instructions*; it is told what to do, does it, and hands the result back. The video card doesn't just do something arbitrary: it does the same thing to everything in the buffer provided to it.

What if instruction B depends on data from instruction A, or F depends on data from instruction E? Between the PCI-E latency and the time it takes for the GPU to execute the instruction, store it, and send it back over PCI-E, you just hammered your performance ten-fold. GPUs aren't designed for procedural code. They're designed to process large amounts of data in a similar fashion, and I think you're confusing what a GPU can actually do. GPUs process many kilobytes to many megabytes of data per instruction, not just two operands.

Learn what you're talking about before you claim something can be done that people who do this for a living, with 8+ years of schooling, haven't managed. Honestly, what you're describing isn't feasible, and I think I pointed this out before.
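The distinction being drawn here, one operation over a whole buffer versus a chain of dependent scalar instructions, can be sketched with NumPy standing in for the GPU's data-parallel model. This is only an illustration of the two workload shapes, not real GPU dispatch:

```python
import numpy as np

# "GPU-style" work: one operation applied to an entire buffer at once.
# Every element gets the same treatment -- this maps well to a GPU.
data = np.arange(1_000_000, dtype=np.float32)
scaled = data * 2.0 + 1.0

# "CPU-style" work: a dependent chain. Each step needs the previous
# result, so nothing here can be batched; the hardware must wait.
x = 3.0
a = x * 2.0   # instruction A
b = a + 1.0   # instruction B depends on A
c = b * b     # instruction C depends on B
print(scaled[:3], c)  # [1. 3. 5.] 49.0
```

The first form is one "instruction" over a million elements; the second is three instructions that cannot overlap at all.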


----------



## sergionography (Apr 30, 2012)

Aquinus said:


> Take a computer architecture course and you will understand why this is not feasible, not just with the hardware but with x86 as well. First of all, your logic fails when you consider the application. Your GPU *does not execute x86 instructions*; it is told what to do, does it, and hands the result back. The video card doesn't just do something arbitrary: it does the same thing to everything in the buffer provided to it.
> 
> What if instruction B depends on data from instruction A, or F depends on data from instruction E? Between the PCI-E latency and the time it takes for the GPU to execute the instruction, store it, and send it back over PCI-E, you just hammered your performance ten-fold. GPUs aren't designed for procedural code. They're designed to process large amounts of data in a similar fashion, and I think you're confusing what a GPU can actually do. GPUs process many kilobytes to many megabytes of data per instruction, not just two operands.
> 
> Learn what you're talking about before you claim something can be done that people who do this for a living, with 8+ years of schooling, haven't managed. Honestly, what you're describing isn't feasible, and I think I pointed this out before.



GPUs only doing graphical tasks was years ago; today's architectures are very much capable of GPGPU (general-purpose computing). AMD designed GCN with compute in mind, and so did NVIDIA with Fermi.
However, you have a point, and these are pretty much the challenges both Intel and AMD face. It is possible, though.
Also, remember that we are not talking about using a GPU and forgetting the CPU; we are talking about both working together, meaning certain kinds of instructions like the ones you stated would most likely be crunched on the CPU, while things like floating-point operations would be done by the GPU.
Also note that there will be no more PCI Express linking the GPU and CPU when both are integrated together.
It seems like you are talking about a discrete GPU and a CPU doing the method Steevo mentioned, but that is NOT the case; we are talking about GPU/CPU integration at the architecture level.


----------



## Aquinus (Apr 30, 2012)

sergionography said:


> GPUs only doing graphical tasks was years ago; today's architectures are very much capable of GPGPU (general-purpose computing). AMD designed GCN with compute in mind, and so did NVIDIA with Fermi.
> However, you have a point, and these are pretty much the challenges both Intel and AMD face. It is possible, though.
> Also, remember that we are not talking about using a GPU and forgetting the CPU; we are talking about both working together, meaning certain kinds of instructions like the ones you stated would most likely be crunched on the CPU, while things like floating-point operations would be done by the GPU.
> Also note that there will be no more PCI Express linking the GPU and CPU when both are integrated together.
> It seems like you are talking about a discrete GPU and a CPU doing the method Steevo mentioned, but that is NOT the case; we are talking about GPU/CPU integration at the architecture level.



Okay, so the iGPU is linked to the IMC/north bridge in a Llano chip. The issue is that the CPU still only sees the iGPU as a GPU. The CPU has no instructions for telling the GPU what to do; everything still goes through the drivers at the software level.

I'm not saying that this isn't the way things are moving, but with how GPUs and CPUs exist now, there isn't enough coherency between the two architectures to really do as you describe. The CPU still has to dispatch GPU work at the software level, because there is no instruction that says "use the GPU cores to calculate blar."

Also keep in mind that for a single-threaded, non-optimized application, you have to wait for the last operation to complete before you can execute the next one. That isn't strictly true for modern superscalar processors, but if the next instruction requires the output of the last, you still have to wait; you can't just execute all the instructions at once. It's not a matter of whether it can be done; it's a matter of how practical it would be. Right now, with how GPUs are designed, providing large sets of data to be calculated at once is the way to go.

What you're describing is some kind of computational "core" that can do GPU-like calculations on a CPU core. The closest thing to it is Bulldozer (and most likely its successors, which will be more like what you're describing), where a "module" contains a number of components for executing threads concurrently. A CPU could have an instruction that computes a Fast Fourier Transform on a matrix (an array of data), and the CPU would take control of that many ALUs to do the task at once; on the other hand, if multiple different ALU operations are being performed at once, only that number of resources is used.

This is what AMD would like to do and BD was step 1.

The issues are:
No software supports it.
No compiler supports it.
No OS supports it.

Could it be fast: Absolutely.
Could it be feasible in the near future: I don't think so.

I think there will be a middle ground between a CPU core and a GPU core when all is said and done; we just haven't got there yet, and I think it will be some time before we do. I'm just saying that what you're describing, between a distinct CPU core and a distinct GPU core, isn't feasible with how CPUs with an iGPU work, and even GCN doesn't overcome these issues.


----------



## Steevo (Apr 30, 2012)

CUDA/GCN is a prime example of what we CAN do, and yes, Windows DX11 supports both as run under the drivers, or under OpenCL by itself.

The idea of making a GPU with x86/x64 compute has been realized; now it is just the memory addressing between two independent processors (CPU and GPU) that makes it hard to do. But with both on one die and one hardware controller controlling dispatches for both......


----------



## Aquinus (Apr 30, 2012)

Steevo said:


> one hardware controller controlling dispatches for both......



Memory is becoming generalized, correct. What the instructions are, how they are dispatched, and how long they take to run is not. That is what you're describing, and it cannot be done with current iGPU solutions, at least not well enough to be worth it. The iGPU might be on the same die, but computationally speaking, there is nothing that couples the CPU cores to the GPU. Also, OpenCL requires drivers that support it; you can't just use OpenCL without telling the computer how to do it, which is what the drivers do. Basically (using Llano as an example), the iGPU had the PCI-E/memory bus replaced by the CPU's north bridge, and that is it. Like I said before, GPUs don't work the same way as CPUs, and they're not close enough to do what you're describing.

Also, once again, if it really were as easy as you describe, someone who has been working on this for years with a doctoral degree in either Comp Sci or EE would have figured it out before you. Since it practically doesn't exist, I'm fairly certain that what you're describing is purely theoretical, without any real-world background to back up such claims.


----------



## Steevo (Apr 30, 2012)

What part of a "new hardware controller to perform those functions" do you NOT understand?


We have the technology, we have the capability; it will be hard and require a few spins to get right, but even a 25% IPC improvement is well worth it.


So: hardware that sits in the die between the RAM and L3 and can read and write to either, allows DMA for both CPU and GPU, and, while maintaining the TLB, reads the branching, decodes instructions, and issues them to the proper channel, or even to the next available core (think hardware-level multithreading of single-threaded apps)......



Sounds complex until you realize that half the issues were dealt with when we moved away from PIO to DMA.


The primary issue of data dependencies could be overcome by the controller issuing thread A to core 0 and thread B to GPU subunit 1.

Operation X waits on the results of both; the next instruction Y is fetched, and core 1 is assigned the two corresponding memory addresses that will hold the results of thread A and thread B, programmed to multiply the two results and save to a new memory address.

This clock cycle is over, and now core 1 can perform the work while the hardware dispatcher marks all the locations involved in the previous operation dirty and starts issuing the next set of commands, including wiping the previously used locations.

It would optimally require an x64 environment, resulting in a bit of over-allocation of cache, unless we knew we wouldn't need or use any registers that large.

Anyway, the point is, we could do it, we are moving to it, and it is going to happen.
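The cycle-by-cycle handoff above (two producers on separate units, a consumer that waits on both) is basically fork/join. A minimal sketch using Python threads as stand-ins for the CPU core and the GPU subunit; all names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy version of the dispatch described above: two independent "threads"
# run on separate units, and a third unit multiplies their results.
def thread_a():        # imagine this running on CPU core 0
    return 6.0

def thread_b():        # imagine this running on GPU subunit 1
    return 7.0

with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(thread_a)
    fb = pool.submit(thread_b)
    # "core 1" blocks until both results exist, then multiplies them
    result = fa.result() * fb.result()

print(result)  # 42.0
```

The hard part, as the thread keeps circling back to, is doing this join in hardware per instruction rather than in software per task.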


----------



## Aquinus (Apr 30, 2012)

Steevo said:


> The primary issue of data dependencies could be overcome by the controller issuing thread A to core 0 and thread B to GPU subunit 1.



The GPU has no concept of threads or sequential code. The CPU dispatches these things and the GPU does them. I'm not saying it doesn't work; I'm saying it doesn't work the way you're describing.


----------



## sergionography (Apr 30, 2012)

Aquinus said:


> Memory is becoming generalized, correct. What the instructions are, how they are dispatched, and how long they take to run is not. That is what what you're describing and it cannot be done with current iGPU solutions, at least not well enough to be worth it. The iGPU might be on the same die, but computationally speaking, there is nothing that couples CPU cores to the GPU. Also OpenCL requires drivers that support it, you can't just use OpenCL without telling the computer how to do it, which is what the drivers do. Basically (using Llano as an example,) the iGPU had the PCI-E/Memory bus replaced by the CPU's north bridge and that is it. Like I said before, GPUs don't work the same way as a CPU and they're not close enough to do what you're describing.
> 
> Also once again, as I said, if it really was as easy as your describe it, someone who has been working on this for years with doctoral degrees in either Comp Sci or EE would figure it out before you and since it practically doesn't exist, I'm very certain that you're describing is purely theoretically without any real-world background to back up such claims.



Yes, but with HSA there won't be a separate CPU and GPU any more; the line between CPU and GPU is blurring until they become one, mushed together to pretty much build a CPU with GPU capabilities or vice versa. Integer cores and graphics cores will be integrated at the architecture level, probably sharing the same L3/L2 cache for all we know. I'm no expert in the nitty-gritty details, but I've read about it a bit.
Also, note that AMD is developing a compiler for this. That's the research we heard about, some university research getting big performance gains from software optimization or something, which turned out to be work on a new compiler.


----------



## Aquinus (Apr 30, 2012)

sergionography said:


> Yes, but with HSA there won't be a separate CPU and GPU any more; the line between CPU and GPU is blurring until they become one, mushed together to pretty much build a CPU with GPU capabilities or vice versa. Integer cores and graphics cores will be integrated at the architecture level, probably sharing the same L3/L2 cache for all we know. I'm no expert in the nitty-gritty details, but I've read about it a bit.
> Also, note that AMD is developing a compiler for this. That's the research we heard about, some university research getting big performance gains from software optimization or something, which turned out to be work on a new compiler.



But we're not there yet.  You're seeing what happens to AMD when they try to change things, and you see what happens when Intel makes the same thing better. Also you will find that making a true HSA processor is no easy task. Hence why AMD started with Llano and Bulldozer.


----------



## sergionography (May 1, 2012)

Aquinus said:


> But we're not there yet.  You're seeing what happens to AMD when they try to change things, and you see what happens when Intel makes the same thing better. Also you will find that making a true HSA processor is no easy task. Hence why AMD started with Llano and Bulldozer.



Yes, true, but it's good they are taking it a step at a time, and I love how AMD is usually the brave one making the major change even though Intel has all the money. Notice that x86 computing today is pretty much AMD's making: it was AMD who pushed Intel toward the high-IPC approach they are taking right now. Also remember AMD64; it was really good, so Intel adopted it. If HSA turns out well, it can also be a game changer for both teams.


----------



## xenocide (May 1, 2012)

sergionography said:


> Yes, true, but it's good they are taking it a step at a time, and I love how AMD is usually the brave one making the major change even though Intel has all the money. Notice that x86 computing today is pretty much AMD's making: it was AMD who pushed Intel toward the high-IPC approach they are taking right now. Also remember AMD64; it was really good, so Intel adopted it. If HSA turns out well, it can also be a game changer for both teams.



Intel was well on its way to a hard switch to 64-bit computing, which would have meant that around the time the AMD Athlon 64s came out, everyone would make the jump to 64-bit.  It was awesome that AMD got x86-64 working for the sake of backwards compatibility, but it's also the reason we're still stuck with 32-bit programs to this day when most of us have 64-bit computers.

AMD definitely was on the right path with high IPC, but with Bulldozer they decided to go all NetBurst and trade IPC for substantially higher clocks.  Worked well for Intel back then, right?


----------



## sergionography (May 1, 2012)

xenocide said:


> Intel was well on its way to a hard switch to 64-bit computing, which would have meant that around the time the AMD Athlon 64s came out, everyone would make the jump to 64-bit.  It was awesome that AMD got x86-64 working for the sake of backwards compatibility, but it's also the reason we're still stuck with 32-bit programs to this day when most of us have 64-bit computers.
> 
> AMD definitely was on the right path with high IPC, but with Bulldozer they decided to go all NetBurst and trade IPC for substantially higher clocks.  Worked well for Intel back then, right?



No, AMD didn't go "low IPC".
They just decided not to focus only on IPC but also to get more efficient execution. In theory a Bulldozer module has 25% more compute capability than a Phenom II core, but due to improper tuning and high latencies it takes a good 30-40% hit, which puts its per-core IPC behind Phenom II. If things had worked out perfectly they would be years ahead, and I'm sure that's what AMD thought as well when they tested on simulated machines.

There is no point for AMD to fight Intel in the "fat cores with high IPC" battle, because AMD will always be behind: Intel has more money, which lets it stay ahead in fabrication process. AMD has to take a more efficient architectural approach to get anywhere, and that's where cores with shared resources came from. Bulldozer was supposed to take the hardware that isn't always in use by a core and share it between two cores, so that while the integer cores were crunching data in their cycle, the shared resources would be feeding the second core. I think Bulldozer was a good move in theory that just had very bad execution; what Piledriver is turning out to be is what Bulldozer was supposed to be. Also, there is nothing fundamentally wrong with Bulldozer: they can add more decoders and FPU units to increase the module's IPC, and with fine tuning the sharing would have no effect on performance whatsoever. It wasn't designed that way because Bulldozer was originally meant for 45 nm, so adding more hardware would have made the modules too big.

If Bulldozer had been competing with Nehalem it would be seen as a much better CPU, but Sandy Bridge was the problem.
Remember, Bulldozer was years late to market.


----------

