Friday, September 24th 2010
AMD Orochi "Bulldozer" Die Holds 16 MB Cache
Documents related to AMD's 8-core "Orochi" processor, based on its next-generation Bulldozer architecture, reveal a cache hierarchy that comes as a bit of a surprise. Earlier this month, at a GlobalFoundries-hosted conference, AMD displayed the first die-shot of the Orochi die, which legibly showed key features including the four Bulldozer modules, which hold two cores each, and large L2 caches. On coarse visual inspection, the L2 cache of each module seems to cover 35% of its area. The L3 cache is located along the center of the die. The documents seen by X-bit Labs reveal that each Bulldozer module has its own 2 MB L2 cache shared between its two cores, and that an 8 MB L3 cache is shared between all four modules (8 cores).
This takes the total cache on Orochi all the way up to 16 MB. This hierarchy suggests that AMD wants to give individual cores access to a large amount of faster cache (that's a whopping 2048 KB compared to 512 KB per core on Phenom, and 256 KB per core on Core i7), which facilitates faster inter-core, intra-module communication. Inter-module communication is enhanced by the 8 MB L3 cache. Compared to the current "Istanbul" six-core K10-based die, that's a 77% increase in cache amount for a 33% core count increase, and a 300% increase in the L2 cache accessible to each core. Orochi is built on a 32 nm GlobalFoundries process, and it is sure to have a very high transistor count.
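The percentages above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the Orochi figures are the unconfirmed ones quoted from the documents, and the Istanbul figures (512 KB of L2 per core, 6 MB of L3) are assumptions, not numbers stated in this article.

```python
KB_PER_MB = 1024


def pct_increase(new, old):
    """Percentage increase going from old to new."""
    return (new / old - 1) * 100


# Orochi, as quoted in the article (not confirmed by AMD)
orochi_cores = 8
orochi_modules = 4
orochi_l2_per_module_kb = 2 * KB_PER_MB  # shared by the two cores of a module
orochi_l3_kb = 8 * KB_PER_MB
orochi_total_kb = orochi_modules * orochi_l2_per_module_kb + orochi_l3_kb

# Istanbul: ASSUMED 512 KB L2 per core and 6 MB shared L3
istanbul_cores = 6
istanbul_l2_per_core_kb = 512
istanbul_l3_kb = 6 * KB_PER_MB
istanbul_total_kb = istanbul_cores * istanbul_l2_per_core_kb + istanbul_l3_kb

print(f"Orochi total cache:   {orochi_total_kb / KB_PER_MB:.0f} MB")
print(f"Istanbul total cache: {istanbul_total_kb / KB_PER_MB:.0f} MB")
print(f"Cache increase:       {pct_increase(orochi_total_kb, istanbul_total_kb):.1f}%")
print(f"Core count increase:  {pct_increase(orochi_cores, istanbul_cores):.1f}%")

# L2 that one core can reach (the whole module's 2 MB) vs. Istanbul's 512 KB
accessible = pct_increase(orochi_l2_per_module_kb, istanbul_l2_per_core_kb)
# L2 if the module's 2 MB is split evenly between its two cores
even_split = pct_increase(orochi_l2_per_module_kb / 2, istanbul_l2_per_core_kb)
print(f"L2 accessible per core: +{accessible:.0f}%")
print(f"L2 per core, even split: +{even_split:.0f}%")
```

Under these assumptions the totals come to 16 MB against 9 MB. The 300% figure only holds when counting the whole 2 MB of a module as available to each core; dividing it evenly between the two cores gives a 100% increase instead.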
Source:
Xbit Labs
152 Comments on AMD Orochi "Bulldozer" Die Holds 16 MB Cache
I admit, though, that I would prefer to go with AMD: I had an AMD K6 back in the day and was so happy with it, and I was lucky enough to go through 3 AMD CPUs on my current motherboard. I would love to keep supporting them and, if I am lucky, go through another 2 or 3 CPUs on my next motherboard, as I hate the idea of having to change motherboards every time I upgrade my CPU.
And yes, JF-AMD indeed is director of product marketing for servers at AMD. Waiting for W1zzard to give him his title. He may have known details about Orochi months before anyone else did.
Each Bulldozer module has two sets of integer pipelines and both of them have a dedicated 16kB L1D. 16+16kB in total per module, 16kB per thread. Bulldozer's L1I is 64kB; that's been public for some time now. About the bracketed comment: do you think it could have been smaller, or are you unsure what size it is? If you say so... And by coincidence, Intel is doing the same. "Obviously" they too must be patching Core m-arch's "poor L1s and L2s" by adding cache levels and continuously increasing their size. No. Orochi is a 4-module, 8-thread core. Durrr...
Bulldozer does not have a 16MB L3; even reading the thread title should give away that the L3 is 8MB. 2MB L2 + 2MB L3 per module, that is. Thus, per module, Orochi has 8× as much L2 vs. Nehalem and an equal L3 ratio. Strange conclusion, considering the public (that includes me and you) doesn't know Bulldozer's exact pipeline length yet. Broken sentence. What are you trying to say?
Do you believe it is 20+ stages, or do you not?
Also, the clock rates are completely unknown to the public. Oh really? Now one can only wonder why Intel didn't see such a shortcoming of their L2 before taping out Nehalem, Sandy Bridge... They must have missed the fact that their chips' L2 had shrunk to a fraction of its size compared to Conroe, Penryn.
PS.
In case you find some parts of my reply sarcastic, it is highly likely you are right.
Abstract for those with the "TL;DR" syndrome:
Burger, please get your facts straight. The factual errors I've pointed out are public knowledge, go read them. And please do pay attention to writing proper English, often it is impossible to figure out what you're trying to say as many of your sentences are missing words and the words that are there are often misspelled.
JF-AMD, thank you for your contribution to the thread.
I'm really looking forward to Bulldozer and I hope it succeeds, both in Server and Desktop markets :)
That's just not true. Bulldozer's L1I and L2 are fully integrated parts of the BD module and they run at core freq, and no less. Bulldozer has 4T L1 latency, same as Nehalem's. Especially if the one "imagining things" is using incorrect numbers... What can I say, once again you astound (but not surprise) by posting utter nonsense. Feeling particularly "blue", perhaps? And by saying that I'm not referring to mood.
But what can you do, a troll is a troll is a troll.
Or will there be variation in the clock rate of the Orochi design?
Your NBs are quite good.
2nd, will these be so different compared to K8, K10, and K10.5 that vMotion won't work from K10.5 -> Bulldozer?
If we still can, I'll be praising AMD for my servers for a few more years! :P
Why? It has about 10 MB/sec more sequential, better 4k, 512, and so on. It's not by much.
But it's getting beaten by both NVIDIA and Intel.
www.tomshardware.com/reviews/ich10r-sb750-780a,2374-10.html
I just googled a bit to find some review. Never trusted Tom's too much, but yeah :p
It's not like I'm banging my head against the wall over my SSD performance, it's just: there is more to get here!
Now I want to ask JF-AMD another question, which is related to the previous one. In the blog he mentions that from 33% more cores we get 50% more performance. The test was between Magny-Cours (12-core) and Interlagos (16-core, Bulldozer architecture). Will we get the same performance increase on the client processors? Since the increase from 6 to 8 cores is nearly 33%, should we expect a 50% performance jump from Phenom II? If this happens, will it come with an equal increase in price?
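For what it's worth, the arithmetic behind that question can be sketched as below. This is only a back-of-the-envelope calculation: it takes the "33% more cores, 50% more performance" claim at face value and assumes, purely for illustration, that a client part would scale the same way as the server throughput test. The function name is hypothetical.

```python
def per_core_gain_pct(old_cores, new_cores, perf_ratio):
    """Implied per-core throughput gain (in %) when total performance
    grows by perf_ratio while the core count goes old_cores -> new_cores."""
    core_ratio = new_cores / old_cores
    return (perf_ratio / core_ratio - 1) * 100


# Server claim: Magny-Cours (12 cores) -> Interlagos (16 cores), 1.5x throughput
server = per_core_gain_pct(12, 16, 1.5)

# Hypothetical client case: 6 -> 8 cores, ASSUMING the same 1.5x total gain
client = per_core_gain_pct(6, 8, 1.5)

print(f"Implied per-core gain, server: {server:.1f}%")
print(f"Implied per-core gain, client (if scaling were identical): {client:.1f}%")
```

Both cases have the same core ratio, so the implied per-core gain is identical (about 12.5%). Whether a client workload would actually see the full 50% is a separate question that this arithmetic cannot answer.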
We have not disclosed L2 or L3 sizes, so whatever you quote is not confirmed, only speculation.
L1 is within the core. L2 is within the module. L3 is within the die.
Just as Intel asks $999 for its Extreme Edition SKUs, AMD used to ask for the same $999 for its FX SKUs (back when K8 was the best client CPU architecture out there). Even today AMD can try to ask for more than $275, if it wants to develop the QuadFX platform. Enthusiasts always have $999 to spend on one Core i7 or two DSDC-capable Phenom II chips in the s1207 package. It's just that AMD's client CPU team has to wake up to that realization. Power and board costs are lame excuses.
With servers you are measuring throughput, which is how much stuff you can jam through a pipe at full utilization. Client loads are more bursty, so throughput is a less relevant measure.
:D
Now that I have taken another look at the released (heavily pixelated & manipulated) die shot, it might just be that Bulldozer's L3 is either just 6MB or a whopping 12MB. That, or the cells in L3 are actually less dense than those in L2.