Friday, March 16th 2012
GK104 Block Diagram Explained
Specification sheets of NVIDIA's GK104 GPU left people dumbfounded at the CUDA core count, which read 1536, a 3-fold increase over that of the GeForce GTX 580 (3 × 512). The block diagram of the GK104, photographed at the NVIDIA press meet by an HKEPC photographer, reveals how it all adds up. The GK104 is built on a 28 nm fab process, with a die area of around 295 mm², according to older reports. Its component hierarchy is essentially an evolution of that of the Fermi architecture.
The hierarchy starts with the GigaThread Engine, which marshals all unprocessed and processed information between the rest of the GPU and the PCI-Express 3.0 system interface. Below it sit four graphics processing clusters (GPCs), each holding one common resource, the raster engine, and two streaming multiprocessors (SMs). This time, though, the SM has been redesigned; it is called the SMX. Each SMX has one next-generation PolyMorph 2.0 engine, an instruction cache, 192 CUDA cores, and other first-level caches. Four GPCs of two SMXs each make 8 SMXs, and 8 SMXs of 192 CUDA cores each add up to the 1536 CUDA core count. There are four raster units (amounting to 32 ROPs), 8 geometry units (each with a tessellation unit), and some third-level cache. There's also a 256-bit wide GDDR5 memory interface.
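The core-count arithmetic above can be checked in a few lines. A minimal sketch in Python, using only the figures stated in the article:

```python
# GK104 CUDA core count, per the block diagram described above
gpcs = 4            # graphics processing clusters
smx_per_gpc = 2     # SMX units per GPC
cores_per_smx = 192 # CUDA cores per SMX

smx_total = gpcs * smx_per_gpc          # 8 SMX units
cuda_cores = smx_total * cores_per_smx  # 1536 CUDA cores

print(smx_total, cuda_cores)  # → 8 1536
```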
Source:
HKEPC
23 Comments on GK104 Block Diagram Explained
Kepler is a V8 with smaller ones..
I think....
What was the Cheech and Chong routine about "feel"… feels like…
I hope I put it simply enough. I'm hearing that apart from high parallelization at the scheduler level, each small set of cores (a lower-level set than the SMX) has a performance clock/voltage domain of its own. So not all 1536 CUDA cores will be running at the same clock speed (unless there's maximum or bare-minimum load). There will be hundreds of them running at countless combinations of clocks and voltages. It's as if the GPU knows exactly how much energy each single hardware resource needs at a given load.
BTW that opens up an amazing opportunity for harvesting parts for the second SKU, though I'm not sure they'd do it or whether it would be desirable for us. Instead of having to clock (and voltage) the entire chip to the lowest common denominator, it may be possible for them to lower the clock only on the parts that don't meet requirements, while the ones that can clock "normally" (high) remain at the highest clock. It could be hard to implement, and maybe even harder to make a SKU out of it, but on the tech level it would be amazing.
also, the chance of independently setting the max clock rate for each SM is exactly nil. it might make for marginally higher yields, but it would be a net loss in productivity because of all the testing that would have to occur. it would also pretty much kill overclocking. imagine trying to OC a card with 20 different clock speed sliders and 20 separate voltage tables.
Anyway, I already questioned the feasibility of my comment regarding the possible SKUs. But tbh they could still make a SKU based on "average" clock or average performance or something like that.
An example: imagine that the chip only had 2 SMs: 1 SM capable of 900 MHz, 1 SM capable of 1000 MHz.
1) Under normal conditions it would be a 900 MHz SKU, because you have to limit the card to the lowest common denominator.
2) With dynamic clocking maybe it could be a 950 MHz SKU, because that's the average clock both SMs would be running at. Each chip would be different, but of course stock performance would be limited to a certain level, and that already happens on current cards anyway. Eehh... you didn't read what Btarunr said, right? You don't have to do anything; the chip does it by itself. You don't need 20 different sliders. There's just a main one like always, and the chip finds what's best for each SM at any given time.
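The two binning approaches in this example can be sketched in a few lines of Python. The 900/1000 MHz figures are from the comment above; the function names are my own, purely for illustration:

```python
# Hypothetical per-SM clock capabilities from the example above (MHz)
sm_clocks = [900, 1000]

# 1) Conventional binning: the whole chip runs at the slowest SM's clock.
def common_denominator_clock(clocks):
    return min(clocks)

# 2) Dynamic per-SM clocking: each SM runs at its own limit, so the
#    chip's effective clock is the average across SMs.
def average_clock(clocks):
    return sum(clocks) / len(clocks)

print(common_denominator_clock(sm_clocks))  # → 900
print(average_clock(sm_clocks))             # → 950.0
```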
:wtf: So let me get this right: by all indications you can OC the GPU core parts (that'll just be the 4x setup and PolyMorph x8?) if there is any more OC headroom, but in all likelihood you won't be able to adjust shader speed :wtf: or it's likely to be ineffective in that they may downclock anyway. Me personally, I'm not so keen on redundancy; max all the way, every day :)
jebus wizz, throw us a bone gdam it, thumbs up to that cookie or not ;)
imgur.com/a/aQmuA#6n7nC
Here you'll find slides that I think were not posted here such as something about... overclocking!
The user, yes, only sets a base clock and voltage, and the GPU sets a maximum boost clock based on that; then it goes up or down as required by GPU load and power consumption.
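That behavior can be sketched as a simple control loop. This is a toy model only, not NVIDIA's actual algorithm: the clock figures, power figures, linear power model, and step size are all invented for illustration:

```python
def boost_clock(base_mhz, max_boost_mhz, power_draw_w, power_budget_w, step_mhz=13):
    """Toy model of a boost clock: start from the base clock and step the
    clock up while there is power headroom, never exceeding the boost cap."""
    clock = base_mhz
    while clock + step_mhz <= max_boost_mhz and power_draw_w < power_budget_w:
        clock += step_mhz
        power_draw_w += 5  # assume each step costs some power (made-up linear model)
    return clock

# With power headroom, the clock rises to the boost cap;
# at the power limit, it stays at the base clock.
print(boost_clock(1006, 1058, power_draw_w=150, power_budget_w=195))  # → 1058
print(boost_clock(1006, 1058, power_draw_w=195, power_budget_w=195))  # → 1006
```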
And for a rough comparison of power consumption between the 7970 and the 680, follow this link.
The 7970 has a good 30% overclock headroom, but you have to do it manually; NVIDIA will do so when needed.
I'm assuming you will be able to set a maximum clock rate on Kepler and it will max out when needed,
kinda similar to turbo mode in CPUs,
tho setting a certain clock at all times might change everything.
That being said, I'm sure it's gonna be tricky to review this thing! But I can't wait to see the real benchmarks and how the Kepler cores perform without that dynamic clock trick.