GK104 Block Diagram Explained

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,407 (7.52/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
Specification sheets of NVIDIA's GK104 GPU left people dumbfounded at the CUDA core count, which read 1536, a 3-fold increase over that of the GeForce GTX 580 (3x 512). The block diagram of the GK104, photographed at the NVIDIA press-meet by an HKEPC photographer, reveals how it all adds up. The GK104 is built on a 28 nm fab process, with a die area of around 295 mm², according to older reports. Its component hierarchy is essentially an evolution of that of the Fermi architecture.

The hierarchy starts with the GigaThread Engine, which marshals all the unprocessed and processed information between the rest of the GPU and the PCI-Express 3.0 system interface. Below this are four graphics processing clusters (GPCs), each of which holds one common resource, the raster engine, and two streaming multiprocessors (SMs). This time, innovation has gone into redesigning the SM, which is now called SMX. Each SMX has one next-generation PolyMorph 2.0 engine, instruction cache, 192 CUDA cores, and other first-level caches. So four GPCs of two SMXs each make 8 SMXs, and 8 SMXs of 192 CUDA cores each amount to the 1536 CUDA core count. There are four raster units (amounting to 32 ROPs), 8 geometry units (each with a tessellation unit), and a shared L2 cache. There's a 256-bit wide GDDR5 memory interface.
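For anyone who wants to check the arithmetic, here is a minimal sketch of how the counts multiply out. The constants are simply the figures quoted above; the function name is made up for illustration and is not part of any NVIDIA tool.

```python
# Figures quoted in the post above (not read from hardware).
GPCS = 4             # graphics processing clusters
SMX_PER_GPC = 2      # streaming multiprocessors per GPC
CORES_PER_SMX = 192  # CUDA cores per SMX
GTX_580_CORES = 512  # Fermi flagship, for comparison


def cuda_core_count(gpcs=GPCS, smx_per_gpc=SMX_PER_GPC, cores_per_smx=CORES_PER_SMX):
    """Total CUDA cores for a GPC -> SMX -> core hierarchy."""
    return gpcs * smx_per_gpc * cores_per_smx


total_smx = GPCS * SMX_PER_GPC
total_cores = cuda_core_count()
print(f"SMX units:  {total_smx}")
print(f"CUDA cores: {total_cores}")
print(f"vs GTX 580: {total_cores / GTX_580_CORES:.0f}x")
```

The same product with one PolyMorph engine per SMX also accounts for the 8 geometry units.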



 

the54thvoid

Super Intoxicated Moderator
Staff member
Joined
Dec 14, 2009
Messages
13,172 (2.39/day)
Location
Glasgow - home of formal profanity
Processor Ryzen 7800X3D
Motherboard MSI MAG Mortar B650 (wifi)
Cooling be quiet! Dark Rock Pro 4
Memory 32GB Kingston Fury
Video Card(s) Gainward RTX4070ti
Storage Seagate FireCuda 530 M.2 1TB / Samsumg 960 Pro M.2 512Gb
Display(s) LG 32" 165Hz 1440p GSYNC
Case Asus Prime AP201
Audio Device(s) On Board
Power Supply be quiet! Pure POwer M12 850w Gold (ATX3.0)
Software W10
Wish I knew what that all actually meant...:wtf:
 
Joined
Sep 25, 2007
Messages
5,966 (0.94/day)
Location
New York
Processor AMD Ryzen 9 5950x, Ryzen 9 5980HX
Motherboard MSI X570 Tomahawk
Cooling Be Quiet Dark Rock Pro 4(With Noctua Fans)
Memory 32Gb Crucial 3600 Ballistix
Video Card(s) Gigabyte RTX 3080, Asus 6800M
Storage Adata SX8200 1TB NVME/WD Black 1TB NVME
Display(s) Dell 27 Inch 165Hz
Case Phanteks P500A
Audio Device(s) IFI Zen Dac/JDS Labs Atom+/SMSL Amp+Rivers Audio
Power Supply Corsair RM850x
Mouse Logitech G502 SE Hero
Keyboard Corsair K70 RGB Mk.2
VR HMD Samsung Odyssey Plus
Software Windows 10
I think they made a good decision to cut individual core performance to fit more SPs in each SM. They did the same thing when they moved from the GT2XX series to the GT4XX series and it was worth it. And 1000 MHz stock is really surprising. I wonder what problems they ran into with the GK100 to prevent a release (knowing Nvidia, they're just milking the market because they know they have the faster cards, maybe, but oh well).
 
Joined
Jan 10, 2011
Messages
1,461 (0.29/day)
Location
[Formerly] Khartoum, Sudan.
System Name 192.168.1.1~192.168.1.100
Processor AMD Ryzen5 5600G.
Motherboard Gigabyte B550m DS3H.
Cooling AMD Wraith Stealth.
Memory 16GB Crucial DDR4.
Video Card(s) Gigabyte GTX 1080 OC (Underclocked, underpowered).
Storage Samsung 980 NVME 500GB && Assortment of SSDs.
Display(s) ViewSonic VA2406-MH 75Hz
Case Bitfenix Nova Midi
Audio Device(s) On-Board.
Power Supply SeaSonic CORE GM-650.
Mouse Logitech G300s
Keyboard Kingston HyperX Alloy FPS.
VR HMD A pair of OP spectacles.
Software Ubuntu 24.04 LTS.
Benchmark Scores Me no know English. What bench mean? Bench like one sit on?
Wish I knew what that all actually meant...:wtf:

Fermi is a V4 engine with 4 huge cylinders.
Kepler is a V8 with smaller ones..

I think....
 

Benetanegia

New Member
Joined
Sep 11, 2009
Messages
2,680 (0.48/day)
Location
Reaching your left retina.
Hmm so it's 192 SPs per SM(X), that's how they got to bundle so many of them. Plenty of warp schedulers and dispatchers to feed them too. I think it's a very sleek design, we'll have to wait and see what the efficiency is though and one downside is a relatively very small L1 cache.
 
Joined
Apr 19, 2011
Messages
2,198 (0.44/day)
Location
So. Cal.
it means it is slightly better than 7970
But you need to include those new "star" (that's btarunr's word for it, not mine) technologies, TXAA and Adaptive V-Sync, which give that "organic" framerate feel! What kinds of fertilizers are used for organic farming?
What was the Cheech and Chong routine about "feel"… feels like…
 

btarunr

It's a kickass GPU.

I hope I put it simple enough.

Hmm so it's 192 SPs per SM(X), that's how they got to bundle so many of them. Plenty of warp schedulers and dispatchers to feed them too. I think it's a very sleek design, we'll have to wait and see what the efficiency is though and one downside is a relatively very small L1 cache.

I'm hearing that, apart from high parallelization at the scheduler level, each small set of cores (a lower-level set than the SMX) has a performance clock/voltage domain of its own. So not all 1536 CUDA cores will be running at the same clock speed (unless there's maximum or bare-minimum load). There will be hundreds of them running at countless combinations of clocks and voltages. It's as if the GPU knows exactly how much energy each single hardware resource needs at a given load.
 
Joined
Mar 10, 2010
Messages
11,880 (2.19/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Gskill Trident Z 3900cas18 32Gb in four sticks./16Gb/16GB
Video Card(s) Asus tuf RX7900XT /Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores laptop Timespy 6506
Looks to be better than I expected. 4x setup and geometry engines and 8x total PolyMorph sounds like they may indeed pip the AMD crew this round. Very modular too, which should allow them (through binning) to have a reasonable range out quite quickly. No mention of tessellation engines (Fermi had 16, I think); have they incorporated that into their PolyMorph v2 or something?
 

Benetanegia

I'm hearing that, apart from high parallelization at the scheduler level, each small set of cores (a lower-level set than the SMX) has a performance clock/voltage domain of its own. So not all 1536 CUDA cores will be running at the same clock speed (unless there's maximum or bare-minimum load). There will be hundreds of them running at countless combinations of clocks and voltages. It's as if the GPU knows exactly how much energy each single hardware resource needs at a given load.

That's amazing. Any word on what's the minimum set that can be disabled? I thought that an entire SMX? But with such control over clocks and voltage it could be a lower level I guess.

BTW that opens up an amazing opportunity for harvesting parts for the second SKU, though I'm not sure they'd do it or if it is desirable for us. Instead of having to clock (and volt) the entire chip to the lowest common denominator, it may be possible for them to lower the clock only on the parts that do not meet requirements, while the ones that can clock "normally" (high) could remain at the highest clock. It could be hard to implement and maybe even harder to make a SKU out of it, but on the tech level it would be amazing.
 
Joined
Apr 19, 2011
Messages
2,198 (0.44/day)
Location
So. Cal.
has a performance clock/voltage domain of its own. So not all 1536 CUDA cores will be running at the same clock speed (unless there's maximum or bare-minimum load)
Ok that's something... if they built into the chip level the idea of using Dynamic profiles or Adaptive V-Sync to shut down sections of cuda cores and are less dependent on just changing to core clock dramatically... this may as said be a "game changer" if Nvidia really implemented and made it integral at the chip level... they may have had this from the start, and not just some hocus pocus afterthought!
 

deleted

New Member
Joined
Jan 12, 2011
Messages
79 (0.02/day)
System Name Monolith
Processor i5 2500K, 4.6 GHz at 1.30v
Motherboard P8Z68-V Pro
Cooling CM Hyper 212+
Memory 2x4 GB G.Skill Ripjaws 1333 MHz CL9
Video Card(s) EVGA GTX 570, 920/1840/2050 at 1.100v
Storage SanDisk Extreme 240 GB, WD Caviar Black 1 TB
Display(s) LG W2363D
Case Silverstone FT02B
Audio Device(s) Creative Audigy 2
Power Supply Kingwin LZG-850
That's amazing. Any word on what's the minimum set that can be disabled? I thought that an entire SMX? But with such control over clocks and voltage it could be a lower level I guess.

BTW that opens up an amazing opportunity for harvesting parts for the second SKU, though I'm not sure they'd do it or if it is desirable for us. Instead of having to clock (and volt) the entire chip to the lowest common denominator, it may be possible for them to lower the clock only on the parts that do not meet requirements, while the ones that can clock "normally" (high) could remain at the highest clock. It could be hard to implement and maybe even harder to make a SKU out of it, but on the tech level it would be amazing.

i doubt they will disable less than an sm at a time. if the difference in hardware between two skus is less than 10 percent, then the difference in performance is almost guaranteed to be even less than that. no one is going to spend another 50 bucks to get 3 or 4 more fps.
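To put rough numbers on the granularity argument, here is a small sketch. It assumes the 8-SMX, 192-cores-per-SMX layout from the opening post, and it assumes (hypothetically) that a cut-down SKU would disable whole SMXs. The function name is made up for illustration.

```python
TOTAL_SMX = 8        # from the opening post: 4 GPCs x 2 SMXs
CORES_PER_SMX = 192  # from the opening post


def harvested_sku(disabled_smx, total_smx=TOTAL_SMX, cores_per_smx=CORES_PER_SMX):
    """Return (remaining CUDA cores, fraction of cores lost) for a harvested part."""
    if not 0 <= disabled_smx <= total_smx:
        raise ValueError("disabled_smx must be between 0 and total_smx")
    remaining = (total_smx - disabled_smx) * cores_per_smx
    lost_fraction = disabled_smx / total_smx
    return remaining, lost_fraction


for n in range(3):
    cores, lost = harvested_sku(n)
    print(f"{n} SMX disabled: {cores} cores, {lost:.1%} fewer")
```

With only 8 SMXs on the chip, disabling a single one already removes 12.5% of the shader hardware, so the smallest whole-SMX step sits above the 10 percent mark mentioned above.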

also, the chance of independently setting the max clock rate for each sm is exactly nil. it might make for marginally higher yields, but it would be a net loss in productivity because of all of the testing that would have to occur. it would also pretty much kill overclocking. imagine trying to oc a card with 20 different clock speed sliders and 20 separate voltage tables.
 

Benetanegia

i doubt they will disable less than an sm at a time. if the difference in hardware between two skus is less than 10 percent, then the difference in performance is almost guaranteed to be even less than that. no one is going to spend another 50 bucks to get 3 or 4 more fps.

2 words: Lower clocks.

also, the chance of independently setting the max clock rate for each sm is exactly nil. it might make for marginally higher yields, but it would be a net loss in productivity because of all of the testing that would have to occur.

That is a valid point, requiring to have some kind of "profile" or qualification for each SM could be hard, which is why I did say it could be hard, but considering the huge amount of control that they already put there, I don't think it would be tremendously far-fetched to think about some kind of hardware automation for the next iteration, so that each SM can find (and report) its best clock and use the best voltage accordingly (if it does not do that already).

Anyway, I already questioned the feasibility of my comment regarding the possible SKUs. But tbh they could still make a SKU based on "average" clock or average performance or something like that.

An example: imagine that the chip only had 2 SMs: 1 SM capable of 900 MHz, 1 SM capable of 1000 MHz.

1) Under normal conditions it would be a 900 MHz SKU, because you have to limit the card to the lowest common denominator.
2) With dynamic clocking maybe it could be a 950 MHz SKU, because that's the average clock both SMs would be running. Each chip would be different, but of course stock performance would be limited to a certain level, and that already occurs on current cards anyway.
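The two cases above can be sketched in a few lines. This is only an illustration of the hypothetical two-SM example, not a description of how Kepler actually rates its parts; the function names are made up.

```python
from statistics import mean


def lockstep_sku_clock(sm_max_clocks_mhz):
    """Case 1: every SM runs at one shared clock, so the slowest SM sets the rating."""
    return min(sm_max_clocks_mhz)


def averaged_sku_clock(sm_max_clocks_mhz):
    """Case 2: each SM runs at its own best clock, so the SKU is rated by the average."""
    return mean(sm_max_clocks_mhz)


# Hypothetical chip from the example: one SM good for 900 MHz, one for 1000 MHz.
sm_clocks = [900, 1000]
print("lockstep SKU:", lockstep_sku_clock(sm_clocks), "MHz")
print("averaged SKU:", averaged_sku_clock(sm_clocks), "MHz")
```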

it would also pretty much kill overclocking. imagine trying to oc a card with 20 different clock speed sliders and 20 separate voltage tables.

Eehh... you didn't read what Btarunr said, right? You don't have to do anything, the chip does it by itself. You don't have to do 20 different sliders. There's just a main one like always and the chip finds which is best for each SM at any given time.
 
Joined
Apr 16, 2010
Messages
2,072 (0.38/day)
System Name iJayo
Processor i7 14700k
Motherboard Asus ROG STRIX z790-E wifi
Cooling Pearless Assasi
Memory 32 gigs Corsair Vengence
Video Card(s) Nvidia RTX 2070 Super
Storage 1tb 840 evo, Itb samsung M.2 ssd 1 & 3 tb seagate hdd, 120 gig Hyper X ssd
Display(s) 42" Nec retail display monitor/ 34" Dell curved 165hz monitor
Case O11 mini
Audio Device(s) M-Audio monitors
Power Supply LIan li 750 mini
Mouse corsair Dark Saber
Keyboard Roccat Vulcan 121
Software Window 11 pro
Benchmark Scores meh... feel me on the battle field!
Sound simply amazing......:eek:...fingers crossed that they actually deliver and its not all theory.... Future looking much greener. So gonna build something needlessly and senselessly over powered because... THE TECH IS THERE
 
Joined
Mar 10, 2010
Messages
11,880 (2.19/day)
Location
Manchester uk
BTW that opens up an amazing opportunity for harvesting parts for the second SKU, though I'm not sure they'd do it or if it is desirable for us. Instead of having to clock (and volt) the entire chip to the lowest common denominator, it may be possible for them to lower the clock only on the parts that do not meet requirements, while the ones that can clock "normally" (high) could remain at the highest clock. It could be hard to implement and maybe even harder to make a SKU out of it, but on the tech level it would be amazing.

thats what i had just said:rolleyes:

:wtf: So let me get this right: by all indications you can OC the GPU core parts (that'll just be the 4x setup and 8x PolyMorph?) if there is any more OC headroom, but in all likelihood you won't be able to adjust shader speed :wtf: or it's likely to be ineffective in that they may downclock anyway. Me personally, I'm not so keen on redundancy, max all the way every day :)

jebus wizz ,throw us a bone gdam it ,thumbs up to that cookie or not;)
 
Joined
Oct 29, 2010
Messages
2,972 (0.57/day)
System Name Old Fart / Young Dude
Processor 2500K / 6600K
Motherboard ASRock P67Extreme4 / Gigabyte GA-Z170-HD3 DDR3
Cooling CM Hyper TX3 / CM Hyper 212 EVO
Memory 16 GB Kingston HyperX / 16 GB G.Skill Ripjaws X
Video Card(s) Gigabyte GTX 1050 Ti / INNO3D RTX 2060
Storage SSD, some WD and lots of Samsungs
Display(s) BenQ GW2470 / LG UHD 43" TV
Case Cooler Master CM690 II Advanced / Thermaltake Core v31
Audio Device(s) Asus Xonar D1/Denon PMA500AE/Wharfedale D 10.1/ FiiO D03K/ JBL LSR 305
Power Supply Corsair TX650 / Corsair TX650M
Mouse Steelseries Rival 100 / Rival 110
Keyboard Sidewinder/ Steelseries Apex 150
Software Windows 10 / Windows 10 Pro
thats what i had just said:rolleyes:

:wtf: So let me get this right: by all indications you can OC the GPU core parts (that'll just be the 4x setup and 8x PolyMorph?) if there is any more OC headroom, but in all likelihood you won't be able to adjust shader speed :wtf: or it's likely to be ineffective in that they may downclock anyway. Me personally, I'm not so keen on redundancy, max all the way every day :)

jebus wizz ,throw us a bone gdam it ,thumbs up to that cookie or not;)

Have a look at this link:

http://imgur.com/a/aQmuA#6n7nC

Here you'll find slides that I think were not posted here such as something about... overclocking!
 

deleted

Eehh... you didn't read what Btarunr said, right? You don't have to do anything, the chip does it by itself. You don't have to do 20 different sliders. There's just a main one like always and the chip finds which is best for each SM at any given time.
I did read what he said. What I was referring to is what you were talking about here:

An example: imagine that the chip only had 2 SMs: 1 SM capable of 900 MHz, 1 SM capable of 1000 MHz.

1) Under normal conditions it would be a 900 MHz SKU, because you have to limit the card to the lowest common denominator.
2) With dynamic clocking maybe it could be a 950 MHz SKU, because that's the average clock both SMs would be running. Each chip would be different, but of course stock performance would be limited to a certain level, and that already occurs on current cards anyway.
There's no way for the GPU to know at what clocks and voltages it's stable. You have to test it and figure it out and tell it. If you're trying to overclock a card with asymmetrical maximum clock speeds and voltages, you're going to have to figure out the best clock and voltage for each SM. That's simply unfeasible. The way it's going to work is that you will determine a single max clock speed and voltage for the card, and it will underclock itself when it determines that it doesn't need the additional processing power.
 

Benetanegia

There's no way for the GPU to know at what clocks and voltages it's stable. You have to test it and figure it out and tell it. If you're trying to overclock a card with asymmetrical maximum clock speeds and voltages, you're going to have to figure out the best clock and voltage for each SM. That's simply unfeasible.

I'm not talking about OC, as in users OCing the cards, I never did. I'm talking about factory profiles and yes they are feasible.

The way it's going to work is that you will determine a single max clock speed and voltage for the card, and it will underclock itself when it determines that it doesn't need the additional processing power.

Kepler cards already do much more than that according to the info revealed, which once again makes me think that you have not read about it. When the card detects that power consumption is lower than a previously set value, it overclocks/overvolts itself until the limit is reached.

The user, yes, only sets a base clock and voltage and the GPU sets a maximum boost clock based on that, then it goes up or down as required by GPU load and power consumption.
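A rough sketch of the behaviour described above, as I understand it from the revealed info. Everything here is an assumption for illustration: the function name, the 10 MHz step, and the wattage and clock figures are hypothetical, and the real logic lives in hardware and firmware that has not been published in this thread.

```python
def next_clock_mhz(current_mhz, power_w, power_target_w,
                   base_mhz, max_boost_mhz, step_mhz=10):
    """One step of a power-target boost loop (hypothetical model).

    Below the power target the clock steps up, capped at max_boost_mhz.
    At or above the target it steps down, but never below base_mhz.
    """
    if power_w < power_target_w:
        return min(current_mhz + step_mhz, max_boost_mhz)
    return max(current_mhz - step_mhz, base_mhz)


# Hypothetical card: 1000 MHz base, 1100 MHz max boost, 170 W power target.
clock = 1000
for measured_power_w in [150, 155, 160, 175, 180, 150]:
    clock = next_clock_mhz(clock, measured_power_w, 170, 1000, 1100)
    print(f"{measured_power_w} W -> {clock} MHz")
```

The point of the sketch is that the user-facing setting is still a single base clock (plus a limit), and the per-moment clock is derived from measured power rather than from a slider.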
 
Joined
Feb 13, 2012
Messages
523 (0.11/day)
thats very interesting, tho it seems on par with 7970 without this fancy dynamic clock thing
7970 has a good 30% overclock headroom but you have to do it manually , nvidia will do so when needed
im assuming you will be able to set maximum clock rate on the kepler and it will max out when needed
kinda similar to turbo mode in cpus
tho setting a certain clock at all times might change everything
that being said im sure its gonna be tricky to review this thing! but cant wait to see the real benchmarks and how the kepler cores perform without that dynamic clock trick
 
Joined
Aug 7, 2009
Messages
605 (0.11/day)
Processor Intel i7-940 @ 3.5Ghz
Motherboard Asus P6X58D-E
Cooling Corsair H70
Memory 12GB OCZ Platinum XTC DDR3 1600mhz CL7
Video Card(s) EVGA GTX 780ti
Storage Revodrive X2 240GB, 5TB HDD storage
Display(s) Asus PB278Q 27''
Case Antec Lanboy Air
Audio Device(s) Asus Xonar D2X
Power Supply Corsair HX850W
Software Windows 7 x64
The hierarchy starts with the GigaThread Engine, which marshals all the unprocessed and processed information between the rest of the GPU and the PCI-Express 3.0 system interface. Below this are four graphics processing clusters (GPCs), each of which holds one common resource, the raster engine, and two streaming multiprocessors (SMs). This time, innovation has gone into redesigning the SM, which is now called SMX. Each SMX has one next-generation PolyMorph 2.0 engine, instruction cache, 192 CUDA cores, and other first-level caches. So four GPCs of two SMXs each make 8 SMXs, and 8 SMXs of 192 CUDA cores each amount to the 1536 CUDA core count. There are four raster units (amounting to 32 ROPs), 8 geometry units (each with a tessellation unit), and a shared L2 cache.

Oh, right.
 
Joined
Mar 10, 2010
Messages
11,880 (2.19/day)
Location
Manchester uk
nearly,,, there:rockout::cry::(
 