
CHOO CHOOOOO!!!!1! Navi Hype Train be rollin'

Rapid Packed Math is really simple: the FP32 FPUs can alternatively handle 2xFP16 in the same space/cycle.
Well, that's its initial implementation. Later versions support lower bit ranges like 4x16-bit, 8x8-bit and 16x4-bit, and that's through 64-bit wavefronts, not 32; on 32-bit jobs it can still do 2x throughput.

This is why GCN isn't changing as soon as some would like.
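
As a rough illustration of the 2xFP16 idea (not from this thread, and in CUDA rather than GCN terms, so an analogue only): two halves get packed into one 32-bit register and a single instruction operates on both. The kernel name and shapes below are made up.

    #include <cuda_fp16.h>

    // Sketch only: two FP16 FMAs per instruction on packed __half2 values,
    // the same "2x per FP32 lane" idea as Rapid Packed Math (needs sm_53+ in CUDA terms).
    __global__ void packed_fp16_axpy(const __half2 *x, const __half2 *y,
                                     __half2 *out, int n, __half2 a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __hfma2(a, x[i], y[i]);   // a*x + y on both halves at once
    }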
 
Well, that's its initial implementation. Later versions support lower bit ranges like 4x16-bit, 8x8-bit and 16x4-bit, and that's through 64-bit wavefronts, not 32; on 32-bit jobs it can still do 2x throughput.
This is why GCN isn't changing as soon as some would like.
Vega already has 1xFP32, 2xFP16, 4xINT8 and 8xINT4, as does Turing. Pascal should have everything besides 2xFP16.
Lower bit ranges have quite limited utility, though, and have really not been used much outside of some ML applications.
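
For anyone wondering what the 4xINT8 rate looks like from the software side, here's a minimal CUDA sketch using the __dp4a intrinsic (sm_61+); the kernel name and data layout are made up for illustration.

    // Sketch only: four int8 multiplies accumulated into one int32 per instruction,
    // i.e. the 4xINT8 rate mentioned above. Each int argument holds four packed int8 lanes.
    __global__ void int8_dot4(const int *a, const int *b, int *acc, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            acc[i] = __dp4a(a[i], b[i], 0);
    }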
 
Vega already has 1xFP32, 2xFP16, 4xINT8 and 8xINT4, as does Turing. Pascal should have everything besides 2xFP16.
Lower bit ranges have quite limited utility, though, and have really not been used much outside of some ML applications.
They, meaning Nvidia, do not have RPM. They can do all of it, but do some of it with special hardware, i.e. Tensor or RT cores, and some is done by CUDA cores, but they're not doing it the same way at all.

I have a Vega; I know what it can do.
 
CHOO CHOOOOO!!!!1! Navi Hype Train be rollin'

Looks like the thread title was written by a 5-year-old.....
 
They, meaning Nvidia, do not have RPM. They can do all of it, but do some of it with special hardware, i.e. Tensor or RT cores, and some is done by CUDA cores, but they're not doing it the same way at all.
Yes, Nvidia has a different implementation. Does it matter all that much as long as the same feature set is there?
 
Yes, Nvidia has a different implementation. Does it matter all that much as long as the same feature set is there?
It does to Nvidia and AMD, but not so much to us, no.
But that said, Nvidia are making quite a big deal at the moment about what their special hardware can do, aren't they?
 
But that said, Nvidia are making quite a big deal at the moment about what their special hardware can do, aren't they?
Well, it depends on the context or features/hardware in question.
A couple of operations Nvidia implemented in hardware as RT Cores do seem somewhat worth hyping: definitely doable in shaders, but RT Cores are clearly much more efficient at them.
Tensor cores are a question mark, but it looks like Nvidia has been somewhat hush-hush about what they actually do. For example, the fact that FP16 is done (or can be done) on Tensor cores is worth noting, but of the bigger sites, Anandtech was the one that caught wind of it for their TU116 review. I would say this is interesting.
 
CHOO CHOOOOO!!!!1! Navi Hype Train be rollin'

Looks like the thread title was written by a 5-year-old.....

I am 7 actually, sheesh. Age discrimination. The thread was supposed to be fun (and a joke) because everybody is salty as fuck. Like you. Carry on.
 
Well, it depends on the context or features/hardware in question.
A couple of operations Nvidia implemented in hardware as RT Cores do seem somewhat worth hyping: definitely doable in shaders, but RT Cores are clearly much more efficient at them.
Tensor cores are a question mark, but it looks like Nvidia has been somewhat hush-hush about what they actually do. For example, the fact that FP16 is done (or can be done) on Tensor cores is worth noting, but of the bigger sites, Anandtech was the one that caught wind of it for their TU116 review. I would say this is interesting.
So it does matter, but only when it's Nvidia lauding it. Anywho.

In the context of this thread we probably need to get more on topic.
 
I am 7 actually, sheesh. Age discrimination. The thread was supposed to be fun (and a joke) because everybody is salty as fuck. Like you. Carry on.
LOL I'm not salty, just wish ppl could act and write more like adults.... This site has gone downhill forum-wise the last few years...
 
LOL I'm not salty, just wish ppl could act and write more like adults.

I'm sorry you couldn't see the joke that it was. I have ordered a happy meal for you.
 
Tensor cores are a question mark, but it looks like Nvidia has been somewhat hush-hush about what they actually do. For example, the fact that FP16 is done (or can be done) on Tensor cores is worth noting, but of the bigger sites, Anandtech was the one that caught wind of it for their TU116 review. I would say this is interesting.

For RTX TU, FP16 is exclusively a tensor op. GTX TU FP16 is interesting given there are no tensors, according to NV. I'm not entirely convinced the hardware is very different. The TU SM layout is more tightly packed than GP, but RTX/Tensor silicon appears to be only ~10% of the die. The TU uarch consumes more area even without the RTX pipeline. Given RTX features only make sense above a minimum raster performance level (2060), I wouldn't be surprised if GTX TU had similar hardware but limited to FP16 ops. The big benefit of RTX tensor cores IMO is the FP32 accumulate for data science.
 
I am 7 actually, sheesh. Age discrimination. The thread was supposed to be fun (and a joke) because everybody is salty as fuck. Like you. Carry on.

should be in General Nonsense
 
I'm not sure where you are going with this, but thanks for the tip, and that's what I said: mixed precision. Anyway, my confusion with you is about a different matter, so let me ask you straight: I understand that tensor cores are, for you, about AI or deep learning, or did I just understand you wrong? Because that's my impression.
The add is the only one that supports FP32, and the reason for that is so it is less likely to overflow the FP16*FP16 result. The main point (and why it is good for AI) is that it is a matrix solver for TensorFlow. AMD doesn't have a matrix solver; GCN has to do these calculations on the shaders, which is much, much slower. Example: Vega can do about 24 TFLOPS FP16; Volta can do over 100 TFLOPS FP16 in its tensor cores alone.
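
For reference, this is roughly what using that matrix hardware looks like through CUDA's WMMA API (sm_70+). It's a minimal sketch, not anything from this thread: one warp does a 16x16x16 tile with FP16 inputs and an FP32 accumulator; the kernel name and the zeroed C matrix are made up.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Sketch: one warp computes D = A*B + C on a 16x16x16 tile.
    // Inputs are FP16; the accumulator fragment is FP32, which is the
    // "only the add supports FP32" part discussed above.
    __global__ void tile_mma(const half *A, const half *B, float *D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);       // C = 0 for this sketch
        wmma::load_matrix_sync(a, A, 16);     // leading dimension 16
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);       // FP16 multiplies, FP32 accumulate
        wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
    }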

They, meaning Nvidia, do not have RPM. They can do all of it, but do some of it with special hardware, i.e. Tensor or RT cores, and some is done by CUDA cores, but they're not doing it the same way at all.

I have a Vega; I know what it can do.
NVIDIA added parallelism in Turing to deal with the problem, where AMD made Vega more flexible. As a result, Turing has a lot of transistors but more performance, where Vega has fewer transistors but less performance.

AMD is going to want to compete in AI, so AMD is going to have to add tensor cores eventually, but I don't think that is in Navi, because it was made for Sony, which has no use for it.
 
NVIDIA added parallelism in Turing to deal with the problem, where AMD made Vega more flexible. As a result, Turing has a lot of transistors but more performance, where Vega has fewer transistors but less performance.

Not entirely the same. GCN makes no distinction between graphics & compute modes & can schedule concurrently. TU is better at this than GP et al, but parallelism is a function of running integers & floats at the same time. Just highlights the different uarch approaches. NV prefers discrete specialized silicon costing more die space, whereas AMD ('til now) has preferred generalist ALUs.
 
Turing doesn't sacrifice anything (other than die space) for concurrent FP16 performance. Vega gets FP16 performance by taking away from FP32 performance. This is a disadvantage for Vega and an advantage for Turing when it comes to anything that can benefit from FP16.
 
The add is the only one that supports FP32, and the reason for that is so it is less likely to overflow the FP16*FP16 result. The main point (and why it is good for AI) is that it is a matrix solver for TensorFlow. AMD doesn't have a matrix solver; GCN has to do these calculations on the shaders, which is much, much slower. Example: Vega can do about 24 TFLOPS FP16; Volta can do over 100 TFLOPS FP16 in its tensor cores alone.


NVIDIA added parallelism in Turing to deal with the problem, where AMD made Vega more flexible. As a result, Turing has a lot of transistors but more performance, where Vega has fewer transistors but less performance.

AMD is going to want to compete in AI, so AMD is going to have to add tensor cores eventually, but I don't think that is in Navi, because it was made for Sony, which has no use for it.
Nvidia couldn't easily put back 64-bit compute; they had to go with special hardware. They added tensor cores after Google ditched their GPUs for their own tensor ASIC.

And just look at how much use their specific hardware is generally; it's useless.
 
For games, mostly. Navi is a gaming product which is why I don't think it will have tensor cores. I would be shocked if Arcturus didn't have tensor cores because AMD is so far behind in machine learning. Then again, companies like Tesla are designing their own chips for machine learning anyway.

Point is: RPM doesn't help much with TensorFlow where RTX's tensor cores do. DLSS isn't something Navi will have because it will lack the hardware to do it effectively.
 
If patterns are anything to go by, the AMD hype train for their GPUs is going to crash.
 
Turing doesn't sacrifice anything (other than die space) for concurrent FP16 performance.

What does "concurrent fp16" even mean? You are aware that half floats & RPM (2xfp16) are used instead of fp32 to increase performance of ops not requiring full float precision? It's a register/resource & throughput gain in the case of 2xfp16. Int32, Int16, transcendentals, etc, still happen in the SM. TU "concurrency" is the ability to pack both integer & floats in the pipeline without bubbles/stalls/context switching.

Vega gets FP16 performance by taking away from FP32 performance. This is a disadvantage for Vega and an advantage for Turing when it comes to anything that can benefit from FP16.

Frightening. You should read the TU uarch & mixed precision white papers.

Point is: RPM doesn't help much with TensorFlow where RTX's tensor cores do. DLSS isn't something Navi will have because it will lack the hardware to do it effectively.

Tensor math is just 4x4 matrix FMA. It's the ability of the tensors to work on FP16, INT8 and INT4 that makes them useful in NN/ML. I asked someone else earlier: what do you think DLSS is?
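
To make the "4x4 matrix FMA" concrete, here's a scalar reference of the operation (illustration only; the helper name is made up). The hardware does the whole tile in one op instead of this triple loop.

    #include <cuda_fp16.h>

    // D = A*B + C on a 4x4 tile: FP16 multiplies, FP32 accumulation.
    __device__ void tile_fma_4x4(const __half A[4][4], const __half B[4][4],
                                 const float C[4][4], float D[4][4])
    {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];
                for (int k = 0; k < 4; ++k)
                    acc += __half2float(A[i][k]) * __half2float(B[k][j]);
                D[i][j] = acc;
            }
    }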
 
If patterns are anything to go by, the AMD hype train for their GPUs is going to crash.
AMD has a bad marketing team. Instead of quelling the rumors, they just let them spread like wildfire. I was one of the victims of the Radeon VII hype train. The fall from hype hurts so bad that I now consider EVERY rumor about Zen 2 and Navi as nothing but bad gossip. Frankly, I don't want to be a part of a community that harbors and encourages the spreading of bad information.
 
If one googles GFX1010:
//On GFX10 I$ is 4 x 64 bytes cache lines. By default prefetcher keeps one cache line behind and reads two ahead. We can modify it with S_INST_PREFETCH for larger loops to have two lines behind and one ahead. Therefore we can benefit from aligning loop headers if loop fits 192 bytes. If loop fits 64 bytes it always spans no more than two cache lines and does not need an alignment. Else if loop is less or equal 128 bytes we do not need to modify prefetch, else if loop is less or equal 192 bytes we need two lines behind.

-> L0 cache, which is referred to below.
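
Restating that prefetch/alignment rule as code, purely as a reading aid (the function and flag names are made up; this is not LLVM's code):

    // I$ lines are 64 bytes; the default prefetch is 1 line behind / 2 ahead, and
    // S_INST_PREFETCH can switch it to 2 behind / 1 ahead for larger aligned loops.
    static void plan_loop_alignment(unsigned loopBytes, bool *alignHeader, bool *twoLinesBehind)
    {
        *alignHeader = false;
        *twoLinesBehind = false;
        if (loopBytes <= 64 || loopBytes > 192)  // too small to matter / too big to fit
            return;
        *alignHeader = true;                     // aligning the header pays off up to 192 bytes
        if (loopBytes > 128)                     // spans 3 lines: change the prefetch window
            *twoLinesBehind = true;              // i.e. emit S_INST_PREFETCH
    }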

// In WGP mode the waves of a work-group can be executing on either CU of the WGP. Therefore need to invalidate the L0 which is per CU. Otherwise in CU mode and all waves of a work-group are on the same CU, and so the L0 does not need to be invalidated.

-> CU mode and WGP mode

// HWRC = Register destination cache
&
// Try to reassign registers on GFX10+ to reduce register bank conflicts.
// On GFX10 registers are organized in banks. VGPRs have 4 banks assigned in a round-robin fashion: v0, v4, v8... belong to bank 0. v1, v5, v9... to bank 1, etc. SGPRs have 8 banks and allocated in pairs, so that s0:s1, s16:s17, s32:s33 are at bank 0. s2:s3, s18:s19, s34:s35 are at bank 1 etc.
// The shader can read one dword from each of these banks once per cycle. If an instruction has to read more register operands from the same bank an additional cycle is needed. HW attempts to pre-load registers through input operand gathering, but a stall cycle may occur if that fails. For example V_FMA_F32 V111 = V0 + V4 * V8 will need 3 cycles to read operands, potentially incurring 2 stall cycles.
// The pass tries to reassign registers to reduce bank conflicts.
// In this pass bank numbers 0-3 are VGPR banks and 4-11 are SGPR banks, so that 4 has to be subtracted from an SGPR bank number to get the real value. This also corresponds to bit numbers in bank masks used in the pass.

-> HWRC and banking are part of Super-SIMD patents;
https://patents.google.com/patent/US20180357064A1
https://patents.google.com/patent/US20180121386A1
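
The banking scheme in the GFX10 comment quoted above boils down to a simple mapping; as a reading aid (helper names made up, not driver code):

    // VGPRs rotate through 4 banks; SGPRs are allocated in pairs across 8 banks.
    static int vgpr_bank(int v) { return v % 4; }        // v0, v4, v8...  -> bank 0
    static int sgpr_bank(int s) { return (s / 2) % 8; }  // s0:s1, s16:s17 -> bank 0

    // The comment's example, V_FMA_F32 V111 = V0 + V4 * V8: vgpr_bank(0) ==
    // vgpr_bank(4) == vgpr_bank(8) == 0, so all three source reads hit bank 0 and
    // operand gathering takes 3 cycles (up to 2 stall cycles).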

//In one embodiment, each bank of the vector destination cache holds 4 entries, for a total 8 entries with 2 banks.
-> destination register cache // HWRC => 8 destination registers with 3-entry source operand forwarding.

//In one embodiment, source operands buffer holds up to 6 VALU instruction's source operands. In one embodiment, source operand buffer includes dedicated buffers for providing 3 different operands per clock cycle to serve instructions like a fused multiply-add operation which performs a*b+c.
-> source operand buffer => 6 * 3-entry source operand buffer
 