CHOO CHOOOOO!!!!1! Navi Hype Train be rollin'

TheoneandonlyMrK · May 7, 2019

FordGT90Concept said:
Rapid Packed Math is really simple: the FP32 FPUs can alternatively handle 2xFP16 in the same space/cycle.

Well, that's its initial implementation; later versions support lower bit ranges like 4x16-bit, 8x8-bit and 16x4-bit, and that's through 64-bit wavefronts, not 32. On 32-bit jobs it can still do 2x throughput.

This is why GCN isn't changing as soon as some would like.
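
To make the packing idea concrete, here is a minimal Python sketch of two FP16 values sharing one 32-bit word, with an add applied to both halves at once. It only illustrates the data layout; the helper names (`pack2xfp16`, `unpack2xfp16`, `packed_add`) are made up for this example and say nothing about how the hardware actually implements it.

```python
import struct

def pack2xfp16(lo, hi):
    """Pack two half-precision values into one 32-bit word (lo in bits 0-15)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def unpack2xfp16(word):
    """Split a 32-bit word back into its two half-precision lanes."""
    return struct.unpack("<2e", struct.pack("<I", word))

def packed_add(a, b):
    """Add both FP16 lanes of two packed words, lane by lane."""
    a_lo, a_hi = unpack2xfp16(a)
    b_lo, b_hi = unpack2xfp16(b)
    return pack2xfp16(a_lo + b_lo, a_hi + b_hi)

x = pack2xfp16(1.5, -2.0)
y = pack2xfp16(0.25, 4.0)
print(unpack2xfp16(packed_add(x, y)))  # (1.75, 2.0)
```

One register slot carries two operands, so a single packed instruction does the work of two FP16 adds.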

londiste · May 7, 2019

theoneandonlymrk said:
Well, that's its initial implementation; later versions support lower bit ranges like 4x16-bit, 8x8-bit and 16x4-bit, and that's through 64-bit wavefronts, not 32. On 32-bit jobs it can still do 2x throughput.
This is why GCN isn't changing as soon as some would like.

Vega already has 1xFP32, 2xFP16, 4xINT8 and 8xINT4, so does Turing. Pascal should have everything besides 2xFP16.
Lower bit ranges have quite limited utility though, and these have not really been used much outside of some ML applications.
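
For anyone wondering where the 1x/2x/4x/8x figures come from, they are just how many elements of a given width fit in a 32-bit register. A rough Python sketch follows, with a 4xINT8 dot product as the kind of operation ML code uses these lanes for. The function names are hypothetical and this is not meant to mirror any vendor's exact instruction semantics.

```python
import struct

REGISTER_BITS = 32  # width of one register lane, per the discussion above

def lanes(element_bits):
    """Number of packed elements of the given width in one 32-bit register."""
    return REGISTER_BITS // element_bits

def pack4_int8(values):
    """Pack four signed 8-bit integers into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]

def dot4_int8(a_word, b_word, acc=0):
    """Multiply four INT8 pairs and add the products to a wider accumulator."""
    a = struct.unpack("<4b", struct.pack("<I", a_word))
    b = struct.unpack("<4b", struct.pack("<I", b_word))
    return acc + sum(x * y for x, y in zip(a, b))

print([lanes(bits) for bits in (32, 16, 8, 4)])  # [1, 2, 4, 8]
print(dot4_int8(pack4_int8([1, 2, 3, 4]), pack4_int8([5, -6, 7, -8]), acc=10))  # -8
```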

ratirt · May 7, 2019

FordGT90Concept said:
Tensor cores are FP16*FP16+(FP16|FP32) matrix solvers. Deep Learning for dummies.

I'm not sure where you are going with this, but thanks for the tip, and that's what I said: mixed precision. Anyway, my confusion with you is about a different matter. Let me ask you straight: do I understand correctly that tensor cores are AI or deep learning for you, or did I just understand you wrong? Because that's my impression.

TheoneandonlyMrK · May 7, 2019

londiste said:
Vega already has 1xFP32, 2xFP16, 4xINT8 and 8xINT4, so does Turing. Pascal should have everything besides 2xFP16.
Lower bit ranges have quite limited utility though, and these have not really been used much outside of some ML applications.

They, meaning Nvidia, do not have RPM. They can do all of it, but do some of it with special hardware, i.e. tensor or RTRT cores, and some is done by CUDA cores, but they're not doing it the same way at all.

I have a Vega; I know what it can do.

AlienIsGOD · May 7, 2019

CHOO CHOOOOO!!!!1! Navi Hype Train be rollin'

looks like thread title was written by a 5 year old.....

londiste · May 7, 2019

theoneandonlymrk said:
They, meaning Nvidia, do not have RPM. They can do all of it, but do some of it with special hardware, i.e. tensor or RTRT cores, and some is done by CUDA cores, but they're not doing it the same way at all.

Yes, Nvidia has a different implementation. Does it matter all that much as long as the same featureset is there?

TheoneandonlyMrK · May 7, 2019

londiste said:
Yes, Nvidia has a different implementation. Does it matter all that much as long as the same featureset is there?

It does to Nvidia and AMD, but not so much to us, no.
But in saying that, Nvidia are making quite the big deal at the moment about what their special hardware can do, aren't they?

londiste · May 7, 2019

theoneandonlymrk said:
But in saying that, Nvidia are making quite the big deal at the moment about what their special hardware can do, aren't they?

Well, it depends on the context or features/hardware in question.
The couple of operations Nvidia implemented in hardware as RT Cores do seem to be somewhat worth hyping: doable in shaders, definitely, but RT Cores are clearly much more efficient at them.
Tensor cores are a question but it looks like Nvidia has been somewhat hush-hush about what these actually do. For example the part where FP16 is done (or can be done) on Tensor cores is worth noting but of the bigger sites Anandtech was the one that caught wind of it for their TU116 review. I would say this is interesting.

moproblems99 · May 7, 2019

AlienIsGOD said:
CHOO CHOOOOO!!!!1! Navi Hype Train be rollin'

looks like thread title was written by a 5 year old.....

I am 7 actually, sheesh. Age discrimination. The thread was supposed to be fun (and a joke) because everybody is salty as fuck. Like you. Carry on.

TheoneandonlyMrK · May 7, 2019

londiste said:
Well, it depends on the context or features/hardware in question.
The couple of operations Nvidia implemented in hardware as RT Cores do seem to be somewhat worth hyping: doable in shaders, definitely, but RT Cores are clearly much more efficient at them.
Tensor cores are a question but it looks like Nvidia has been somewhat hush-hush about what these actually do. For example the part where FP16 is done (or can be done) on Tensor cores is worth noting but of the bigger sites Anandtech was the one that caught wind of it for their TU116 review. I would say this is interesting.

So it does matter, but only if it's Nvidia lauding it. Anywho.

In the context of this thread we probably need to get more on topic.

AlienIsGOD · May 7, 2019

moproblems99 said:
I am 7 actually, sheesh. Age discrimination. The thread was supposed to be fun (and a joke) because everybody is salty as fuck. Like you. Carry on.

LOL I'm not salty, just wish ppl could act and write more adult like.... This site has gone downhill forum wise the last few years...

juiseman · May 7, 2019

AMD Scores EPYC Win With Cray And ORNL On Frontier 1.5 Exaflop Supercomputer

https://hothardware.com/news/amd-epyc-radeon-instinct-ornl-supercomputer

This is a big win for AMD

moproblems99 · May 7, 2019

AlienIsGOD said:
LOL I'm not salty, just wish ppl could act and write more adult like.

I'm sorry you couldn't see the joke that it was. I have ordered a happy meal for you.

steen · May 7, 2019

londiste said:
Tensor cores are a question but it looks like Nvidia has been somewhat hush-hush about what these actually do. For example the part where FP16 is done (or can be done) on Tensor cores is worth noting but of the bigger sites Anandtech was the one that caught wind of it for their TU116 review. I would say this is interesting.

For RTX TU, FP16 is exclusively a tensor op. GTX TU FP16 is interesting given that it has no tensor cores according to NV. I'm not entirely convinced the hardware is very different. The TU SM layout is more tightly packed than GP, but RTX/Tensor silicon appears to be only ~10% of the die. The TU uarch consumes more area even without the RTX pipeline. Given RTX features only make sense with a minimum raster performance level (2060), I wouldn't be surprised if GTX TU had similar hardware but limited to FP16 ops. The big benefit of RTX tensor cores IMO is the FP32 accumulate for data science.

eidairaman1 · May 7, 2019

moproblems99 said:
I am 7 actually, sheesh. Age discrimination. The thread was supposed to be fun (and a joke) because everybody is salty as fuck. Like you. Carry on.

should be in General Nonsense

FordGT90Concept · May 7, 2019

ratirt said:
I'm not sure where you are going with this, but thanks for the tip, and that's what I said: mixed precision. Anyway, my confusion with you is about a different matter. Let me ask you straight: do I understand correctly that tensor cores are AI or deep learning for you, or did I just understand you wrong? Because that's my impression.

The add is the only one that supports FP32 and the reason for that is so that it is less likely to overflow the FP16*FP16 result. The main point (and why it is good for AI) is that it is a matrix solver for tensor flow. AMD doesn't have a matrix solver. GCN has to do these calculations on the shaders which is much, much slower. Example: Vega can do about 24 TFLOP FP16; Volta can do over 100 TFLOP FP16 in its tensor cores alone.
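
The overflow point is easy to show with a toy dot product. This is only a sketch: `to_fp16` and `to_fp32` are helper names invented here that round a Python float to the nearest half or single precision value, and the input numbers are picked purely to trigger the problem.

```python
import math
import struct

def to_fp16(x):
    """Round to half precision; values too large for FP16 become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_fp32(x):
    """Round to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, round_acc):
    """FP16 x FP16 products, summed in an accumulator rounded by round_acc."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return acc

a = [256.0] * 4
b = [128.0] * 4
print(dot(a, b, to_fp16))  # inf: the running sum exceeds the FP16 maximum of 65504
print(dot(a, b, to_fp32))  # 131072.0
```

The inputs and each individual product fit in FP16; it is the running sum that blows past the FP16 range, which is why the wider accumulator matters.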

theoneandonlymrk said:
They, meaning Nvidia, do not have RPM. They can do all of it, but do some of it with special hardware, i.e. tensor or RTRT cores, and some is done by CUDA cores, but they're not doing it the same way at all.

I have a Vega; I know what it can do.

NVIDIA added parallelism to deal with the problem in Turing where AMD made Vega more flexible. As a result, Turing has a lot of transistors but more performance where Vega has fewer transistors but less performance.

AMD is going to want to compete in AI so AMD is going to have to add tensor cores eventually but I don't think that is in Navi because it was made for Sony who has no use for it.

steen · May 7, 2019

FordGT90Concept said:
NVIDIA added parallelism to deal with the problem in Turing where AMD made Vega more flexible. As a result, Turing has a lot of transistors but more performance where Vega has fewer transistors but less performance.

Not entirely the same. GCN makes no distinction between graphics & compute modes & can schedule concurrently. TU is better at this than GP et al, but parallelism is a function of running integer & floats at the same time. Just highlights the different uarch approaches. NV prefers discrete specialized silicon costing more die space, whereas AMD (til now) has preferred generalist alus.

FordGT90Concept · May 7, 2019

Turing doesn't sacrifice anything (other than die space) for concurrent FP16 performance. Vega gets FP16 performance by taking away from FP32 performance. This is a disadvantage for Vega and an advantage for Turing when it comes to anything that can benefit from FP16.

TheoneandonlyMrK · May 7, 2019

FordGT90Concept said:
The add is the only one that supports FP32 and the reason for that is so that it is less likely to overflow the FP16*FP16 result. The main point (and why it is good for AI) is that it is a matrix solver for tensor flow. AMD doesn't have a matrix solver. GCN has to do these calculations on the shaders which is much, much slower. Example: Vega can do about 24 TFLOP FP16; Volta can do over 100 TFLOP FP16 in its tensor cores alone.

NVIDIA added parallelism to deal with the problem in Turing where AMD made Vega more flexible. As a result, Turing has a lot of transistors but more performance where Vega has fewer transistors but less performance.

AMD is going to want to compete in AI so AMD is going to have to add tensor cores eventually but I don't think that is in Navi because it was made for Sony who has no use for it.

Nvidia couldn't easily put back 64-bit compute, so they had to go with special hardware. They added tensor cores after Google ditched Nvidia's GPUs for its own tensor ASIC.

And just look at how much use their specific hardware is generally: it's useless.

FordGT90Concept · May 7, 2019

For games, mostly. Navi is a gaming product which is why I don't think it will have tensor cores. I would be shocked if Arcturus didn't have tensor cores because AMD is so far behind in machine learning. Then again, companies like Tesla are designing their own chips for machine learning anyway.

Point is: RPM doesn't help much with tensor flow where RTX's tensor cores do. DLSS isn't something Navi will have because it will lack the hardware to do it effectively.

CrAsHnBuRnXp · May 7, 2019

If patterns are anything to go by, the AMD hype train for their GPUs is going to crash.

Deleted member 24505 · May 7, 2019

AlienIsGOD said:
CHOO CHOOOOO!!!!1! Navi Hype Train be rollin'

looks like thread title was written by a 5 year old.....

What, you have only just noticed it, Mr. Salty?

steen · May 8, 2019

FordGT90Concept said:
Turing doesn't sacrifice anything (other than die space) for concurrent FP16 performance.

What does "concurrent fp16" even mean? You are aware that half floats & RPM (2xfp16) are used instead of fp32 to increase performance of ops not requiring full float precision? It's a register/resource & throughput gain in the case of 2xfp16. Int32, Int16, transcendentals, etc, still happen in the SM. TU "concurrency" is the ability to pack both integer & floats in the pipeline without bubbles/stalls/context switching.

FordGT90Concept said:
Vega gets FP16 performance by taking away from FP32 performance. This is a disadvantage for Vega and an advantage for Turing when it comes to anything that can benefit from FP16.

Frightening. You should read the TU uarch & mixed precision white papers.

FordGT90Concept said:
Point is: RPM doesn't help much with tensor flow where RTX's tensor cores do. DLSS isn't something Navi will have because it will lack the hardware to do it effectively.

Tensor math is just 4x4 matrix FMA. It's the ability of the tensors to work on fp16, int8, int4 that makes them useful in nn ML. I asked someone else earlier: what do you think DLSS is?
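
For reference, the 4x4 matrix FMA is D = A x B + C. A minimal Python sketch, assuming FP16 inputs for A and B and a wider accumulator for C and D (a plain Python float stands in for that accumulator here; `fp16` and `matrix_fma_4x4` are names made up for the example):

```python
import struct

def fp16(x):
    """Round a value to half precision, as the multiply inputs would be stored."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def matrix_fma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as lists of rows."""
    n = 4
    return [
        [c[i][j] + sum(fp16(a[i][k]) * fp16(b[k][j]) for k in range(n))
         for j in range(n)]
        for i in range(n)
    ]

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[0.5] * 4 for _ in range(4)]

d = matrix_fma_4x4(a, identity, c)
print(d[2])  # [8.5, 9.5, 10.5, 11.5]
```

Each output element is a four-term dot product plus an accumulator value, i.e. 64 multiply-adds per 4x4 operation.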

Midiamp · May 8, 2019

CrAsHnBuRnXp said:
If patterns are anything to go by, the AMD hype train for their GPUs is going to crash.

AMD has a bad marketing team. Instead of quelling the rumors, they just let the rumors spread like wildfire. I was one of the victims of the Radeon 7 hype train. The fall from hype hurts so bad that I now consider EVERY rumor about Zen 2 and Navi as nothing but bad gossip. Frankly, I don't want to be a part of a community that harbors and encourages the spreading of bad information.

seronx · May 8, 2019

If one googles GFX1010:
//On GFX10 I$ is 4 x 64 bytes cache lines. By default prefetcher keeps one cache line behind and reads two ahead. We can modify it with S_INST_PREFETCH for larger loops to have two lines behind and one ahead. Therefor we can benefit from aligning loop headers if loop fits 192 bytes. If loop fits 64 bytes it always spans no more than two cache lines and does not need an alignment. Else if loop is less or equal 128 bytes we do not need to modify prefetch, Else if loop is less or equal 192 bytes we need two lines behind.

-> L0 cache, which is referred to below.
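
My reading of that comment as a decision table, in Python. This is a sketch of the rule as quoted, not the actual compiler code; the function name and the returned field names are invented for illustration, and I am assuming loops larger than 192 bytes get no alignment because the comment only claims a benefit when the loop fits in 192 bytes.

```python
CACHE_LINE_BYTES = 64  # "4 x 64 bytes cache lines" per the quoted comment

def loop_alignment_plan(loop_size_bytes):
    """Decide whether to align a loop header and whether to change prefetch."""
    if loop_size_bytes <= CACHE_LINE_BYTES:
        # Spans at most two cache lines anyway, so alignment is not needed.
        return {"align": False, "two_lines_behind": False}
    if loop_size_bytes <= 2 * CACHE_LINE_BYTES:
        # Align, but the default prefetch (one behind, two ahead) is enough.
        return {"align": True, "two_lines_behind": False}
    if loop_size_bytes <= 3 * CACHE_LINE_BYTES:
        # Align and switch prefetch to two lines behind, one ahead.
        return {"align": True, "two_lines_behind": True}
    # Assumption: larger loops do not fit, so no alignment is attempted.
    return {"align": False, "two_lines_behind": False}

for size in (48, 100, 160, 256):
    print(size, loop_alignment_plan(size))
```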

// In WGP mode the waves of a work-group can be executing on either CU of the WGP. Therefore need to invalidate the L0 which is per CU. Otherwise in CU mode and all waves of a work-group are on the same CU, and so the L0 does not need to be invalidated.

-> CU mode and WGP mode

// HWRC = Register destination cache
&
// Try to reassign registers on GFX10+ to reduce register bank conflicts.
// On GFX10 registers are organized in banks. VGPRs have 4 banks assigned in a round-robin fashion: v0, v4, v8... belong to bank 0. v1, v5, v9... to bank 1, etc. SGPRs have 8 banks and allocated in pairs, so that s0:s1, s16:s17, s32:s33 are at bank 0. s2:s3, s18:s19, s34:s35 are at bank 1 etc.
// The shader can read one dword from each of these banks once per cycle. If an instruction has to read more register operands from the same bank an additional cycle is needed. HW attempts to pre-load registers through input operand gathering, but a stall cycle may occur if that fails. For example V_FMA_F32 V111 = V0 + V4 * V8 will need 3 cycles to read operands, potentially incuring 2 stall cycles.
// The pass tries to reassign registers to reduce bank conflicts.
// In this pass bank numbers 0-3 are VGPR banks and 4-11 are SGPR banks, so that 4 has to be subtracted from an SGPR bank number to get the real value. This also corresponds to bit numbers in bank masks used in the pass.
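
The bank mapping in those comments can be sketched in a few lines of Python. The function names are mine, and the cycle count is the worst case described in the comment (one dword per bank per cycle, with no help from operand gathering), not a measured figure.

```python
from collections import Counter

NUM_VGPR_BANKS = 4
NUM_SGPR_BANKS = 8
SGPR_BANK_OFFSET = 4  # the pass numbers SGPR banks 4-11 after VGPR banks 0-3

def vgpr_bank(reg):
    """v0, v4, v8... map to bank 0; v1, v5, v9... to bank 1, and so on."""
    return reg % NUM_VGPR_BANKS

def sgpr_bank(reg):
    """SGPRs are allocated in pairs: s0:s1 and s16:s17 share the first SGPR bank."""
    return SGPR_BANK_OFFSET + (reg // 2) % NUM_SGPR_BANKS

def operand_read_cycles(banks):
    """Worst-case cycles to read all operands, one read per bank per cycle."""
    return max(Counter(banks).values())

conflicting = [vgpr_bank(r) for r in (0, 4, 8)]   # all in bank 0
spread_out = [vgpr_bank(r) for r in (0, 5, 10)]   # banks 0, 1, 2

print(operand_read_cycles(conflicting))  # 3, i.e. up to 2 stall cycles
print(operand_read_cycles(spread_out))   # 1
```

Reassigning v4 and v8 to registers in other banks is what takes the FMA example from three read cycles down to one.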

-> HWRC and banking are part of Super-SIMD patents;
https://patents.google.com/patent/US20180357064A1
https://patents.google.com/patent/US20180121386A1

//In one embodiment, each bank of the vector destination cache holds 4 entries, for a total 8 entries with 2 banks.
-> destination register cache // HWRC => 8 destination registers with 3-entry source operand forwarding.

//In one embodiment, source operands buffer holds up to 6 VALU instruction's source operands. In one embodiment, source operand buffer includes dedicated buffers for providing 3 different operands per clock cycle to serve instructions like a fused multiply-add operation which performs a*b+c.
-> source operand buffer => 6 * 3-entry source operand buffer
