
NVIDIA to Enable DXR Ray Tracing on GTX (10- and 16-series) GPUs in April Drivers Update

Joined
Feb 3, 2017
Messages
3,868 (1.33/day)
Processor Ryzen 7800X3D
Motherboard ROG STRIX B650E-F GAMING WIFI
Memory 2x16GB G.Skill Flare X5 DDR5-6000 CL36 (F5-6000J3636F16GX2-FX5)
Video Card(s) INNO3D GeForce RTX™ 4070 Ti SUPER TWIN X2
Storage 2TB Samsung 980 PRO, 4TB WD Black SN850X
Display(s) 42" LG C2 OLED, 27" ASUS PG279Q
Case Thermaltake Core P5
Power Supply Fractal Design Ion+ Platinum 760W
Mouse Corsair Dark Core RGB Pro SE
Keyboard Corsair K100 RGB
VR HMD HTC Vive Cosmos
16 bit precision
That shape is definitely not from 16-bit precision.
Still waiting for Crytek to provide all the details, but it is highly likely that Vega's 16-bit precision is used for RT.
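To make the precision argument concrete, here is a minimal C++ sketch (an illustration, not Crytek's or AMD's code) that rounds an FP32 value to the nearest value representable with FP16's 10-bit mantissa. It assumes the value stays within FP16's normal range and ignores overflow and subnormals; the function name `round_to_fp16_precision` is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: keeps FP32's 8-bit exponent but rounds the 23-bit
// mantissa to FP16's 10 bits (round to nearest, ties to even). It models FP16
// *precision* only; FP16's smaller range (max 65504, subnormals) is NOT modeled.
float round_to_fp16_precision(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    constexpr std::uint32_t kDroppedBits = 23 - 10;  // mantissa bits FP16 lacks
    const std::uint32_t kept_lsb = (bits >> kDroppedBits) & 1u;
    // Adding 0x0FFF (+1 if the kept LSB is odd) rounds to nearest, ties to even;
    // a carry out of the mantissa correctly bumps the exponent.
    bits += ((1u << (kDroppedBits - 1)) - 1u) + kept_lsb;
    bits &= ~((1u << kDroppedBits) - 1u);  // clear the dropped mantissa bits

    float y;
    std::memcpy(&y, &bits, sizeof y);
    return y;
}
```

With this, a coordinate of 1000.3 becomes 1000.5, because FP16 values between 512 and 1024 are spaced 0.5 apart, so precision loss shows up as positions snapping to a coarse grid.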
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,263 (4.42/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Makes sense: double the FLOPS for a slight loss in accuracy. Pretty sure Radeon Rays has 16-bit precision options.

Edit: https://github.com/GPUOpen-Librarie...adeonRays/src/intersector/intersector_lds.cpp
Code:
if (spec.has_fp16)
  m_gpudata->qbvh_prog.executable = m_device->CompileExecutable("../RadeonRays/src/kernels/CL/intersect_bvh2_lds_fp16.cl", headers, numheaders, buildopts.c_str());
Yup, RadeonRays SDK has FP16 checks.
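The quoted line compiles an FP16 intersection kernel only when the device reports FP16 support. A minimal sketch of that dispatch pattern follows; `DeviceSpec`, `select_intersect_kernel` and the FP32 fallback file name are hypothetical stand-ins, not the actual RadeonRays types or files, and tying `has_fp16` to the `cl_khr_fp16` OpenCL extension is an assumption.

```cpp
#include <string>

// Hypothetical stand-in for a device capability query (not the RadeonRays type).
struct DeviceSpec {
    bool has_fp16 = false;  // assumed: set from the cl_khr_fp16 OpenCL extension
};

// Returns the kernel source to compile: the FP16 path only when the device
// advertises half-precision support, otherwise an FP32 fallback.
std::string select_intersect_kernel(const DeviceSpec& spec) {
    const std::string dir = "../RadeonRays/src/kernels/CL/";
    if (spec.has_fp16) {
        return dir + "intersect_bvh2_lds_fp16.cl";  // file named in the quote
    }
    return dir + "intersect_bvh2_lds.cl";  // hypothetical FP32 file name
}
```

The point of the pattern is that older hardware never sees the FP16 kernel at all; it simply gets the FP32 path.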
 
Joined
Feb 19, 2019
Messages
324 (0.15/day)
Yes, I made a comment a few posts back: they just added BVH GPU acceleration and FP16 support to RadeonRays 3.0 last week.
https://www.techpowerup.com/forums/...ril-drivers-update.253759/page-4#post-4018035

Also, how does FP16 impact performance on Polaris? It offers a 1:1 ratio, but can FP16 help make better use of memory bandwidth vs. INT32/FP32?
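On the bandwidth side of the question: storing values as FP16 halves their size, so the same memory traffic carries twice as many of them. The sketch below compares two hypothetical BVH node layouts (not RadeonRays' actual format), one with FP32 bounds and one with bounds stored as raw IEEE binary16 bit patterns. Whether this helps on a given GPU depends on whether the workload is bandwidth-bound, which the sketch does not measure.

```cpp
#include <cstdint>

// Hypothetical BVH node with an axis-aligned bounding box stored in FP32.
struct NodeFp32 {
    float bounds_min[3];
    float bounds_max[3];
    std::uint32_t left_child;   // index of first child (or primitive)
    std::uint32_t right_child;  // index of second child (or primitive count)
};

// Same node with the bounds stored as FP16 bit patterns (IEEE binary16).
struct NodeFp16 {
    std::uint16_t bounds_min[3];
    std::uint16_t bounds_max[3];
    std::uint32_t left_child;
    std::uint32_t right_child;
};

// Bytes that must be fetched to read `count` nodes of type Node.
template <typename Node>
constexpr std::uint64_t bytes_for_nodes(std::uint64_t count) {
    return count * sizeof(Node);
}
```

On common ABIs the nodes are 32 and 20 bytes: only the bounds shrink, the child indices do not, so the node gets 37.5% smaller rather than 50%.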
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,263 (4.42/day)
Location
IA, USA
The FP16 code paths wouldn't be taken on Polaris at all. They aren't something the GPU understands.
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,263 (4.42/day)
Location
IA, USA
No, DXR would likely be implemented using FP32 if AMD were to add backwards compatible support for it. I assume that it would just be noisier (fewer rays/bounces).
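The "noisier with fewer rays" point follows from Monte Carlo sampling: the error of an N-sample estimate shrinks roughly as 1/√N, so cutting the rays per pixel by 4x roughly doubles the noise. A small C++ sketch with a toy estimator (the fraction of random points inside a quarter circle, standing in for one pixel's lighting integral) measures the RMS error for different sample counts; it illustrates the statistics only, not any DXR implementation.

```cpp
#include <cmath>
#include <cstdint>
#include <random>

// Estimates pi/4 (area of a quarter unit circle) from `samples` random points,
// a toy stand-in for integrating light over `samples` rays in one pixel.
double estimate_quarter_circle(int samples, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    int inside = 0;
    for (int i = 0; i < samples; ++i) {
        const double x = uniform(rng);
        const double y = uniform(rng);
        if (x * x + y * y <= 1.0) ++inside;
    }
    return static_cast<double>(inside) / samples;
}

// Root-mean-square error of the estimate over `trials` independent runs.
double rms_error(int samples, int trials, std::uint32_t seed) {
    std::mt19937 rng(seed);
    const double exact = std::acos(-1.0) / 4.0;
    double sum_sq = 0.0;
    for (int t = 0; t < trials; ++t) {
        const double err = estimate_quarter_circle(samples, rng) - exact;
        sum_sq += err * err;
    }
    return std::sqrt(sum_sq / trials);
}
```

Going from 16 to 1024 samples (64x) cuts the RMS error by about 8x (√64), which is why a GPU that can only afford a fraction of the rays would tend to produce grainier output.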
 
Joined
Sep 22, 2012
Messages
1,010 (0.22/day)
Location
Belgrade, Serbia
System Name Intel® X99 Wellsburg
Processor Intel® Core™ i7-5820K - 4.5GHz
Motherboard ASUS Rampage V E10 (1801)
Cooling EK RGB Monoblock + EK XRES D5 Revo Glass PWM
Memory CMD16GX4M4A2666C15
Video Card(s) ASUS GTX1080Ti Poseidon
Storage Samsung 970 EVO PLUS 1TB /850 EVO 1TB / WD Black 2TB
Display(s) Samsung P2450H
Case Lian Li PC-O11 WXC
Audio Device(s) CREATIVE Sound Blaster ZxR
Power Supply EVGA 1200 P2 Platinum
Mouse Logitech G900 / SS QCK
Keyboard Deck 87 Francium Pro
Software Windows 10 Pro x64
Then I wasn't wrong to pay 580 euros for a GTX 1080 Ti Poseidon 5 months ago???
Performance of an RTX 2080 FE, 11 GB, and the option of water cooling (not a full-cover waterblock), but at least for that money it includes a cooler comparable to AIO kits worth over $100.
And the option to install a full custom waterblock for the ASUS Strix when the price drops to around 70 euros.

We can say different things about AMD, but they help everyone with their research and by trying to offer lower prices.
Without AMD, Intel and NVIDIA alone would massacre us with their policies and prices.

The first impression after GeForce users switch to a high-end Radeon graphics card will be... I feel like the picture quality is a little better, like a sharper, cleaner, more photographic image.
And in the future, if NVIDIA continues to ask $1000 for a high-end GPU and AMD offers one 10% weaker for $750, I'm ready to consider an Intel-Radeon combination.
 
Joined
Mar 10, 2014
Messages
1,795 (0.45/day)
No, DXR would likely be implemented using FP32 if AMD were to add backwards compatible support for it. I assume that it would just be noisier (fewer rays/bounces).

Well, there might be some use of INT cores within it. Not necessarily at precision as low as INT8, but that Exodus frame pic shows quite hefty use of INT32 math during RT (possibly just filtering/denoising, but anyhow). I think the ability to do FP32 and INT32 math concurrently gives one way to do it the hard way. Volta and Turing can do it that way and Pascal obviously can't. Which raises the question: what about GCN, can it run FP32 and INT32 math concurrently?

 
Joined
Sep 24, 2018
Messages
14 (0.01/day)
That's the point of this move. Making the "poor" look not so poor in comparison.

"- Boys, let's release this RTX we've been working on for a few years, we'll have it exclusively in the new cards, so we'll sell them for premium.
- Boss, it's not that great yet, our new hardware still isn't that capable of fast and proper implementation.
- Just do it, it will be the first hardware Ray Tracing bling-bling ever, it's a big deal. We'll get it working in a couple AAA games and people will jump into it.
(... few months later...)
- Boss, people aren't joining the RTX bandwagon... and they aren't really swapping Pascal for Turing.
- Well then execute plan B: unlock Ray Tracing for old Pascal.
- But boss, those have no Ray Tracing focused hardware, it will run tons worse.
- Exactly, we'll make them feel that Pascal is ancient crap, and then they'll want to finally swap them for RTXs. At the same time we'll spread the name even more.
(... weeks later...)
- Boss, still not a big interest in RTX 2000 cards. What now?
- Fine, release the RTX 3000 series with proper improved RTX performance, we'll make RTX 2000 look like ancient crap in comparison, and RTX 3000 look like the second coming of baby Jesus."


Just like they said when they presented RTX, bringing a card to the market that can do Ray Tracing this "fast" (compared to before) is quite an achievement. Problem is: it's still not fast enough.

"- People, we made the impossible: a card that can finally do the legendary tech that is "Ray Tracing"! Behold!
- Cool!
(...) Okay, never mind, it runs slow. And I can live without it for now, I can barely see the difference anyway.
- No, you don't understand. This is dope engineering. If it wasn't for this new hardware, it would be a slideshow with your current card.
- Y, but it's still slow. Not appealing.
- Look, we'll show you. Let's test with your current card.
- Dude, please..."

:clap::rockout::laugh::laugh::laugh::laugh::laugh:
 
Joined
Sep 27, 2014
Messages
550 (0.15/day)
Volta and Turings can do it that way and Pascal obviously can't. Which bodes the question: how is GCN, can it run fp32 and int32 math concurrently?
I doubt that Turing can do it 100% either. Even if they say "independent integer execution units", I think it's like Hyper-Threading in CPUs. I might be wrong, though...
 
Joined
Mar 10, 2014
Messages
1,795 (0.45/day)
I doubt that Turing can do it 100% either, even if they say "independent integer execution units", I think it's like Hyperthreading in CPU's. I might be wrong though...

Well, it can do FP32 at full throttle and concurrently run some INT operations on its separate INT cores. It's actually all about keeping the GPU busy doing floating-point math.

In the Turing whitepaper they say that even in current games they get some performance benefit from that alone. It's one of the main reasons why Turing gets more performance per TFLOP than Pascal (the second being the new shared cache).
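The benefit of a separate INT32 datapath can be put into a back-of-the-envelope issue model. It assumes (as a simplification) that each datapath accepts one instruction per cycle: with a shared datapath, FP32 and INT32 instructions queue behind each other; with split datapaths they overlap and the longer stream sets the pace. The instruction mix is an input, not a measurement; NVIDIA's Turing whitepaper cites roughly 36 INT instructions per 100 FP instructions in game shaders, used here only as an example.

```cpp
#include <algorithm>

// Cycles to issue a mix of FP32 and INT32 instructions under a simplified
// model where each datapath accepts one instruction per cycle.
double cycles_shared_datapath(double fp_ops, double int_ops) {
    return fp_ops + int_ops;  // Pascal-style: both kinds compete for one pipe
}

double cycles_split_datapaths(double fp_ops, double int_ops) {
    return std::max(fp_ops, int_ops);  // Turing-style: pipes run concurrently
}

// Speedup from splitting the datapaths for the given instruction mix.
double split_speedup(double fp_ops, double int_ops) {
    return cycles_shared_datapath(fp_ops, int_ops) /
           cycles_split_datapaths(fp_ops, int_ops);
}
```

Under these assumptions a 36:100 mix gives about 1.36x, pure FP32 code gains nothing, and an even split would approach 2x. The model ignores memory stalls, scheduler limits and the cache changes, so real gains would differ.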
 
Joined
Sep 27, 2014
Messages
550 (0.15/day)
Yes, I saw that. This is the part that reminded me of the Hyper-Threading concept on CPUs:
First, the Turing SM adds a new independent integer datapath that can execute instructions concurrently with the floating-point math datapath.... This translates to 2x more bandwidth and more than 2x more capacity available for L1 cache for common workloads.
It's not like they have added extra integer units. To me it looks like they can now issue those INT instructions at the same time as FP32 ones.

Later this gets a little more... confusing.
Two SMs are included per TPC, and each SM has a total of 64 FP32 Cores and 64 INT32 Cores. In comparison, the Pascal GP10x GPUs have one SM per TPC and 128 FP32 Cores per SM. The Turing SM supports concurrent execution of FP32 and INT32 operations (more details below), independent thread scheduling similar to the Volta GV100 GPU.
So in the first sentence they say that those are separate cores. But the numbers (64 FP + 64 INT) add up to the same number as on Pascal (128 FP). And the last sentence again talks specifically about "thread scheduling"...
I don't think there would be a physical difference between an INT32 and a FP32 core. It's only the schedulers that make that difference.

Overall, the changes in SM enable Turing to achieve 50% improvement in delivered performance per CUDA core.
This is in line with Hyper-Threading gains on CPUs.
 
Joined
Mar 10, 2014
Messages
1,795 (0.45/day)
Yes, I saw that. This is the part reminded me of hyper-threading concept on CPU's:

Is not like they have added extra integer units. To me it looks like they can now issue those INT instructions in the same time with FP32 ones.

Later this gets a little more... confusing.

So in the first sentence they say that those are separate cores. But the numbers (64FP+64INT) add to the same number like on Pascal (128FP). And the last sentence talks again specifically of "thread scheduling"...

As it says, a Turing TPC has _two_ of those (64 INT + 64 FP) SMs, while a Pascal TPC has _one_ 128 FP SM. On one more confusing note: according to NVIDIA, Turing GPUs without Tensor Cores have separate FP16 cores.
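The per-TPC comparison works out as follows, using only the figures quoted from the whitepaper (two SMs of 64 FP32 + 64 INT32 cores per Turing TPC, one SM of 128 FP32 cores per Pascal GP10x TPC). The struct and names are illustrative.

```cpp
// Execution units per Texture Processing Cluster (TPC), from the quoted figures.
struct TpcLayout {
    int sms_per_tpc;
    int fp32_per_sm;
    int int32_per_sm;  // dedicated INT32 cores (0 when INT shares the FP32 cores)
};

constexpr int fp32_per_tpc(const TpcLayout& t) { return t.sms_per_tpc * t.fp32_per_sm; }
constexpr int int32_per_tpc(const TpcLayout& t) { return t.sms_per_tpc * t.int32_per_sm; }

constexpr TpcLayout kTuring{2, 64, 64};
constexpr TpcLayout kPascalGp10x{1, 128, 0};
```

Per TPC the FP32 core count is the same (128 on both), and Turing adds 128 dedicated INT32 cores on top; the 64 + 64 split is per SM, not per TPC.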
 
Joined
Sep 27, 2014
Messages
550 (0.15/day)
Hmm...
GP102-450-A1 has 3840 cores and 30 Streaming Multiprocessors. 3840/30=128
GP104-410-A1 has 2560 / 20 = 128
TU102-400-A1 has 4608 cores and 72 Streaming Multiprocessors. 4608/72=64
TU104-400-A1 has 2944 / 46 = 64

To me it definitely looks like they broke those SMs in two and can issue either INT32 or FP32 instructions to them on independent paths, which is consistent with the approx. 50% gains.
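The divisions above can be checked mechanically; the chip names and core/SM counts are the ones listed in the post, and the struct is illustrative.

```cpp
#include <string>
#include <vector>

struct GpuConfig {
    std::string chip;
    int cuda_cores;  // FP32 CUDA cores
    int sms;         // Streaming Multiprocessors
};

// FP32 cores per SM (integer division; exact for the chips listed).
int cores_per_sm(const GpuConfig& g) { return g.cuda_cores / g.sms; }

const std::vector<GpuConfig> kChips = {
    {"GP102-450-A1", 3840, 30},
    {"GP104-410-A1", 2560, 20},
    {"TU102-400-A1", 4608, 72},
    {"TU104-400-A1", 2944, 46},
};
```

Both Pascal entries divide to 128 and both Turing entries to 64, with no remainder.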
 