NVIDIA to Enable DXR Ray Tracing on GTX (10- and 16-series) GPUs in April Drivers Update

16 bit precision
That shape is definitely not from 16-bit precision.
Still waiting for Crytek to provide all the details, but it is highly likely that Vega's 16-bit precision is used for RT.
 
Makes sense, double the FLOPS for a slight loss in accuracy. Pretty sure Radeon Rays has 16-bit precision options.

Edit: https://github.com/GPUOpen-Librarie...adeonRays/src/intersector/intersector_lds.cpp
Code:
if (spec.has_fp16)
  m_gpudata->qbvh_prog.executable = m_device->CompileExecutable("../RadeonRays/src/kernels/CL/intersect_bvh2_lds_fp16.cl", headers, numheaders, buildopts.c_str());
Yup, RadeonRays SDK has FP16 checks.
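
To put a rough number on the "slight loss in accuracy": a minimal sketch, using only Python's standard `struct` module, that rounds one value through IEEE half and single precision and compares the error. The sample coordinate and the helper name `round_trip` are made up for illustration; none of this is taken from RadeonRays.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Round a Python float through the given struct format and back.

    'e' is IEEE 754 binary16 (FP16), 'f' is binary32 (FP32).
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

# Hypothetical world-space coordinate, chosen only for illustration.
x = 1234.567

fp16 = round_trip(x, "<e")
fp32 = round_trip(x, "<f")

err16 = abs(fp16 - x)
err32 = abs(fp32 - x)

print(f"FP16: {fp16!r}  abs error {err16:.6f}")
print(f"FP32: {fp32!r}  abs error {err32:.6f}")
```

Between 1024 and 2048, FP16 values are spaced 1.0 apart, so this coordinate snaps to 1235.0, while the FP32 error stays well under 0.001. Whether that matters for ray tracing depends on scene scale and on where in the pipeline the half-precision values are used.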
 
Yes, I made a comment a few posts back: they just added BVH GPU acceleration and FP16 support for RadeonRays 3.0 last week.
https://www.techpowerup.com/forums/...ril-drivers-update.253759/page-4#post-4018035
Radeon-Rays.jpg

Also, how does FP16 affect performance on Polaris? It offers a 1:1 ratio, but can FP16 help make better use of memory bandwidth vs INT32/FP32?
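
On the storage side of the bandwidth question, the arithmetic is simple. A minimal sketch, assuming a hypothetical BVH node payload that is just one axis-aligned bounding box stored as six floats (min and max corners); real node layouts, including whatever RadeonRays uses, carry more fields than this.

```python
import struct

# Hypothetical node payload: one AABB = min(x, y, z) + max(x, y, z).
FLOATS_PER_AABB = 6

bytes_fp32 = struct.calcsize(f"<{FLOATS_PER_AABB}f")  # 4 bytes per value
bytes_fp16 = struct.calcsize(f"<{FLOATS_PER_AABB}e")  # 2 bytes per value

print(f"FP32 box: {bytes_fp32} bytes, FP16 box: {bytes_fp16} bytes")
```

Half the bytes per box means half the memory traffic for the same number of boxes fetched, regardless of the FP16:FP32 compute ratio. Whether that shows up as a real gain depends on the FP16 code path actually being taken on the GPU in question.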
 
The FP16 code paths wouldn't be taken on Polaris at all. They aren't something the GPU understands.
 
No, DXR would likely be implemented using FP32 if AMD were to add backwards compatible support for it. I assume that it would just be noisier (fewer rays/bounces).
 
Then I wasn't wrong to pay 580 euro for a GTX 1080 Ti Poseidon five months ago???
Performance of an RTX 2080 FE, 11 GB, an option for water cooling (not a full-cover waterblock), and at least for that money the included cooler is comparable with AIO kits worth over $100.
And there is the option to install a full custom waterblock for the ASUS Strix once its price drops to around 70 euro.

We can say different things about AMD, but they help everyone with their research and by trying to offer lower prices.
Intel and NVIDIA alone would have massacred us with their policies and prices without AMD.

The first impression GeForce users will have after switching to a high-end Radeon graphics card will be... I feel like the picture quality is a little better, like a sharper, cleaner, more photographic image.
And in the future, if NVIDIA continues to ask $1000 for a high-end GPU and AMD offers one 10% weaker for $750, I'm ready to consider an Intel-Radeon combination.
 
No, DXR would likely be implemented using FP32 if AMD were to add backwards compatible support for it. I assume that it would just be noisier (fewer rays/bounces).

Well, there might be some use of the INT cores in it. Not necessarily as low precision as INT8, but that Exodus frame pic shows quite hefty use of INT32 math during RT (possibly just filtering/denoising, but anyhow). I think the ability to do FP32 and INT32 math concurrently gives one way to do it the hard way. Volta and Turing can do it that way and Pascal obviously can't. Which begs the question: what about GCN, can it run FP32 and INT32 math concurrently?

geforce-rtx-gtx-dxr-one-metro-exodus-frame-850px.png
 
That's the point of this move. Making the "poor" look not so poor in comparison.

"- Boys, let's release this RTX we've been working on for a few years, we'll have it exclusively in the new cards, so we'll sell them for premium.
- Boss, it's not that great yet, our new hardware still isn't that capable of fast and proper implementation.
- Just do it, it will be the first hardware Ray Tracing bling-bling ever, it's a big deal. We'll get it working in a couple AAA games and people will jump into it.
(... few months later...)
- Boss, people aren't joining the RTX bandwagon... and they aren't really swapping Pascal for Turing.
- Well then execute plan B: unlock Ray Tracing for old Pascal.
- But boss, those have no Ray Tracing focused hardware, it will run tons worse.
- Exactly, we'll make them feel that Pascal is ancient crap, and then they'll want to finally swap them for RTXs. At the same time we'll spread the name even more.
(... weeks later...)
- Boss, still not a big interest in RTX 2000 cards. What now?
- Fine, release the RTX 3000 series with proper improved RTX performance, we'll make RTX 2000 look like ancient crap in comparison, and RTX 3000 look like the second coming of baby Jesus."


Just like they said when they presented RTX, bringing a card to the market that can do Ray Tracing this "fast" (compared to before) is quite an achievement. Problem is: it's still not fast enough.

"- People, we did the impossible: a card that can finally do the legendary tech that is "Ray Tracing"! Behold!
- Cool!
(...) Ok, never mind, it runs slow. And I can live without it for now, can barely see the difference anyway.
- No, you don't understand. This is dope engineering. If it wasn't for this new hardware, it would be a slideshow with your current card.
- Y, but it's still slow. Not appealing.
- Look, we'll show you. Let's test with your current card.
- Dude, please..."

:clap::rockout::laugh::laugh::laugh::laugh::laugh:
 
Volta and Turing can do it that way and Pascal obviously can't. Which begs the question: what about GCN, can it run FP32 and INT32 math concurrently?
I doubt that Turing can do it 100% either. Even if they say "independent integer execution units", I think it's like Hyper-Threading in CPUs. I might be wrong though...
 
Well, it can do FP32 at full throttle and concurrently run some INT operations on its separate INT cores. It's actually all about keeping the GPU busy doing floating-point math.

In the Turing whitepaper they say that even in current games they could get some performance benefit from that alone. And it is one of the main reasons why Turing gets more performance out of its TFLOPS than Pascal (the second being the new shared cache).
image5.jpg
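
A toy model of why a separate integer path keeps the FP units busy: a minimal sketch, assuming one instruction issues per cycle per pipe and ignoring dependencies, latency, and scheduler limits. The instruction mix (36 INT per 100 FP) is an assumed parameter for illustration, and `cycles_shared` / `cycles_concurrent` are hypothetical helper names, not anything from NVIDIA.

```python
def cycles_shared(fp_ops: int, int_ops: int) -> int:
    """One pipe handles both instruction types, so the work is serialized."""
    return fp_ops + int_ops

def cycles_concurrent(fp_ops: int, int_ops: int) -> int:
    """Separate FP and INT pipes run side by side; the longer stream dominates."""
    return max(fp_ops, int_ops)

FP_OPS, INT_OPS = 100, 36  # assumed instruction mix, illustration only

shared = cycles_shared(FP_OPS, INT_OPS)
concurrent = cycles_concurrent(FP_OPS, INT_OPS)
speedup = shared / concurrent

print(f"shared: {shared} cycles, concurrent: {concurrent} cycles, "
      f"speedup: {speedup:.2f}x")
```

With this mix the FP pipe never has to stop for integer work, which is where the gain comes from in the model. Real shaders will land above or below that depending on their actual INT/FP ratio.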
 
Yes, I saw that. This is the part that reminded me of the hyper-threading concept on CPUs:
First, the Turing SM adds a new independent integer datapath that can execute instructions concurrently with the floating-point math datapath.... This translates to 2x more bandwidth and more than 2x more capacity available for L1 cache for common workloads.
It's not like they have added extra integer units. To me it looks like they can now issue those INT instructions at the same time as the FP32 ones.

Later this gets a little more... confusing.
Two SMs are included per TPC, and each SM has a total of 64 FP32 Cores and 64 INT32 Cores. In comparison, the Pascal GP10x GPUs have one SM per TPC and 128 FP32 Cores per SM. The Turing SM supports concurrent execution of FP32 and INT32 operations (more details below), independent thread scheduling similar to the Volta GV100 GPU.
So in the first sentence they say that those are separate cores. But the numbers (64 FP + 64 INT) add up to the same number as on Pascal (128 FP). And the last sentence talks again specifically of "thread scheduling"...
I don't think there would be a physical difference between an INT32 and an FP32 core. It's only the schedulers that make that difference.

Overall, the changes in SM enable Turing to achieve 50% improvement in delivered performance per CUDA core.
This is in line with the hyper-threading gains on CPUs.
 
As it says, a Turing TPC has _two_ of those (64 INT + 64 FP) SMs, while a Pascal TPC has _one_ 128 FP SM. On one more confusing note: according to NVIDIA, Turings without tensor cores have separate FP16 cores.
 
Hmm...
GP102-450-A1 has 3840 cores and 30 Streaming Multiprocessors. 3840/30=128
GP104-410-A1 has 2560 / 20 = 128
TU102-400-A1 has 4608 cores and 72 Streaming Multiprocessors. 4608/72=64
TU104-400-A1 has 2944 / 46 = 64

To me it definitely looks like they split those SMs in two and can issue either INT32 or FP32 commands to them, on independent paths. Which is consistent with the approx. 50% gains.
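
The per-SM arithmetic above can be checked in a few lines. A minimal sketch; the core and SM counts are the ones quoted in this post, and the dictionary layout is just for illustration.

```python
# (CUDA cores, SM count) as quoted in the post above.
chips = {
    "GP102-450-A1": (3840, 30),
    "GP104-410-A1": (2560, 20),
    "TU102-400-A1": (4608, 72),
    "TU104-400-A1": (2944, 46),
}

# Integer division is exact here: every core count is a multiple of its SM count.
cores_per_sm = {name: cores // sms for name, (cores, sms) in chips.items()}

for name, per_sm in cores_per_sm.items():
    print(f"{name}: {per_sm} FP32 cores per SM")
```

Both Pascal parts come out at 128 per SM and both Turing parts at 64, which matches the whitepaper figures quoted earlier in the thread.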
 