The 338.57 ns Vega 64 figure was for a 128 MB region. That person also overclocked to 1682 MHz core / 1060 MHz HBM and got 313.29 ns.
I just measured 634 clock ticks on a 16 MB region, or roughly 388 ns, which is at least in the same ballpark as your result (634 ticks over ~388 ns implies the counter ticks at about 1.63 GHz, i.e. roughly the core clock). I tried both -O2 and no optimizations; it didn't change anything.
Our pointer-chasing loops differ in that your loop's branch condition depends on the memory load's result, so it can't overlap with the memory latency; it has to be part of the measured latency in your test, even on a CPU with out-of-order execution. On a CPU that shouldn't make more than a cycle or two of difference, though. I don't know about GPUs.
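For concreteness, a chase loop whose exit condition consumes the loaded value looks something like this (my own sketch, purely for illustration, not either of our actual kernels):
Code:
#include <cstdint>

// Sketch: chase until a sentinel index is loaded. The compare in the loop
// condition consumes the load result, so the branch serializes behind every
// memory access.
uint32_t chaseUntilSentinel(const uint32_t* buffer, uint32_t start, uint32_t sentinel)
{
    uint32_t ptr = start;
    while (ptr != sentinel) {
        ptr = buffer[ptr];   // the next compare has to wait for this load
    }
    return ptr;
}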
That's probably it. To give the loop overhead a better chance of being hidden, I changed the code to the following:
Code:
start = clock();
while (count != runSize) {
    count++;
    ptr = myBuffer[ptr];   // each load's result is the next index, so the loads form a dependent chain
}
end = clock();
And this dropped the cycle count down to 577 cycles, or about 353 nanoseconds. I needed to add "buffer[3] = ptr" after the loop to prevent it from being optimized away (time == 0 for a few tries, lol). Averaged over multiple runs, this loop comes out about 60 clock ticks (~36 nanoseconds) faster.
That's pretty close to your result for a 16 MB region. I'm still using clock() on the actual GPU rather than the CPU clock, so I'll start investigating / disassembling the code to see whether it's S_MEMTIME or something else under the hood.
It's not a perfect replication yet, but maybe there's a fundamental clock-speed difference (either HBM2 or GPU clock) that accounts for the last ~15 nanoseconds of difference.
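For reference, here's a minimal sketch of how the whole timing kernel could fit together with the dead-store sink included (the kernel and parameter names are mine, and the real HIP code may differ):
Code:
#include <hip/hip_runtime.h>
#include <cstdint>

// Hypothetical reconstruction of the latency test, launched with a single
// thread. clock() reads the GPU's own clock counter, as in the snippet above.
__global__ void chaseLatency(uint32_t* myBuffer, uint64_t runSize, uint64_t* outCycles)
{
    uint32_t ptr = 0;
    uint64_t count = 0;
    long long start = clock();
    while (count != runSize) {
        count++;
        ptr = myBuffer[ptr];                          // dependent chain of loads
    }
    long long end = clock();
    myBuffer[3] = ptr;                                // dead-store sink: keeps the loop alive
    *outCycles = (uint64_t)(end - start) / runSize;   // average cycles per load
}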
-----------
Notes:
1. My initial 700+ cycle / 500 nanosecond result was from ROCm 3.x last year. Today, after switching to ROCm 4.0, the code seems to run faster (consistently under 700 cycles with the unedited HIP code). So maybe AMD made a few device driver tweaks that sped things up under the hood?
2. The loop is clearly unrolled now.
Code:
0000000000001038 <BB0_1>:
s_waitcnt vmcnt(0) // 000000001038: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 00000000103C: D28F0005 00020282
s_sub_i32 s4, s4, 32 // 000000001044: 8184A004
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001048: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 00000000104C: 380C0D04
global_load_dword v1, v[5:6], off // 000000001050: DC508000 017F0005
s_cmp_lg_u32 s4, 0 // 000000001058: BF078004
s_waitcnt vmcnt(0) // 00000000105C: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 000000001060: D28F0005 00020282
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001068: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 00000000106C: 380C0D04
global_load_dword v1, v[5:6], off // 000000001070: DC508000 017F0005
s_waitcnt vmcnt(0) // 000000001078: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 00000000107C: D28F0005 00020282
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001084: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 000000001088: 380C0D04
global_load_dword v1, v[5:6], off // 00000000108C: DC508000 017F0005
s_waitcnt vmcnt(0) // 000000001094: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 000000001098: D28F0005 00020282
You can see the global_load_dword instructions, as well as the repeated s_waitcnt vmcnt(0) (wait until the count of outstanding vector memory operations is zero). The global_load is asynchronous: it starts the memory load in the background, while the s_waitcnt is what actually forces the GPU core to wait on RAM.
So maybe the device driver change was really just a compiler upgrade that unrolls this loop better automatically?
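For what it's worth, the s_sub_i32 s4, s4, 32 suggests the loop was unrolled 32 times. At the source level that transformation is roughly equivalent to something like this (my sketch, using the variables from the snippet earlier in this post):
Code:
// Illustrative 4x unroll (the real code appears to be unrolled 32x). Each
// load still depends on the previous one, so unrolling removes the loop
// overhead (the s_sub/s_cmp/branch), not the serialized memory latency.
while (count + 4 <= runSize) {
    ptr = myBuffer[ptr];
    ptr = myBuffer[ptr];
    ptr = myBuffer[ptr];
    ptr = myBuffer[ptr];
    count += 4;
}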
3. "global_load_dword v1, v[5:6], off" shows that the memory load is a 32-bit operation + a 64-bit base register v[5:6] is the base register (myBuffer). So that answers that question we had earlier...
---------------
Notes:
* Today I'm running ROCm 4.0
* extractkernel tool to grab the disassembly
* constexpr uint64_t runSize = (1 << 22); (4 million numbers x 4 bytes each == 16 MB tested).
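For completeness, here's one way such a chase buffer can be filled, using a simple fixed-stride next-index pattern (just a sketch; I'm not claiming this is the exact initialization either of us used):
Code:
#include <cstdint>
#include <vector>

// Each element stores the index of the next element to visit. A stride of
// 16 uint32_t (64 bytes, one cache line) is a common choice; a randomized
// permutation is another.
void fillChaseBuffer(std::vector<uint32_t>& buf, uint32_t stride = 16)
{
    const uint64_t n = buf.size();                 // e.g. 1 << 22 for 16 MB
    for (uint64_t i = 0; i < n; i++) {
        buf[i] = static_cast<uint32_t>((i + stride) % n);
    }
}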
As I mentioned in the original article (Measuring GPU Memory Latency – Chips and Cheese), latency starts to matter when occupancy (available parallelism) is too low to hide latency. Exactly how often that happens and to what extent is going to vary a lot with workload.
[snip]
Anyway it's very hard to say for sure without profiling a workload, and even then interpreting the results is tricky. I suspect that GPUs like GA102 and Navi 21 are big enough that some workloads (particularly at 1080P) might not provide enough parallelism. But that's a guess in the end, since I don't have either of those GPUs to play with.
Agreed. But here are some notes.
* Vega (and older) is 16 x 4 architecturally. That is, each vALU of the CU has 16-wide execution units, and those execution units repeat every instruction 4 times, resulting in the 64-wide programming model.
* NVidia and RDNA are 32 x 1. The SMs / CUs are natively 32-wide and do none of that repetition, so NVidia and RDNA are better tailored for low-occupancy situations.
* NVidia and RDNA both have to deal with stalls: NVidia tracks stalls at the assembly level, while RDNA handles them automatically in hardware.
* All GPUs are 32-wide or wider from a programming-model perspective, which means even the smallest reasonable workload is 32-wide. If it weren't, no one would bother with the whole shader / DirectX / HLSL crap; they'd just run the code on a big, beefy CPU that's already optimized for single-threaded work. Additional CUDA blocks or OpenCL workgroups can provide additional latency hiding, but it should be noted that even a single warp / wavefront (with a good enough compiler and enough instruction-level parallelism at the C/C++ level) can in fact hide latency (see the sketch at the end of this post).
As such, occupancy doesn't tell us much unless we know exactly what code was being run. In fact, higher occupancy can even be a bad thing: GPUs have to split their vGPRs / registers between the "occupants", so running at occupancy 2 or occupancy 4+ means each wavefront gets 1/2 or 1/4 of the registers. Optimizing code (registers vs. occupancy) is therefore a genuinely complicated trade-off.
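To illustrate the single-wavefront ILP point from the list above: one thread can keep several independent chase chains in flight at once, overlapping their latencies at the cost of extra live registers. A rough sketch (hypothetical names, not code from the article):
Code:
#include <hip/hip_runtime.h>
#include <cstdint>

// Four independent pointer chains per thread. The four loads in each
// iteration don't depend on each other, so they can all be outstanding at
// once; each chain still sees full latency, but throughput roughly
// quadruples. The cost is more live vGPRs, which is exactly the
// registers-vs-occupancy trade-off described above.
__global__ void chaseILP4(const uint32_t* myBuffer, uint64_t runSize, uint32_t* out)
{
    uint32_t p0 = 0, p1 = 1, p2 = 2, p3 = 3;   // distinct starting indices
    for (uint64_t i = 0; i < runSize; i++) {
        p0 = myBuffer[p0];
        p1 = myBuffer[p1];
        p2 = myBuffer[p2];
        p3 = myBuffer[p3];
    }
    out[0] = p0; out[1] = p1; out[2] = p2; out[3] = p3;   // sink stores
}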