The 338.57 ns Vega 64 figure was for a 128 MB region. That person also overclocked to 1682 MHz core / 1060 MHz HBM and got 313.29 ns.
I just measured 634 clock ticks on a 16 MB region, or roughly 388 ns, which is at least in the same ballpark as your result (634 ticks over ~388 ns implies the counter ticks at about 1.63 GHz, i.e. roughly the core clock). I tried both -O2 and no optimizations; it didn't change anything.
Our pointer-chasing loops differ in that your loop's branch condition depends on the memory load's result, so it can't overlap with the memory latency; it has to be part of the measured latency in your test, even on a CPU with out-of-order execution. On a CPU that shouldn't make more than a cycle or two of difference, though. I don't know about GPUs.
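For concreteness, a chase loop whose exit condition consumes the loaded value looks something like this (my own sketch, purely for illustration, not either of our actual kernels):
Code:
#include <cstdint>

// Sketch: chase until a sentinel index is loaded. The compare in the loop
// condition consumes the load result, so the branch serializes behind every
// memory access.
uint32_t chaseUntilSentinel(const uint32_t* buffer, uint32_t start, uint32_t sentinel)
{
    uint32_t ptr = start;
    while (ptr != sentinel) {
        ptr = buffer[ptr];   // the next compare has to wait for this load
    }
    return ptr;
}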
That's probably it. To give the loop overhead a better chance of being hidden, I changed the code to the following:
Code:
start = clock();
while (count != runSize) {
    count++;
    ptr = myBuffer[ptr];   // each load's result is the next index, so the loads form a dependent chain
}
end = clock();
And this dropped the cycle count down to 577 cycles, or about 353 nanoseconds. I needed to add "buffer[3] = ptr" after the loop to prevent it from being optimized away (time == 0 for a few tries, lol). Averaged over multiple runs, this loop comes out about 60 clock ticks (~36 nanoseconds) faster.
That's pretty close to your result for a 16 MB region. I'm still using clock() on the actual GPU rather than the CPU clock, so I'll start investigating / disassembling the code to see whether it's S_MEMTIME or something else under the hood.
It's not a perfect replication yet, but maybe there's a fundamental clock-speed difference (either HBM2 or GPU clock) that accounts for the last ~15 nanoseconds of difference.
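For reference, here's a minimal sketch of how the whole timing kernel could fit together with the dead-store sink included (the kernel and parameter names are mine, and the real HIP code may differ):
Code:
#include <hip/hip_runtime.h>
#include <cstdint>

// Hypothetical reconstruction of the latency test, launched with a single
// thread. clock() reads the GPU's own clock counter, as in the snippet above.
__global__ void chaseLatency(uint32_t* myBuffer, uint64_t runSize, uint64_t* outCycles)
{
    uint32_t ptr = 0;
    uint64_t count = 0;
    long long start = clock();
    while (count != runSize) {
        count++;
        ptr = myBuffer[ptr];                          // dependent chain of loads
    }
    long long end = clock();
    myBuffer[3] = ptr;                                // dead-store sink: keeps the loop alive
    *outCycles = (uint64_t)(end - start) / runSize;   // average cycles per load
}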
-----------
Notes:
1. My initial 700+ cycle / 500 nanosecond result was from ROCm 3.x last year. Today, after switching to ROCm 4.0, the code seems to run faster (consistently under 700 cycles with the unedited HIP code). So maybe AMD made a few device driver tweaks that sped things up under the hood?
2. The loop is clearly unrolled now.
Code:
0000000000001038 <BB0_1>:
s_waitcnt vmcnt(0) // 000000001038: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 00000000103C: D28F0005 00020282
s_sub_i32 s4, s4, 32 // 000000001044: 8184A004
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001048: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 00000000104C: 380C0D04
global_load_dword v1, v[5:6], off // 000000001050: DC508000 017F0005
s_cmp_lg_u32 s4, 0 // 000000001058: BF078004
s_waitcnt vmcnt(0) // 00000000105C: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 000000001060: D28F0005 00020282
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001068: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 00000000106C: 380C0D04
global_load_dword v1, v[5:6], off // 000000001070: DC508000 017F0005
s_waitcnt vmcnt(0) // 000000001078: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 00000000107C: D28F0005 00020282
v_add_co_u32_e32 v5, vcc, v3, v5 // 000000001084: 320A0B03
v_addc_co_u32_e32 v6, vcc, v4, v6, vcc // 000000001088: 380C0D04
global_load_dword v1, v[5:6], off // 00000000108C: DC508000 017F0005
s_waitcnt vmcnt(0) // 000000001094: BF8C0F70
v_lshlrev_b64 v[5:6], 2, v[1:2] // 000000001098: D28F0005 00020282
You can see the global_load_dword instructions, as well as the repeated s_waitcnt vmcnt(0) (wait until the count of outstanding vector memory operations is zero). The global_load is asynchronous: it starts the memory load in the background, while the s_waitcnt is what actually forces the GPU core to wait on RAM.
So maybe the device driver change was really just a compiler upgrade that unrolls this loop better automatically?
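For what it's worth, the s_sub_i32 s4, s4, 32 suggests the loop was unrolled 32 times. At the source level that transformation is roughly equivalent to something like this (my sketch, using the variables from the snippet earlier in this post):
Code:
// Illustrative 4x unroll (the real code appears to be unrolled 32x). Each
// load still depends on the previous one, so unrolling removes the loop
// overhead (the s_sub/s_cmp/branch), not the serialized memory latency.
while (count + 4 <= runSize) {
    ptr = myBuffer[ptr];
    ptr = myBuffer[ptr];
    ptr = myBuffer[ptr];
    ptr = myBuffer[ptr];
    count += 4;
}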
3. "global_load_dword v1, v[5:6], off" shows that the memory load is a 32-bit operation + a 64-bit base register v[5:6] is the base register (myBuffer). So that answers that question we had earlier...
---------------
Notes:
* Today I'm running ROCm 4.0
* extractkernel tool to grab the disassembly
* constexpr uint64_t runSize = (1 << 22); (4 million numbers x 4 bytes each == 16 MB tested).
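For completeness, here's one way such a chase buffer can be filled, using a simple fixed-stride next-index pattern (just a sketch; I'm not claiming this is the exact initialization either of us used):
Code:
#include <cstdint>
#include <vector>

// Each element stores the index of the next element to visit. A stride of
// 16 uint32_t (64 bytes, one cache line) is a common choice; a randomized
// permutation is another.
void fillChaseBuffer(std::vector<uint32_t>& buf, uint32_t stride = 16)
{
    const uint64_t n = buf.size();                 // e.g. 1 << 22 for 16 MB
    for (uint64_t i = 0; i < n; i++) {
        buf[i] = static_cast<uint32_t>((i + stride) % n);
    }
}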
As I mentioned in the original article (Measuring GPU Memory Latency – Chips and Cheese), latency starts to matter when occupancy (available parallelism) is too low to hide latency. Exactly how often that happens and to what extent is going to vary a lot with workload.
[snip]
Anyway it's very hard to say for sure without profiling a workload, and even then interpreting the results is tricky. I suspect that GPUs like GA102 and Navi 21 are big enough that some workloads (particularly at 1080P) might not provide enough parallelism. But that's a guess in the end, since I don't have either of those GPUs to play with.
Agreed. But here are some notes.
* Vega (and older) is 16 x 4 architecturally. That is, each vALU of the CU has 16-wide execution units, and those execution units repeat every instruction 4 times, resulting in the 64-wide programming model.
* NVidia and RDNA are 32 x 1. The SMs / CUs are natively 32-wide and do none of that repetition, so NVidia and RDNA are better tailored for low-occupancy situations.
* NVidia and RDNA both have to deal with stalls: NVidia tracks stalls at the assembly level, while RDNA handles them automatically in hardware.
* All GPUs are 32-wide or wider from a programming-model perspective, which means even the smallest reasonable workload is 32-wide. If it weren't, no one would bother with the whole shader / DirectX / HLSL crap; they'd just run the code on a big, beefy CPU that's already optimized for single-threaded work. Additional CUDA blocks or OpenCL workgroups can provide additional latency hiding, but it should be noted that even a single warp / wavefront (with a good enough compiler and enough instruction-level parallelism at the C/C++ level) can in fact hide latency (see the sketch at the end of this post).
As such, occupancy doesn't tell us much unless we know exactly what code was being run. In fact, higher occupancy can even be a bad thing: GPUs have to split their vGPRs / registers between the "occupants", so running at occupancy 2 or occupancy 4+ means each wavefront gets 1/2 or 1/4 of the registers. Optimizing code (registers vs. occupancy) is therefore a genuinely complicated trade-off.
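To illustrate the single-wavefront ILP point from the list above: one thread can keep several independent chase chains in flight at once, overlapping their latencies at the cost of extra live registers. A rough sketch (hypothetical names, not code from the article):
Code:
#include <hip/hip_runtime.h>
#include <cstdint>

// Four independent pointer chains per thread. The four loads in each
// iteration don't depend on each other, so they can all be outstanding at
// once; each chain still sees full latency, but throughput roughly
// quadruples. The cost is more live vGPRs, which is exactly the
// registers-vs-occupancy trade-off described above.
__global__ void chaseILP4(const uint32_t* myBuffer, uint64_t runSize, uint32_t* out)
{
    uint32_t p0 = 0, p1 = 1, p2 = 2, p3 = 3;   // distinct starting indices
    for (uint64_t i = 0; i < runSize; i++) {
        p0 = myBuffer[p0];
        p1 = myBuffer[p1];
        p2 = myBuffer[p2];
        p3 = myBuffer[p3];
    }
    out[0] = p0; out[1] = p1; out[2] = p2; out[3] = p3;   // sink stores
}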