So I love programming my Vega64. Except last night, I discovered that it's not doing hipMemcpy correctly for sizes larger than 1GB. I'm wondering if there's a PCIe bug in my system somehow, or if my Vega64 is defective. I rented an instance from GPUEater.com, and their Vega56 instance runs my code correctly (same ROCm drivers and everything).
This is a bad time to discover a hardware error as an AMD GPU-programming fan. Navi doesn't work reliably in ROCm yet, and Navi2x isn't released yet.
It could be the motherboard though (ASRock Taichi x399), because every hipMemcpy between host and device has to cross the PCIe 3.0 link. Or it could be the card itself. I don't think it's my CPU or RAM, because those parts have survived a suite of other tests already.
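For anyone who wants to poke at their own card, here's a rough sketch of the kind of round-trip check that would catch this. It is not the code from my project: `pattern_at`, `round_trip` and `first_mismatch` are names made up for illustration, and the only HIP calls used are `hipMalloc`, `hipMemcpy` and `hipFree`. Built with hipcc it copies host to device and back; built with a plain C++ compiler it falls back to a host memcpy, which is handy for confirming the checker itself isn't the thing that's broken.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#ifdef __HIPCC__
#include <hip/hip_runtime.h>
#endif

// Deterministic 64-bit pattern, so a corrupted word can be located by index.
static std::uint64_t pattern_at(std::size_t i) {
    return 0x9E3779B97F4A7C15ull * (static_cast<std::uint64_t>(i) + 1);
}

// Copies `bytes` from src to dst. With hipcc the data goes host -> device -> host;
// without it, a host staging buffer stands in for the GPU (assumption: this
// fallback only validates the checking logic, it says nothing about PCIe).
static bool round_trip(void* dst, const void* src, std::size_t bytes) {
#ifdef __HIPCC__
    void* dev = nullptr;
    if (hipMalloc(&dev, bytes) != hipSuccess) return false;
    bool ok = hipMemcpy(dev, src, bytes, hipMemcpyHostToDevice) == hipSuccess
           && hipMemcpy(dst, dev, bytes, hipMemcpyDeviceToHost) == hipSuccess;
    (void)hipFree(dev);
    return ok;
#else
    std::vector<unsigned char> staging(bytes);
    std::memcpy(staging.data(), src, bytes);
    std::memcpy(dst, staging.data(), bytes);
    return true;
#endif
}

// Returns the index of the first 64-bit word that differs after the round trip,
// -1 if everything matches, or -2 if the copy itself reported an error.
long long first_mismatch(std::size_t words) {
    if (words == 0) return -1;
    std::vector<std::uint64_t> src(words), dst(words, 0);
    for (std::size_t i = 0; i < words; ++i) src[i] = pattern_at(i);
    if (!round_trip(dst.data(), src.data(), words * sizeof(std::uint64_t))) return -2;
    for (std::size_t i = 0; i < words; ++i) {
        if (dst[i] != src[i]) return static_cast<long long>(i);
    }
    return -1;
}
```

The idea would be to call `first_mismatch` with word counts on either side of the 1GB mark, for example `(1ull << 30) / 8` and `(1ull << 30) / 8 + 1`, and see whether the first bad index lands at a consistent offset. Keep in mind the host side needs two buffers of that size, so a 2GB test wants roughly 4GB of free RAM.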
--------------
I can probably keep writing small test programs on my home computer, then rent GPUEater.com instances whenever I'm doing a "production run", since they charge per second of instance time. So it's not a big deal to spin up instances for a few minutes and shut them down.