0.8Kb is still a lot of information if it can be kept full, but when it can't you end up with a shader using power, making heat, and not doing work. At that point you usually have two choices: improve the cache hit rate, which takes a lot of tuning and tweaking, or just add more cache to increase the chances the data will already be loaded, which takes more power to run and makes more heat.
That's just not how this works. Firstly, a shader that doesn't do work uses little to no power, because of something called power gating. Not that it would matter, because this rarely happens: GPUs are designed to maximize utilization without the need for big caches/registers and complex caching algorithms. No one does that, because those things would take up so much more die space that you would have to decrease the overall number of ALUs, and that would nullify whatever advantage they were supposed to bring.
A 'shader' is a small program, written in GLSL, that performs graphics processing, and a 'kernel' is a small program, written in OpenCL, that does GPGPU processing. These programs don't need that many registers; what they need is to load data from system or graphics memory, and that operation comes with significant latency. AMD and Nvidia chose similar approaches to hide this unavoidable latency: grouping multiple threads together. AMD calls such a group a wavefront, Nvidia calls it a warp. A group of threads is the most basic unit of scheduling on GPUs that implement this approach: it is the minimum size of data processed in SIMD fashion, the smallest executable unit of code, and the way a single instruction is processed across all of the threads in the group at the same time.
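To make that concrete, here's a minimal CUDA sketch of my own (names and sizes are just for illustration, not from any real driver or game): one tiny kernel launched over a million threads, which the hardware chops into warps/wavefronts and runs one instruction per group at a time.

```cuda
// Illustrative only: a "kernel" is a small program launched over thousands of
// threads. The hardware groups threads into warps of 32 (Nvidia) or wavefronts
// (AMD) and issues one instruction for the whole group at a time (SIMT/SIMD).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(const float* in, float* out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n)
        out[i] = in[i] * k;  // every thread in the warp executes this same instruction
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // 256 threads per block = 8 warps per block; thousands of warps in flight
    // give the scheduler something to run while other warps wait on memory.
    scale<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```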
Secondly, the concept of improving the hit rate on a GPU cache doesn't even make sense, because there is nothing you can do: you already know that the same sequence of instructions will run thousands of times across multiple CUs, therefore you can schedule the execution in such a way that the data is always ready, provided you have enough memory bandwidth. And that's what everyone does, including Nvidia.
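Here's a rough sketch of what that looks like in practice, again just my own CUDA illustration: a grid-stride SAXPY where neighbouring threads touch neighbouring addresses, so every warp's loads coalesce into full memory transactions and the scheduler always has another warp ready while one is waiting on DRAM.

```cuda
// Sketch of the "keep the data streaming" idea, not anyone's production code:
// with enough warps resident and coalesced access patterns, latency is hidden
// by oversubscription and bandwidth rather than by a big cache.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* __restrict__ x, float* __restrict__ y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)             // grid-stride: reuse the same warps
    {
        y[i] = a * x[i] + y[i];                   // coalesced read/modify/write
    }
}

int main()
{
    const int n = 1 << 24;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch far more threads than the SMs/CUs have ALUs: oversubscription is
    // what lets the hardware swap warps instead of stalling on DRAM latency.
    saxpy<<<1024, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```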
Here's another hint: AMD calls their execution units Stream Processors and Nvidia names their cores Streaming Multiprocessors. Still don't believe me?
GP100: 3840 shaders, 4 MB L2 cache
Vega 64: 4096 shaders, 4 MB L2 cache
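Back-of-the-envelope, that's 4096 KB / 3840 ≈ 1.07 KB of L2 per shader on GP100 versus 4096 KB / 4096 = 1.0 KB per shader on Vega 64.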
Turns out they aren't that different, are they? GCN's problem isn't cache size or hit rate or anything like that; it's something else. It's the fact that GCN has a lot more complex logic on chip, whereas Nvidia offloads most of it to software. I won't go into details, but that's what uses a lot of power and what makes gaming performance unimpressive. I'll name just one thing: AMD has logic that allows scalar instructions to be executed within each CU, and Nvidia has no such thing. As far as graphics workloads go this is a mostly worthless addition that creates even more scheduling overhead, but it's great for compute.
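If you want to picture what "scalar" means here, a toy CUDA kernel of my own (not AMD's or Nvidia's code): the block-base index and the constant-memory load below are identical for every thread in a wavefront, so GCN's compiler can evaluate them once on the CU's scalar ALU and SGPRs, while on Pascal-era Nvidia hardware each of the 32 lanes conceptually recomputes the same value.

```cuda
// Toy example of wavefront-uniform ("scalar") vs per-lane ("vector") work.
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float scaleFromConstMem;           // uniform across the whole grid

__global__ void blend(const float* a, const float* b, float* out)
{
    int groupBase = blockIdx.x * blockDim.x;    // same value for the whole group
    float k = scaleFromConstMem;                // uniform load (scalar load on GCN)

    int i = groupBase + threadIdx.x;            // only this part differs per lane
    out[i] = k * a[i] + (1.0f - k) * b[i];      // vector (per-thread) work
}

int main()
{
    const int n = 1 << 16;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 3.0f; }

    float k = 0.25f;
    cudaMemcpyToSymbol(scaleFromConstMem, &k, sizeof(float));

    blend<<<n / 256, 256>>>(a, b, out);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);            // 0.25*1 + 0.75*3 = 2.5

    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```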