Bullshit lower latency memory access will always give better performance that is the entire point
No, you still don't get what caching does.
If you have two comparable GPUs, where GPU A has 12 GB and GPU B has 8 GB + caching, the caching will try to make up for the missing memory in GPU B. Whenever you need less than 8 GB, there will be no difference, and when you need more, GPU B will perform up to the level of GPU A, never above it. Your confusion is about what to compare it to. HBC will not have lower latency than other GPU memory, only lower latency than falling back to system memory.
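To put rough numbers on the comparison, here is a toy model in Python. Everything in it is an assumption for illustration: the cost constants are made up rather than measured, `avg_cost` is a hypothetical helper, and it pretends accesses are spread evenly over the working set, which real workloads are not. It only shows the shape of the argument for working sets up to 12 GB.

```python
# Toy model only: the cost numbers are made up for illustration, not measured.
LOCAL_COST = 1.0       # relative cost of an access served from on-board memory
PAGED_COST = 4.0       # assumed cost of a miss when a hardware cache manages paging
FALLBACK_COST = 10.0   # assumed cost of a miss that falls back to system memory


def avg_cost(working_set_gb, vram_gb, miss_cost):
    """Average cost per access, assuming accesses are spread evenly
    over the working set (a simplification)."""
    if working_set_gb <= vram_gb:
        return LOCAL_COST
    local_share = vram_gb / working_set_gb
    return local_share * LOCAL_COST + (1.0 - local_share) * miss_cost


if __name__ == "__main__":
    print("working set | GPU A 12 GB | GPU B 8 GB + cache | GPU B 8 GB, no cache")
    for ws in (6, 8, 10, 12):
        gpu_a = avg_cost(ws, 12, FALLBACK_COST)
        gpu_b_cached = avg_cost(ws, 8, PAGED_COST)
        gpu_b_plain = avg_cost(ws, 8, FALLBACK_COST)
        print(f"{ws:>8} GB | {gpu_a:>11.2f} | {gpu_b_cached:>18.2f} | {gpu_b_plain:>20.2f}")
```

Under these assumptions, GPU B with caching matches GPU A up to 8 GB, costs more per access than GPU A between 8 and 12 GB, and only looks good next to the same 8 GB card without caching.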
Just because something is done in hardware, it doesn't mean it just magically works. The driver needs to be aware of the thing.
The driver is aware of the hardware capabilities, but it does not micro-manage low-level scheduling inside the GPU; that is controlled on the GPU side. Tiled rasterization is not a new unit with a new feature set to expose through an API; it's a reordering of operations inside the GPU.
And the "it's not even a new architecture" is getting very old. Everyone parroting how it's not a new architecture and yet they changed nearly everything in the core. … And 9 months is really not a lot of time.
It has been enough in the past, and it's not like they start from scratch when the working chips arrive. Remember, they did demo it working in late December. Well, at least this time, with all the delays, the driver should be ~2.5 months more mature than the drivers of Polaris and Fiji were at their respective releases.
But this boils down to what we've heard for every single generation from AMD over the last five years: at release, AMD fans say we can't judge it because the drivers are immature. Yet they somehow "know" it will improve and that we only need to give it more time, but no substantial improvement ever materializes.