I'm not so sure HBM's latency can be as low as a cache system would require.
Latency barely seems to improve from one generation to the next; at least, that's what happened with DDR.
Well, HBM does carry a latency penalty, but it makes up for it by bursting a lot of data per transaction and by splitting the interface into several independent channels: you can queue up many memory requests at once and get the results back in rapid succession. So while there's overhead per access, the effective cost may not be that bad, depending on how much data you need to pull at once.

Think about what AMD did: it beefed up the last level of cache in its latest Zen chips. Why would they do that? Because an off-die I/O chiplet introduces latency, and you need a way to buffer it. Depending on the caching strategy, that last-level cache can absorb a large share of accesses, and the more hits you get, the more insulated you are from the latency behind it.
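To put rough numbers on both effects, here's a quick back-of-envelope sketch. The bandwidth, latency, and hit-rate figures are made up for illustration, not measured HBM specs; the two formulas (Little's law and the classic average memory access time) are standard, though.

```python
# Back-of-envelope model. All concrete numbers below are hypothetical.
# 1) Little's law: hiding latency requires enough requests in flight.
# 2) AMAT: a high hit rate in a big last-level cache insulates the core
#    from the latency of whatever memory sits behind it.

def requests_in_flight(bandwidth_gbs: float, latency_ns: float,
                       request_bytes: int = 64) -> float:
    """Little's law: concurrency = throughput x latency.
    Returns how many cache-line-sized requests must be outstanding
    to saturate the given bandwidth at the given latency."""
    bytes_per_ns = bandwidth_gbs  # 1 GB/s == 1 byte/ns
    return bytes_per_ns * latency_ns / request_bytes

def amat_ns(hit_time_ns: float, miss_rate: float,
            miss_penalty_ns: float) -> float:
    """Average memory access time: hit_time + miss_rate * miss_penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Hypothetical HBM-like stack: ~400 GB/s, ~100 ns load-to-use latency.
print(requests_in_flight(400, 100))  # ~625 outstanding 64-byte lines
# A cache with a 10 ns hit time and a 90% hit rate in front of it:
print(amat_ns(10, 0.10, 100))        # 20 ns effective latency
```

The first number shows why multiple channels and deep queues matter: no single core keeps hundreds of requests in flight, so the bandwidth only pays off under heavy parallel access. The second shows how a 90% hit rate turns a 100 ns memory into an effective 20 ns one, which is the insulation argument above.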
You also have to consider what Apple is doing: this level of the memory hierarchy has to serve the GPU and the AI circuitry as well as the CPU, and HBM is well suited to those sorts of workloads. All in all, it's probably a wash on latency. The real advantage is the memory bandwidth at relatively low power consumption and high memory density.