I'm a bit perplexed at how Smart Access Memory works in comparison to how it's always worked. What's the key difference between the two, ideally with something like a flow chart? What's being done differently? Doesn't the CPU always have access to VRAM anyway!? I imagine it's bypassing some step in the chain for quicker access than how it's been handled in the past, but that's the part I'm curious about. I mean, I can access a GPU's VRAM now, and the CPU and system memory obviously play some role in the process.

The mere fact that VRAM performance slows down around the point where the L2 cache is saturated on my CPU seems to indicate the CPU design plays a role, though it seems to be bottlenecked by system memory performance along with the CPU's L2 cache and core count (not thread count), which adds to the overall combined L2 cache. You see a huge regression in performance beyond the theoretical limits of the L2 cache: it seems to peak at that point, and on my CPU (4-way L2) it slows a bit up to 2MB file sizes, then drops off quite rapidly after that. If you disable physical cores, the bandwidth regresses as well, so the combined L2 cache impacts it from what I've seen.
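To be clearer about what I mean, it's the same kind of cliff you see in a plain block-size copy sweep. Here's a rough sketch of that sort of sweep in C - over ordinary system memory rather than the GPU aperture, so the absolute numbers won't match a VRAM test, but the shape (fast while the block fits in cache, then a drop) is the behaviour I'm describing:

```c
/* Rough block-size copy sweep: bandwidth stays high while the block still
 * fits in cache and falls off once it spills to the next level / DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    for (size_t block = 64 * 1024; block <= 64 * 1024 * 1024; block *= 2) {
        char *src = malloc(block), *dst = malloc(block);
        if (!src || !dst)
            return 1;
        memset(src, 1, block);

        size_t iters = (256ULL << 20) / block;   /* move ~256 MB total per size */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            memcpy(dst, src, block);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double gbps = (double)block * iters / secs / 1e9;
        printf("block %8zu KB: %6.1f GB/s\n", block / 1024, gbps);

        free(src);
        free(dst);
    }
    return 0;
}
```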
Without SAM, the CPU can only access the memory on a PCIe device through a 256MB window at a time. SAM gives the CPU direct access to the entire VRAM at any time.
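If it helps to picture it, here's a tiny C sketch of the idea. Everything in it is made up for illustration (the 16GB VRAM size, the random access pattern, and the remap counter standing in for a real driver operation): it just simulates how a fixed 256MB window has to be re-pointed whenever the CPU touches an address outside it, whereas with SAM the whole range is simply mapped:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define WINDOW_SIZE   (256ULL << 20)   /* classic 256 MB BAR aperture              */
#define VRAM_SIZE     (16ULL << 30)    /* assume a 16 GB card, purely illustrative */
#define NUM_ACCESSES  1000000

int main(void)
{
    uint64_t window_base = 0;          /* where the 256 MB window currently points */
    uint64_t remaps = 0;               /* simulated "move the window" operations   */

    srand(42);
    for (int i = 0; i < NUM_ACCESSES; i++) {
        /* pick a pseudo-random VRAM address the CPU wants to touch */
        uint64_t addr = ((uint64_t)rand() * RAND_MAX + rand()) % VRAM_SIZE;

        /* Without SAM: if the address falls outside the current window,
         * the aperture has to be re-pointed before the CPU can access it. */
        if (addr < window_base || addr >= window_base + WINDOW_SIZE) {
            window_base = (addr / WINDOW_SIZE) * WINDOW_SIZE;
            remaps++;
        }

        /* With SAM / resizable BAR the whole VRAM_SIZE range is CPU-visible,
         * so the same access needs no window move at all. */
    }

    printf("Scattered accesses: %d, aperture moves needed without SAM: %llu\n",
           NUM_ACCESSES, (unsigned long long)remaps);
    printf("Aperture moves needed with SAM: 0 (all %llu GB is mapped)\n",
           (unsigned long long)(VRAM_SIZE >> 30));
    return 0;
}
```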
52CU and 44CU are the next logical steps based on what's already released; AMD seems to disable 8 CUs at a time. I can see them doing a 10GB or 14GB capacity device. It would be interesting if they used GDDR6 and GDDR6X together alongside variable rate shading: use the GDDR6 when you scale the scene image quality back further and the GDDR6X at the higher quality, giving mixed performance at a better price. I would think they'd consider reducing the memory bus width to 128-bit or 192-bit for SKUs with those CU counts, though, if paired with Infinity Cache. It's interesting to think about how Infinity Cache in a CF setup would impact latency; I'd expect less micro stutter. The 99th percentiles will be interesting to look at for RDNA2 with all the added bandwidth and I/O.

I suppose 36 CUs is possible as well by extension, but I don't know how the profit margins would be; the low end of the market is eroding further and further each generation, not to mention Intel entering the dGPU market will compound that situation. I don't think a 30CU part is likely for RDNA2 - it would end up being 28CU if anything, and that's doubtful unless they wanted it for an APU/mobile part.
52 and 44 CUs would be very small steps. Also, AMD likes to do 8CU cuts? Yet the 6800 has 12 fewer CUs than the 6800 XT? Yeah, sorry, that doesn't quite add up. I'm very much hoping Navi 22 has more than 40 CUs, and I'd be very happy if it has 48. Any more than that is quite unlikely IMO. RDNA (1) scaled down to 24 CUs with Navi 14, so I would frankly be surprised if we didn't see RDNA 2 scale down just as far - though hopefully they'll increase the CU count a bit at the low end. There'd be a lot of sales volume in a low-end, low-CU-count, high-clocking GPU, and margins could be good if they can get by with a 128-bit bus for that part. I would very much welcome a new 75W RDNA2 GPU for slot-powered applications!
Combining two different memory technologies like you're suggesting would be a complete and utter nightmare. Either you'd need to spend a lot of time and compute shuffling data back and forth between the two VRAM pools, or you'd need to double the size of each (i.e. instead of a 16GB GPU you'd need a 16+16GB GPU), driving up prices massively. Not to mention the board space requirements: those boards would be massive, expensive, and very power hungry. And then there's all the trouble getting this to work. If you're using VRS as the differentiator, parts of the GPU would need to render a scene from one VRAM pool while the rest of the GPU renders the same scene from a different VRAM pool. That would either mean waiting a massive amount of time for data to copy over, tanking performance, or keeping the data duplicated in both VRAM pools simultaneously, which is expensive in terms of power and would cause all kinds of issues, with two different render passes and two VRAM pools each triggering new data loads into both pools at the same time. As I said: a complete and utter nightmare. Not to mention that one of the main points of Infinity Cache is to lower VRAM bandwidth needs - adding something like this on top makes no sense.
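To put some rough numbers on the "waiting massive amounts of time" part - every figure below is an assumption picked purely to show the scale, not a measurement of any real card:

```c
/* Back-of-envelope cost of shuffling assets between two VRAM pools.
 * All sizes and bandwidths are assumed, illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    double assets_gb = 4.0;    /* assumed working set both pools need          */
    double pool_gbps = 512.0;  /* assumed bandwidth of the slower (GDDR6) pool */
    double frame_ms  = 16.7;   /* 60 fps frame budget                          */

    double copy_ms = assets_gb / pool_gbps * 1000.0;
    printf("Copying %.1f GB at %.0f GB/s takes %.1f ms (%.0f%% of a %.1f ms frame)\n",
           assets_gb, pool_gbps, copy_ms, copy_ms / frame_ms * 100.0, frame_ms);

    /* The alternative - keeping the data resident in both pools - avoids the
     * copy but doubles the footprint instead: */
    printf("Duplicating instead: %.1f GB of assets occupy %.1f GB of total VRAM\n",
           assets_gb, assets_gb * 2.0);
    return 0;
}
```

Either way you pay: milliseconds of copying out of a ~17 ms frame budget, or double the memory for everything shared.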
I would expect narrower buses for lower-end GPUs, though the IC will likely also shrink due to the sheer die area requirements of 128MB of SRAM. I'm hoping for 96MB of IC and a 256-bit or 192-bit bus for the next cards down. A 128-bit bus won't be doable unless they keep the full-size cache, and even then that sounds anemic for a GPU at that performance level (RTX 2080/2070-ish).
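For a sense of what those bus widths mean in raw numbers (assuming a 16 Gbps GDDR6 speed grade, which is a guess on my part):

```c
/* Raw GDDR6 bandwidth at different bus widths, assuming 16 Gbps per pin. */
#include <stdio.h>

int main(void)
{
    int widths[] = { 256, 192, 128 };   /* bus width in bits */
    double gbps_per_pin = 16.0;         /* assumed GDDR6 data rate */

    for (int i = 0; i < 3; i++) {
        double bandwidth_gbs = widths[i] / 8.0 * gbps_per_pin;   /* GB/s */
        printf("%3d-bit bus: %4.0f GB/s raw\n", widths[i], bandwidth_gbs);
    }
    return 0;
}
```

That works out to 512, 384 and 256 GB/s respectively, which is why a 128-bit part would lean so heavily on the cache.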
Given AMD's complete silence on it, I'm guessing CrossFire is just as dead now as it was for RDNA1, with the only support being in major benchmarks.