Take a look at what yuzu and Ryujinx are doing to emulate the Tegra GPU; you would be surprised.
I need your pointers, if you wouldn't mind.
I looked at yuzu and am not surprised: Nvidia and their extensions. And I don't mean that cynically. It has been this way since an anonymous developer described the state of the industry, referring to the vendors as Vendor A and Vendor B.
I still hold that treating mobile GPUs as big GPUs is beside the point. We are trying to make the most of the active registers, nothing more. Having to schedule 2 cycles' worth of data is great for instruction latency, which frees up cycles. The GPU suddenly has twice the register fidelity in instruction sequencing. Compare using all the registers for a single workload with running 5 different workloads with only a quarter of the registers active; you get the picture.
This doesn't matter on a 512-shader machine, but on a 4096-shader big die, you bet you could do better than the same automatic compiler that just interleaves work without looking at shader runtime. You cannot expect memory to remain the slowest component indefinitely. Once it gets fast, you need a proper driver in the seat.
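To put rough numbers on the register trade-off above, here is a minimal sketch of the usual register-limited occupancy arithmetic. Everything in it is an assumption for illustration: the register file size, the wave width, and the function name `resident_waves` are hypothetical and not taken from Tegra, yuzu, or any particular driver.

```python
def resident_waves(register_file_size, regs_per_thread, threads_per_wave):
    """How many waves fit in the register file at once (registers are the only limit here)."""
    return register_file_size // (regs_per_thread * threads_per_wave)


# Hypothetical hardware numbers, chosen only to make the arithmetic easy.
REGISTER_FILE = 65536   # registers per compute unit (assumed)
THREADS_PER_WAVE = 32   # threads per wave (assumed)

for regs in (32, 64, 128, 256):
    waves = resident_waves(REGISTER_FILE, regs, THREADS_PER_WAVE)
    print(f"{regs:3d} regs/thread -> {waves:2d} resident waves")
```

With these made-up numbers, a shader that uses 256 registers per thread leaves room for 8 resident waves, while one that uses 32 leaves room for 64. A compiler that only interleaves many small workloads sits at the second end of that range; a hand-scheduled single workload sits at the first. Which end is faster depends on how much latency there is left to hide, which is the point about memory speed.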