You're asking the questions that have boggled my mind since the conception of the APU, and certainly what I find the most interesting challenge moving forward.
There are many conceivable answers, a wider bus among them (256-bit ddr3 would be sufficient for a ~512sp design), although perhaps less probable as we move to ddr4 and it's 1dimm-per-channel restriction and larger, more demanding iGPUs that will quickly outpace a 128-bit ddr4 bus. Certainly there is bga, but I wonder if amd is really willing to take that leap with their larger designs (as a consumer platform, ie not the ps4 or iterations of bobcat).
Hypertransport, if not a discrete (or optional) gddr5 bus to a gpu cache (ala what used to be called Sideport Memory in the discrete IGP days) seemed like a realistic option even up to this generation. While 32-bit, with the max bandwidth of a link resting somewhere near what gddr5 is capable on AMD's current gpu controllers, and meshing fairly nicely with being around half of what a 32/28nm iGPU would need (and twice what a 128-bit ddr3 bus could deliver), that would have more-or-less made sense. Obviously moving past this gen it would be less so, unless itself coupled with a ddr4 bus (ie ddr4 + gddrX).
From there, we have the possibilities of larger/faster caches (like the X1's on-die ram) offsetting what is needed externally. There is also the possibility of things like on-package off-die caches (not unlike Intel's Iris) as well stacked dram like Volta.
Whatever their solution, they need to do it yesterday. Their strength is (and has always been) in the floating point computation per mm (per process/cost) their designs deliver. While HSA capitalizes on this fact, as it should, with each passing node they lose that (realistic) advantage to intel, whom can ramp clocks higher until they reach parity in design (and then clock them lower to save power) even as their priority lies in improving their cpu cores. With each passing gpu gen nvidia grows closer to parity, as they are clearly receding from purely thinking of their designs as efficient gpus to rather more-or-less a floating point core (that makes sense as such unit with or without the shell of a cpu). The scary thing about all that is...intel and nvidia, those least dependant on memory bandwidth currently, have shown their plans for going forward. AMD, whom already is restricted on all fronts by this reality, has not (outside the ps4.)
I find that sincerely troubling. No doubt they have an answer...I just hope it comes sooner rather than later.