Right, so far. Vega has plenty of computational performance and memory bandwidth, and it works fine for simple compute workloads, so it all comes down to utilization under various workloads.
1) It's not a lack of memory bandwidth. RX Vega 64 has 483.8 GB/s, essentially the same as the GeForce GTX 1080 Ti (484.3 GB/s), so there is plenty.
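For reference, those bandwidth figures follow directly from the public memory specs. A quick sanity check (the per-pin data rates are the commonly quoted figures, treat them as approximate):

```python
# Back-of-the-envelope check of the peak-bandwidth figures quoted above.
def bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: (bus width in bits / 8) * per-pin data rate in Gbit/s."""
    return bus_width_bits / 8 * data_rate_gbps

# RX Vega 64: 2048-bit HBM2 bus at ~1.89 Gbit/s per pin (945 MHz, double data rate)
vega64 = bandwidth_gbs(2048, 1.89)
# GTX 1080 Ti: 352-bit GDDR5X bus at 11 Gbit/s per pin
gtx1080ti = bandwidth_gbs(352, 11.0)

print(f"Vega 64: {vega64:.1f} GB/s")   # -> 483.8
print(f"1080 Ti: {gtx1080ti:.1f} GB/s")  # -> 484.0
```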
2) Well, considering most top games are console ports, the game bias today favors AMD more than ever. Still, most people are mistaken about what developers actually do in these "optimizations". In principle, games are written against a common graphics API, and none of the big ones are optimized by design for any GPU architecture or specific model. Developers are of course free to create different render paths for different hardware, but this is rare and shouldn't be done; it's commonly used only when certain hardware has major problems with certain workloads. Many games are still marginally biased one way or the other, but this is not intentional; it is simply a consequence of most developers doing the critical development phases on one vendor's hardware and then accidentally making design choices that favor it. This bias is still relatively small, rarely over 5-10%.
So let's put this one to rest once and for all; games don't suck because they are not optimized for a specific GPU. It doesn't work that way.
Do you have concrete evidence of that?
Even if that is true, the point of benchmarking 15-20 games is that it will eliminate outliers.
Your observation of idle resources is correct (3); that is a result of the big problem with GCN.
You raise some important questions here, but the assessment is wrong (4).
As I've mentioned, GCN scales nearly perfectly on simple compute workloads. So if a piece of hardware can scale perfectly, you might be tempted to think that the fault lies not with the hardware but with the workload? Well, that's the most common "engineering" mistake: you have a problem (the task of rendering) and a solution (the hardware), and when the solution isn't working satisfactorily, you re-engineer the problem rather than the solution. This is why we always hear people scream that "games are not optimized for this hardware yet"; well, the truth is that games rarely are.
The task of rendering is of course in principle just math, but it's not as simple as people think. It's actually a pipeline of workloads, many of which may be heavily parallel within a block, but which may also have tremendous amounts of resource dependencies. The GPU has to divide these rendering tasks into small worker threads (GPU threads, not CPU threads) which run on the clusters, and based on the memory controller(s), caches, etc. it has to schedule things so that the GPU stays well saturated at all times. Many things can cause stalls, but the primary ones are resource dependencies (e.g. multiple cores needing the same texture at the same time) and dependencies between workloads. Nearly all of Nvidia's efficiency advantage comes down to this, which answers your (3).
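The saturation point above can be illustrated with a deliberately simplified toy model (my own sketch, not any real GPU's scheduler): compare one big independent batch of threads against the same total work split into many dependent batches, each of which must retire before the next may start.

```python
# Toy utilization model: "units" are parallel execution units, a batch is a
# group of threads that can run together, and a dependency barrier sits
# between consecutive batches. Purely illustrative numbers.
from math import ceil

def run(batches: list[int], units: int) -> tuple[int, float]:
    """Return (total cycles, utilization) when each batch must fully
    retire before the next batch may start."""
    cycles = sum(ceil(b / units) for b in batches)
    return cycles, sum(batches) / (cycles * units)

UNITS = 64

# Simple compute: one big independent batch -> near-perfect utilization
print(run([40_000], UNITS))     # -> (625, 1.0)

# Rendering-like workload: same total work, but 1000 dependent batches of
# 40 threads each; every batch leaves 24 of the 64 units idle
print(run([40] * 1000, UNITS))  # -> (1000, 0.625)
```

Same arithmetic, same total work, yet utilization drops from 100% to 62.5% purely because of the dependency structure; that is the kind of stall the scheduler has to hide.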
Even with the new "low-level APIs", developers still can't access low-level instructions or even low-level scheduling on the GPU. There are certainly things developers can do to render more efficiently, but most of that will be bigger things (on a logical or algorithmic level) that benefit everyone, like changing the logic in a shader program or achieving the same result with fewer API calls. The true low-level optimizations that people fantasize about are simply not possible yet, even if people wanted to do them.
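A classic example of such an algorithmic-level optimization is reducing redundant state changes by sorting draw calls. The sketch below is hypothetical (the draw records and state names are made up), but the technique is real and vendor-neutral:

```python
# Illustrative only: each draw call is a (shader, texture) state tuple.
# Binding state is expensive, so we count how many times state must be
# (re)bound for a given submission order.
from itertools import groupby

draws = [("shaderA", "tex1"), ("shaderB", "tex2"), ("shaderA", "tex1"),
         ("shaderB", "tex2"), ("shaderA", "tex3"), ("shaderA", "tex1")]

def state_changes(calls):
    """One state-binding call per run of consecutive identical state."""
    return sum(1 for _key, _grp in groupby(calls))

print(state_changes(draws))          # -> 6 (state re-bound before almost every draw)
print(state_changes(sorted(draws)))  # -> 3 (each unique state bound once)
```

Sorting by state cuts the binding calls in half here; it benefits every GPU equally, which is exactly why it isn't an "optimization for" any one vendor.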