No, I actually understand this, and I don't reference random things from the Internet that I don't comprehend.
As mentioned in #124, you can have multiple threads building a queue, but it comes at the cost of synchronization overhead. Rendering is a pipeline of steps: you can parallelize inside each step, but the steps still have to be executed serially. So if a rendering pass consists of steps a) -> b) -> c) -> d) -> …, you can use deferred contexts to have four threads submitting commands to the queue, but you can't have one thread working on c) while another is still working on a) (see the sketch below). Synchronizing CPU threads is very expensive, and doing it many times during a single frame will cost milliseconds. It only makes sense when the overhead of the rendering thread (engine overhead, not API overhead) is greater than the synchronization overhead, which is unusual. This code example shows a simple scene with just simple cubes, while rendering in real games is much more complex, so applying this there is much more challenging. This technique is therefore only applicable to certain scenarios or edge cases.
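To make that pipelining constraint concrete, here is a minimal D3D11 sketch (not taken from the sample; `RecordPass`, `RenderFrame`, and the empty pass bodies are hypothetical placeholders). Recording into per-thread deferred contexts is parallel, but the resulting command lists still execute in pass order on the single immediate context, and the `join()` before submission is exactly the CPU synchronization point I'm talking about:

```cpp
#include <d3d11.h>
#include <wrl/client.h>
#include <thread>
#include <vector>

using Microsoft::WRL::ComPtr;

// Hypothetical worker: records one pipeline step into its own deferred context.
void RecordPass(ID3D11DeviceContext* deferred, ID3D11CommandList** outList)
{
    // ... issue the draw calls for this step on `deferred` ...
    // FALSE = don't save/restore the deferred context state (the cheaper option).
    deferred->FinishCommandList(FALSE, outList);
}

void RenderFrame(ID3D11Device* device, ID3D11DeviceContext* immediate)
{
    const int kSteps = 4; // passes a) .. d)
    std::vector<ComPtr<ID3D11DeviceContext>> deferred(kSteps);
    std::vector<ComPtr<ID3D11CommandList>>   lists(kSteps);
    std::vector<std::thread>                 workers;

    for (int i = 0; i < kSteps; ++i)
        device->CreateDeferredContext(0, &deferred[i]);

    // Recording is parallel: four threads build command lists at once.
    for (int i = 0; i < kSteps; ++i)
        workers.emplace_back(RecordPass, deferred[i].Get(), lists[i].GetAddressOf());
    for (auto& t : workers)
        t.join(); // <-- the CPU synchronization cost described above

    // Submission is still serial: a) -> b) -> c) -> d), in pass order,
    // on the single immediate context. No thread can "run ahead" to c).
    // FALSE = don't restore the immediate context's state afterwards (cheaper).
    for (int i = 0; i < kSteps; ++i)
        immediate->ExecuteCommandList(lists[i].Get(), FALSE);
}
```

Note that the parallel part is only the CPU-side recording; the GPU still consumes one ordered stream, which is why this only pays off when building the commands is the bottleneck.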
As is evident from the code example you clearly don't understand, this has to be designed into the rendering engine. This feature works around CPU overhead in the rendering engine itself, not in the driver. I can guarantee that this is not what makes Pascal and Maxwell outperform GCN, since it sits in the rendering engine's realm, outside Nvidia's control. And using deferred contexts makes no difference on the GPU side; this is purely a rendering engine optimization.