Easy to start a thread, yes. But you're forgetting the time it takes to spin up and schedule the thread, the time spent waiting if you're using locks, and the latency incurred if you use a queue instead. It's more than just spinning up a thread; it's how you use it and what its characteristics are.
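To make that concrete, here's a rough sketch in plain Java (the mechanics are the same anywhere on the JVM): it times spawning and joining a thread for a trivial sum versus doing the same work inline. The class and variable names are just for illustration, and the exact numbers will vary by machine, but the thread path reliably costs orders of magnitude more.

```java
// Rough sketch: the cost of spawning + scheduling + joining a thread
// versus doing the same trivial work inline on the current thread.
public class ThreadCost {
    public static void main(String[] args) throws InterruptedException {
        long[] result = new long[1];

        long t0 = System.nanoTime();
        Thread t = new Thread(() -> result[0] = 1 + 2 + 3);
        t.start();
        t.join();                            // wait for scheduling + execution
        long threadedNs = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        long inline = 1 + 2 + 3;             // same work, no thread involved
        long inlineNs = System.nanoTime() - t1;

        System.out.println("on a new thread: " + result[0] + " in " + threadedNs + " ns");
        System.out.println("inline:          " + inline + " in " + inlineNs + " ns");
    }
}
```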
Vulkan is basically OpenGL, but every command that would normally be executed in OpenGL's global scope gets recorded into command buffers, or in other words, queues of commands. The application prepares a series of commands that tells the Vulkan driver what to do. The performance win comes from those command buffers: because you can submit multiple command buffers, you essentially get a queue of queues representing the full set of processing you need to do. This decouples the process/thread that actually renders from the processes/threads that describe what needs to be done.

This is a case where the latency incurred from making hundreds of thousands of individual OpenGL draw calls is greater than submitting an ordered list of things that need to be done, because the engine can keep processing the workload as long as there is a queue to drain, instead of waiting on a render loop to issue draw commands one at a time. But it's not like Vulkan is killing/starting processes or threads to do all of this; the same threads are reused, precisely because of the overhead of setting everything up. It's also quite realistic that some command buffers could be static and prepared ahead of time, so all that needs to be done at render time is to submit them.
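The "queue of queues" idea can be sketched without any graphics API at all. In this toy Java model (all names are made up; a real Vulkan command buffer is an opaque driver object, not a list of Runnables), each "command buffer" is an ordered list of commands, whole buffers are submitted to a queue, and one long-lived render thread drains it. Recording is decoupled from execution, no threads are created or destroyed per submission, and a "static" buffer recorded once can be resubmitted cheaply every frame.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of command-buffer submission: a queue of queues of commands,
// drained by a single long-lived "render" thread.
public class CommandBuffers {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<List<Runnable>> submitQueue = new LinkedBlockingQueue<>();

        Thread renderThread = new Thread(() -> {
            try {
                while (true) {
                    List<Runnable> buffer = submitQueue.take(); // next command buffer
                    if (buffer.isEmpty()) break;                // empty buffer = shutdown signal
                    for (Runnable cmd : buffer) cmd.run();      // replay its commands in order
                }
            } catch (InterruptedException ignored) { }
        });
        renderThread.start();

        // A "static" buffer recorded once, ahead of time, and reused every frame.
        List<Runnable> staticBuffer = List.of(
            () -> System.out.println("bind pipeline"),
            () -> System.out.println("draw static geometry"));

        for (int frame = 0; frame < 3; frame++) {
            submitQueue.put(staticBuffer);                      // resubmission is cheap
            int f = frame;                                      // capture for the lambda
            submitQueue.put(List.of(() -> System.out.println("draw dynamic frame " + f)));
        }
        submitQueue.put(List.of());                             // shut the render thread down
        renderThread.join();
    }
}
```

The point of the sketch is the decoupling: the recording side never blocks on rendering, and the rendering side never waits on a per-draw-call loop, only on the next whole buffer.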
So I write Clojure, which is a Lisp on top of the JVM and JavaScript. It's a great language for concurrent programming. Consider the time it takes to spin up a thread:
Code:
> (time (async/<!! (async/thread (+ 1 2 3))))
"Elapsed time: 0.834924 msecs"
6
Just adding 1, 2, and 3 on a new thread and returning the value takes almost a full millisecond, and the majority of that is spinning up the thread. Compare the same expression run inline:
Code:
(time (+ 1 2 3))
"Elapsed time: 0.049105 msecs"
6
Keep in mind, I'm running this interactively, so the time includes parsing and compiling since it's being JIT'ed on the spot. It would take less time if I AOT'ed it, but you would still see the same kind of difference in performance, probably even more pronounced, because the addition itself is a lot faster than that once compiled.
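That warm-up effect is easy to see on the JVM in general. Here's a hedged Java sketch (names are mine, and the exact numbers depend entirely on your machine and JVM flags): time the first cold call to a tiny method, hammer it so the JIT kicks in, then time it again.

```java
// Sketch of JIT warm-up: the same trivial method, timed cold and
// then again after enough calls for the JIT to likely compile it.
public class WarmUp {
    static long add() { return 1 + 2 + 3; }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        add();                                   // cold: interpreted on the first call
        long coldNs = System.nanoTime() - t0;

        for (int i = 0; i < 1_000_000; i++) add(); // let the JIT do its work

        long t1 = System.nanoTime();
        add();                                   // warm: likely JIT-compiled by now
        long warmNs = System.nanoTime() - t1;

        System.out.println("first call: " + coldNs + " ns, after warm-up: " + warmNs + " ns");
    }
}
```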