That's where you have to balance pipeline length against the clock speed you want to reach; it's a balancing act, and it's where good branch predictors come into play. If your branch predictor is good, or at least can learn along the way much like Ryzen's branch predictor can, you can have a long pipeline without incurring a performance penalty. However, with a bad branch predictor like the one in the old Intel Pentium 4 Prescott, a long pipeline can result in a severe loss in performance.
In fact, you don't even need branches or cache misses for pipeline length to cause stalls. Code is full of data dependencies. Say you have a simple calculation like this:
Code:
d = a + b + c;
e = a + d;
f = e + b;
g = f + c;
You have multiple dependencies here, which have to be resolved sequentially. Each dependent instruction has to wait for the previous one to be completely executed, so the length of the pipeline affects the length of the stall. Since the 90s, CPUs have tried to work around this with out-of-order execution, and longer pipelines also mean the dependencies have to be executed even earlier to prevent stalls. But eventually this makes branching an even larger problem, since each misprediction causes all those in-flight calculations to be discarded. So if there are dependencies after the branch, you not only get a larger stall because of the flushing, but also because you then have to execute multiple dependencies without any benefit of out-of-order execution. This is why the penalties of long pipelines and mispredictions multiply.
Skylake does in fact have better branch prediction than Ryzen; even old Sandy Bridge does it better. But branch prediction can only help so much, since it's basically just statistics about which conditionals usually evaluate to true and which do not. If a conditional is 99% true and 1% false, the predictor will start guessing 99% correctly after a few iterations. But if a conditional is ~50% true and ~50% false, the CPU will only guess half of them correctly, and that is in fact the theoretical maximum. If you want to improve performance beyond this, you're left with trying to reduce the penalty costs, or rewriting the software.
And one final note: the branch predictor (and prefetcher in general) was much better in Prescott than in the Athlon64, but the severe penalties of the super-long pipeline outweighed the benefits of the better prefetcher. There are limits to what a good prefetcher can do, so even with the best prefetcher, Intel was crushed by a much simpler design.