The FPU isn't the only shared component. The entire instruction decoder and the associated L1 instruction cache serve the FPU and both integer clusters. The only thing that makes Bulldozer unique is that it has two integer clusters instead of one big one; the whole cohesive unit is still a core.
The only shared components are the fetch/decode units, L1 instruction cache, L2 cache, and FPU.
Come Steamroller, AMD went from a single shared 4-wide decoder to two decoders, each serving either one of the integer cores or the floating-point unit. That leaves the fetch unit, the L1i, the L2, and the FPU as shared.
The Core 2 had a shared L2 cache and is still considered to have two cores, so I consider the L2 argument moot. That leaves the fetch unit, the L1i, and the FPU.
As for the fetch unit, testing seems to indicate that it is not a bottleneck and that improving it won't yield much tangible benefit:
Agner’s tests, however, may shed some light on the problem. According to his work, the fetch units on Bulldozer, Piledriver, and Steamroller, despite being theoretically capable of handling up to 32 bytes (16 bytes per core) tops out in real-world tests at 21 bytes per clock. This implies that doubling the decode units couldn’t help much — not if the problem is farther up the line. Steamroller does implement some features, like a very small loop buffer, that help take pressure off the decode stages by storing very small previously decoded loops (up to 40 micro-instructions), but the fact that doubling up on decoder stages only modestly improved overall performance implies that significant bottlenecks still exist.
source
So that would leave the L1i and the FPU. The FPU is undoubtedly shared, I'm not denying that, and the L1i cache is shared because that makes sense when the fetch unit is also shared. So the L1i and the FPU are the only shared components that may actually make a difference.
What blows my mind is that people forget AMD went from the Phenom II being able to execute three integer operations per clock cycle to two on the current architecture, which could have serious implications for purely integer code. I think the source I provided earlier sums it up best:
According to Agner, "Two of the pipes have all the integer execution units while the other two pipes are used only for memory read instructions and address generation (not LEA), and on some models for simple register moves. This means that the processor can execute only two integer ALU instructions per clock cycle, where previous models can execute three. This is a serious bottleneck for pure integer code. The single-core throughput for integer code can actually be doubled by doing half of the instructions in vector registers, even if only one element of each vector is used."
This has been the case since Bulldozer debuted — but issues here could explain why integer performance on Steamroller is so low compared to other cores. This is where things become frustratingly opaque — each of the areas we’ve identified could be the principle bottleneck — or it’s possible that the bottleneck is a combination of multiple factors (long pipelines, low fetch, cache collisions and low integer performance).
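To make Agner's vector-register trick concrete, here's a minimal sketch in C with SSE2 intrinsics. The function name and the trivial summing workload are my own illustration, not something from his manual; the point is just that half of the independent integer work can be pushed through the shared FPU/vector pipes while the other half stays on the dedicated integer ALUs:

/* Hypothetical example of splitting independent integer work between
 * the integer ALU pipes and the vector unit. Compile with something
 * like gcc -O2 -msse2; names and workload are illustrative only. */
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

uint32_t sum_mixed(const uint32_t *data, size_t n)
{
    uint32_t scalar_sum = 0;                    /* runs on the integer ALU pipes */
    __m128i vector_sum = _mm_setzero_si128();   /* runs on the shared FPU/vector unit */

    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        /* one element through a scalar add ... */
        scalar_sum += data[i];
        /* ... and the next through a vector add that only uses the low 32-bit lane */
        vector_sum = _mm_add_epi32(vector_sum, _mm_cvtsi32_si128((int)data[i + 1]));
    }
    if (i < n)
        scalar_sum += data[i];                  /* leftover element if n is odd */

    /* fold the low lane of the vector accumulator back into the scalar total */
    return scalar_sum + (uint32_t)_mm_cvtsi128_si32(vector_sum);
}

Whether that kind of interleaving actually helps depends on the loop, and on most other CPUs the extra vector traffic would just be overhead, which is kind of the point about how lopsided the module's integer resources are.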
I'm not disagreeing that Bulldozer's performance sucks; that's why I got my 3820. But I'm not convinced it's the shared components so much as the skimpy dedicated components that are impacting performance. Zen, having a beefier integer core, very well might make up for the shortcomings of the dedicated components in these CPUs.
That's my only point: there is nothing stopping the dedicated hardware from being the bottleneck, especially if it was chopped down just to fit two copies of a given component in.
With that all said, I still think the really long pipeline is probably the main issue.