The Bulldozer family groups pairs of cores into modules. Within these modules, some of the hardware is shared between the two cores. Performance problems can occur when both cores try to use this shared hardware heavily at the same time. This is similar to, though less pronounced than, the per-thread performance penalty of running two threads on the same physical core using Intel's Hyper-Threading.
In the early members of the family (Bulldozer and Piledriver), the instruction decoder (capable of decoding four instructions per cycle) is shared. It decodes instructions for one core each cycle, switching to the other core (if it's active) on the next cycle. In later members of the family (Steamroller and Excavator), a separate decoder is provided for each core, eliminating this bottleneck.
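As a rough illustration of the alternating decoder described above, here is a toy model that counts how many instructions each core can have decoded at peak over a number of cycles. It is only a sketch, not a cycle-accurate simulation: the function name, the `shared_decoder` flag and the assumption that the decoder always delivers its full width of four are made up for the example.

```python
def decoded_per_core(cycles, active_cores, shared_decoder, width=4):
    """Toy model: peak instructions decoded for each active core.

    shared_decoder=True  -> one decoder alternates between the active
                            cores, serving a single core per cycle
                            (Bulldozer/Piledriver style).
    shared_decoder=False -> every core has its own decoder
                            (Steamroller/Excavator style).
    Assumes the decoder always delivers `width` instructions per cycle,
    which real code rarely achieves.
    """
    if active_cores not in (1, 2):
        raise ValueError("a module has one or two active cores")
    totals = [0] * active_cores
    for cycle in range(cycles):
        if shared_decoder:
            # The single decoder serves exactly one core this cycle.
            totals[cycle % active_cores] += width
        else:
            # Each core's private decoder works every cycle.
            for core in range(active_cores):
                totals[core] += width
    return totals


if __name__ == "__main__":
    print(decoded_per_core(1000, 1, shared_decoder=True))
    print(decoded_per_core(1000, 2, shared_decoder=True))
    print(decoded_per_core(1000, 2, shared_decoder=False))
```

With both cores active and a shared decoder, each core gets half of the peak decode bandwidth it would have alone. With separate decoders, each core keeps the full rate. Whether this matters in practice depends on how close the code comes to the decode limit in the first place.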
In all members of the family, the L1 I-cache is shared between the two cores of a module (the L1 D-cache, by contrast, is private to each core). Since these caches are quite small (compared to Phenom II), cache thrashing is more severe when both cores are active than when only one is. The L1 caches are larger in Excavator than in previous members of the family, which contributes to its better efficiency.
The FPU is also shared in all members of the family. Most FPU instructions are multiplies or adds, so they use the FMAC pipelines, of which there are two per module. When both cores are running FPU-heavy code, effectively only one FMAC pipeline is available to each core. This is however no worse than in Phenom II, which had one multiplier and one adder in its FPU, in separate pipelines.
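This is a diagram for one module, which has 2 cores. The module has 2 integer units and 1 FPU, and its two cores share an L2 cache. Conceptually, a module running two threads has twice the integer throughput of a single core, but each thread gets only half of the floating point throughput when both are using the FPU.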
Since most server/rendering workloads are integer-based, CMT (clustered multi-threading) scales well with multi-threading - AS LONG AS the threads are being run correctly on modules, and not split between multiple modules unnecessarily.
Windows 7 has an issue with how it schedules threads on Bulldozer-based processors. W7 treats the cores as fully independent, and will willy-nilly schedule threads wherever. This can cause tasks that should otherwise share FPU resources to be split across multiple modules, which degrades performance.
This was changed in Windows 8/8.1/10, which treat the processor as a 4 core, 8 thread chip (instead of 8 core, 8 thread) in order to schedule threads properly. At a high level, this makes the scheduler handle a module the way it handles an SMT (Hyperthreading) core, and results in a decent performance boost in W8/W10 for AMD processors.
There is a patch (manual install) for Windows 7 that makes it schedule in the same manner, though it doesn't change the appearance of Task Manager. You still see all 8 cores.
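To make the scheduling difference concrete, here is a small sketch of module-aware thread placement. It is not how the Windows scheduler is implemented. The function names are made up, and it ASSUMES that the two cores of a module are numbered adjacently (0 and 1, 2 and 3, and so on), which should be checked against what the OS reports rather than taken for granted.

```python
def cpus_by_module(n_logical_cpus, cores_per_module=2):
    """Group logical CPU ids into modules.

    ASSUMPTION: sibling cores of a module have adjacent ids.
    """
    return [
        list(range(start, min(start + cores_per_module, n_logical_cpus)))
        for start in range(0, n_logical_cpus, cores_per_module)
    ]


def place_threads(n_threads, modules, spread=True):
    """Pick one CPU id per thread.

    spread=True  -> use the first core of every module before any
                    second core, so threads avoid sharing a module.
    spread=False -> fill both cores of a module before moving on,
                    so threads are packed onto as few modules as possible.
    """
    if spread:
        deepest = max(len(module) for module in modules)
        order = [
            module[slot]
            for slot in range(deepest)
            for module in modules
            if slot < len(module)
        ]
    else:
        order = [cpu for module in modules for cpu in module]
    if n_threads > len(order):
        raise ValueError("more threads than logical CPUs")
    return order[:n_threads]


if __name__ == "__main__":
    modules = cpus_by_module(8)  # an "8 core" chip seen as 4 modules
    print(place_threads(4, modules, spread=True))
    print(place_threads(4, modules, spread=False))
```

For four threads on a four-module chip, spreading gives each thread a module (and FPU) to itself, while packing uses only two modules. On Linux, a process could then be pinned to a chosen CPU with `os.sched_setaffinity(0, {cpu})`. Which policy is faster depends on whether the threads compete for the shared hardware or benefit from sharing the L2 cache.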
I don't know who pissed in your tea Ford. You also keep stating your opinion as fact.