Obvious typo is obvious: 1 clock != 1 second. 4.0 GHz = 4 billion clocks per second.
My bad, I corrected it.
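For anyone skimming, the clock arithmetic is quick to sanity check. This is just a sketch; the function names are made up for illustration.

```python
def cycles_per_second(freq_ghz: float) -> float:
    """Clock cycles per second for a frequency given in GHz."""
    return freq_ghz * 1e9


def cycle_time_ns(freq_ghz: float) -> float:
    """Duration of one clock cycle in nanoseconds (1 / frequency in GHz)."""
    return 1.0 / freq_ghz


# 4.0 GHz means 4 billion clocks per second, so one clock lasts 0.25 ns.
print(cycles_per_second(4.0))
print(cycle_time_ns(4.0))
```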
It does appear I had it backwards...assuming no cache misses; that's the point though, isn't it? With two threads in the core, more usually gets done. The difference between HTT and Bulldozer's implementation is that Bulldozer should theoretically (assuming all else were equal) be able to do more integer operations in the same time frame. That still doesn't change the definition of a core.
The point is that the overall poor performance is due to a slim core, not a shared module, and the throughput of BD in my last post is a very clear indicator of that.
I find it ironic K10 beats Bulldozer on pretty much every one. The only advantage Bulldozer has over K10 is the higher clockspeeds.
Clock-wise, K10 was a significantly larger core, but it also had a lot more under the hood dedicated to one core. I'll give AMD that they were able to squeeze quite a bit of parallel throughput out of these CPUs, but that's never the kind of workload consumers really need to care about.
The simple fact is that BD has two real cores per module. The problem is that while uOps execute just as fast, instructions that have certain combinations of uOps are going to impact AMD's BD core a lot more than one of Intel's. Even Intel has shown that they would rather beef up a core, and AMD's problem is that two lanky cores aren't going to provide the single-threaded throughput you want. If there are instructions that take fewer cycles to complete on Intel CPUs, that's a pretty telltale sign that it's the cores themselves. Add to that the fact that BD cores scale almost linearly on purely parallel workloads (excluding certain FP applications, but that really depends on the particular instructions being used).
Nothing here says to me that they're not 8 real cores. What people are pissed off about is that they're 8 gimped cores, even for integer operations, but that's not because of shared components. If it were a real implementation of hardware SMT like hyper-threading, we wouldn't see the kind of scaling we're seeing with modules, which is near linear for purely parallel workloads. What we're seeing is 8-core CPUs where every core is something like 80% of what it should be. It scales properly and runs properly, except that single-threaded performance is 20% less than where it should have been. People were expecting Phenom II-like performance in single-threaded applications plus BD performance in multi-threaded applications, and that wasn't the result.
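To put rough numbers on the scaling argument, here is a toy model. The 80% per-core figure is the ballpark from this post; the 25% SMT uplift and the 4-core/8-thread comparison chip are purely assumptions for illustration, not measurements of any real CPU.

```python
def real_core_throughput(cores: int, per_core_perf: float) -> float:
    """Every thread gets its own (possibly weaker) core, so throughput scales linearly."""
    return cores * per_core_perf


def smt_throughput(cores: int, per_core_perf: float, smt_uplift: float) -> float:
    """A second thread per core only adds a fractional uplift on top of that core."""
    return cores * per_core_perf * (1.0 + smt_uplift)


# Ballpark from the post: 8 cores, each about 80% of a "full" core.
bd_single = 0.8
bd_multi = real_core_throughput(8, bd_single)

# Hypothetical comparison: 4 full-strength cores, 8 threads, assumed 25% SMT uplift.
smt_single = 1.0
smt_multi = smt_throughput(4, smt_single, 0.25)

# Speedup going from 1 thread to 8 threads on each design.
print("8 slim cores speedup:", bd_multi / bd_single)
print("4c/8t SMT speedup:  ", smt_multi / smt_single)
```

Under those assumptions the slim-core chip scales about 8x from one thread to eight while the SMT chip scales about 5x, even though the SMT chip wins the single-threaded case. Near-linear scaling is what you'd expect from real cores, not from SMT.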
AMD made some choices, and the result was a focus on more cores and less on individual core performance. As a result, people got irritated that their skinny cores couldn't bite off enough at once and wanted their fatter cores that were more efficient in single-threaded applications back (here comes Zen!)
Our disagreement isn't over whether Bulldozer blows, it's over how it blows, and I think blaming the FPU and shared components is a bit of a stretch given the amount of information indicating that even integer performance is trailing K10 per clock. They only try to make up for that with clock speeds, as you said. None of this has to do with whether it has 8 real cores or not; it has to do with how shitty the slimmed-down integer cores are. Mix that with the shared FPU and added latencies on FP instructions, and you have a recipe for lackluster performance. All of which can still happen even if there are 8 real cores.
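The "make up for it with clock speed" trade-off is easy to sketch if you treat performance as roughly per-clock throughput times clock speed. That's a simplification, and the 0.8 relative per-clock figure below is just the ballpark 80% number reused as an assumption, not a benchmark result.

```python
def relative_performance(relative_ipc: float, relative_clock: float) -> float:
    """Simplified model: performance ~ work per clock * clock speed."""
    return relative_ipc * relative_clock


def clock_needed_to_match(relative_ipc: float) -> float:
    """How much faster the clock must run to cancel out a per-clock deficit."""
    return 1.0 / relative_ipc


# Assumed: the slim core does about 80% of the work per clock of the older core.
needed = clock_needed_to_match(0.8)
print("clock multiplier needed to break even:", needed)
print("resulting relative performance:", relative_performance(0.8, needed))
```

With a 20% per-clock deficit you need about a 25% higher clock just to break even, which is why the higher clocks alone don't put it ahead.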
Take Intel's 8c Atom, the C2750 I think it is. Its performance trails Core series CPUs at the same clock speed that have half as many cores but with SMT, so does that mean the Atom doesn't have real cores? NO! It means the Atom's core is lacking in performance despite the chip having 8 real cores, and it doesn't use every clock cycle as efficiently as the i5s and i7s do, just like Bulldozer.