There is hardware both clusters rely on (they are not independent). AMD's implementation of SMT is only about 25% faster than Intel's, but at a significantly higher cost in design effort and die space. An extra physical core should always represent a near-100% increase in performance no matter the application, because it is truly independent (both show this when going up to four cores). What AMD did with Bulldozer is enable a single core to handle two threads more efficiently when both threads are in the core. There's absolutely nothing wrong with that. In fact, AMD's implementation is considerably better than Intel's, but that in no way means a module has two distinct cores. Nothing suggests it does except the box. The hardware isn't there, the relative performance isn't there, and newer operating systems show half the cores AMD claims.
You're wrong. If you had worked with Xeons in datacenters, mid-sized company clusters or supercomputers, you'd know that. The software and the workload have a huge effect on how multicore and multiprocessor systems behave.
I'll show you yet another example, a program I personally worked on.
http://compression.ca/pbzip2/
Nothing in real life has an infinite linear curve. There's always a limit to how far an application can be parallelized. Every processor, every piece of software and every instruction set shows that kind of behaviour.
It is a law (Amdahl's law): it has been calculated, and it is calculable.
http://research.cs.wisc.edu/multifacet/amdahl/
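To make the limit concrete, here is a minimal sketch of Amdahl's law, S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. The 95% parallel fraction below is a hypothetical value chosen for illustration, not a measurement of pbzip2 or any other program.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when `parallel_fraction` of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial = 1.0 - parallel_fraction  # the part that never gets faster
    return 1.0 / (serial + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 95% parallelizable (an assumption, not measured data).
    p = 0.95
    for n in (1, 2, 4, 8, 16, 64):
        print(f"{n:>3} cores: {amdahl_speedup(p, n):5.2f}x")
    # As n grows without bound, the speedup approaches 1 / (1 - p), i.e. 20x here.
    print(f"limit: {1.0 / (1.0 - p):5.2f}x")
```

Even with 95% of the work parallel, four cores give roughly 3.48x rather than 4x, which is why an extra core never adds a flat 100% across every application.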
Get your science right.
EDIT: You have remained blind to all the proof presented to you so far. Don't say there's no proof; everything converges as proof. If you had an ounce of honesty, you'd see I can prove you wrong. I ran the test myself some time ago, while working on the ondemand governor, which makes the core clock fluctuate depending on the workload. I ran it to be 100% sure that, if one core is at 1600 MHz and the other is at 3900 MHz, the virtual machine bound to one core of the module won't be affected by the other core running at the lower speed. Both integer and FPU work are slower on the 1600 MHz core, and both are faster on the 3900 MHz core. Both work independently. That holds until I issue an AVX instruction: then the entire FPU runs at the clock speed of the core requesting the unification, until the instruction is done. You can even try it yourself with QEMU/KVM by setting core affinity for the VMs, one on the first core of the module and the other on the second. It's very easy to replicate. When I don't use AVX, I have two cores in a module 100% of the time. But you keep burying your head in the sand and playing blind.
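The core-affinity idea can be sketched on Linux without VMs: pin a process to one logical CPU with `os.sched_setaffinity` and time an integer-only loop. This is a hypothetical, simplified stand-in for the QEMU/KVM setup described above (it starts no VMs and does not touch the governor), the loop size is arbitrary, and interpreter overhead dominates, so only relative timings between CPUs on the same machine mean anything. The default CPU numbers 0 and 1 are an assumption; which logical CPUs form a module depends on the system.

```python
import os
import sys
import time


def time_on_cpu(cpu: int, iterations: int = 2_000_000) -> float:
    """Pin this process to logical CPU `cpu`, run an integer loop, return seconds elapsed.

    Linux-only: os.sched_setaffinity / os.sched_getaffinity are not available elsewhere.
    """
    original = os.sched_getaffinity(0)  # remember the current CPU mask
    os.sched_setaffinity(0, {cpu})      # ask the kernel to run us only on `cpu`
    try:
        start = time.perf_counter()
        acc = 0
        for i in range(iterations):
            acc = (acc * 31 + i) & 0xFFFFFFFF  # integer-only work, no floating point
        return time.perf_counter() - start
    finally:
        os.sched_setaffinity(0, original)  # restore the original mask


if __name__ == "__main__":
    # Hypothetical default: CPUs 0 and 1, assumed to be the two cores of one module.
    cpus = [int(arg) for arg in sys.argv[1:]] or [0, 1]
    for cpu in cpus:
        print(f"cpu {cpu}: {time_on_cpu(cpu):.3f} s")
```

Running it on each core of a module while the two cores sit at different clock speeds gives the kind of side-by-side comparison the test relies on.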
EDIT 2: I am also surprised at how inaccurate your points of view are. It's actually the opposite: AMD has a semi-SMT, because it can't run another thread on a core within a module. That's one of the bottlenecks of the cores in the module, along with the fact that it doesn't order work well. Hyper-Threading is a much, much better SMT implementation, and it takes a lot more space on the die. Complete opposite. As for what you seem to be talking about, you're claiming it's transparent to the OS, when it's not, at all. 95% of thread management is done in software: by the kernel and the threading library. The processor only routes threads to the right core/module and then to the right hardware thread. The processor doesn't decide which core takes which thread; the kernel does. What the processor decides is what it does with the thread. Intel's SMT is also better because it doesn't have two cores to feed: the hyperthreaded one doesn't have to be fed on time and constantly, the way the AMD Bulldozer cores do.
Microsoft is very, very bad at handling threads, unlike Linux, because Linux was used on SPARC and other servers with something like 8 chips, 16 cores and 128 threads. Before Hyper-Threading, the Windows kernel never had a proper threading library. It's normal for a SPARC to have so many threads, as it's RISC, not CISC. In Linux, they just had to modify some parts of the kernel to make it recognize the module as a core with multiple threads. That doesn't change anything, except that threads get addressed the way they should be. If a big SMT design were detrimental, you can be sure a SPARC or Alpha processor would be bottlenecked like there's no tomorrow. But it's not. Pretty much everything your logic rests on is the opposite of what is established in computer engineering.
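To see how a given Linux kernel actually groups logical CPUs, you can read the sysfs topology files: `thread_siblings_list` under `/sys/devices/system/cpu/cpuN/topology/` lists the logical CPUs the kernel treats as hardware-thread siblings of CPU N. The sketch below collects those groups; it makes no assumption about how a particular kernel version reports a Bulldozer module, it just prints what the kernel exposes. The helper names `parse_cpulist` and `sibling_groups` are hypothetical.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpulist(text: str) -> list[int]:
    """Parse a kernel CPU list such as '0-1', '0,4' or '0-2,8' into integers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def sibling_groups(root: Path = SYSFS_CPU) -> set[tuple[int, ...]]:
    """Return each distinct group of logical CPUs the kernel reports as thread siblings."""
    groups: set[tuple[int, ...]] = set()
    for path in root.glob("cpu[0-9]*/topology/thread_siblings_list"):
        groups.add(tuple(parse_cpulist(path.read_text())))
    return groups


if __name__ == "__main__":
    for group in sorted(sibling_groups()):
        kind = "SMT siblings" if len(group) > 1 else "no siblings"
        print(f"{group}: {kind}")
```

Pairs such as `(0, 1)` mean the kernel schedules those two logical CPUs as threads of one core; single-element groups mean it sees independent cores.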