Saturday, December 17th 2011
AMD Bulldozer Threading Hotfix Pulled
Since we reported on the AMD Bulldozer hotfix, The Tech Report, in an updated post, reports that the Bulldozer threading hotfix said to improve the processor's performance has been pulled:
We've spoken with an industry source familiar with this situation, and it appears the release of this hotfix was either inadvertent, premature, or both. There is indeed a Bulldozer threading patch for Windows in the works, but it should come in two parts, not just one. The patch that was briefly released is only one portion of the total solution, and it may very well reduce performance if used on its own. We're hearing the full Windows update for Bulldozer performance optimization is scheduled for release in Q1 of 2012. For now, Bulldozer owners, the best thing to do is to sit tight and wait.
It will be very interesting indeed to see how this much-maligned processor benchmarks once the fully developed patch is released. Sure enough, attempting to download the hotfix and agreeing to the licence terms at the moment leads to a page showing it as unavailable.
90 Comments on AMD Bulldozer Threading Hotfix Pulled
In an Intel CPU, cache gets slower going from L1 to L2 to L3, and RAM is slower still. The speed differences are offset by each level having a larger data store.
In Bulldozer, the L2 cache is a fraction of the speed of both the L1 and the L3. Why? What benefit does this serve?
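For anyone who wants to put rough numbers on that on their own system, a minimal pointer-chasing sketch like the one below walks a random cycle through working sets of increasing size, so each level of the hierarchy shows up as a jump in per-load latency. The sizes and iteration count are only illustrative guesses, not tuned for Bulldozer or any other specific chip.

```cpp
// Rough pointer-chasing sketch: average load latency vs. working-set size.
// Working-set sizes below are illustrative guesses at L1/L2/L3/RAM ranges.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Walk a randomly shuffled cycle so the prefetcher cannot hide each load.
static double chase(std::size_t bytes, std::size_t steps)
{
    std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{42});

    // Link the shuffled order into a single cycle of dependent indices.
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i + 1 < n; ++i)
        next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    std::size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i)
        idx = next[idx];                 // each load depends on the previous one
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = idx;     // keep the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main()
{
    // 16 KB should sit in L1, 512 KB in L2, 4 MB in L3, 64 MB mostly in RAM.
    for (std::size_t kb : {16, 512, 4096, 65536})
        std::printf("%6zu KB : %.1f ns per load\n", kb, chase(kb * 1024, 20000000));
}
```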
If I had to make a guess without knowing the benchmark, it would be the dispatch, not the L2.
Now that I think about it, what is the word size of an FPU on previous 64-bit processors? It should be the same for AMD and Intel, I'd expect. (Yes, I know I could google it, but I'd rather you guys just explain it to me. :p )
ARM is only just getting 128-bit SIMD with the Cortex-A15.
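For what it's worth, "128-bit" here just means one instruction operating on a 128-bit packed register. A tiny SSE2 illustration (my own example, values chosen arbitrarily) multiplies two 64-bit doubles with a single instruction, a width both AMD and Intel x86-64 chips have supported since SSE2:

```cpp
// Minimal SSE2 example: one 128-bit instruction works on two packed doubles.
#include <emmintrin.h>
#include <cstdio>

int main()
{
    __m128d a = _mm_set_pd(1.5, 2.5);   // two doubles packed into one 128-bit register
    __m128d b = _mm_set_pd(4.0, 8.0);
    __m128d c = _mm_mul_pd(a, b);       // both products computed by one instruction

    double out[2];
    _mm_storeu_pd(out, c);
    std::printf("%.1f %.1f\n", out[0], out[1]);   // prints 20.0 6.0
}
```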
The whole thing about a single scheduler revolves around the FP scheduler being shared between the separate 128-bit "pipes" (as is plain in the image I posted above), but it seems to me that even workloads that don't have any floating point and are integer based benefit from moving dual threads to individual cores.
And to me, the figure of 10% performance increases seems to fit with the L2 cache being slow, rather than with the FP scheduler not being wide enough.
Now, there's a difference between Windows 8 and Windows 7 in how workloads are managed on a CPU, due to Windows 8 allowing what is called "core parking". This is basically fully shutting off a core when it's not in use, for power savings. Naturally, such control needs to be finely tuned so that threads do not stall, and bringing similar functionality to Windows 7 is what this patch is supposed to be all about. The ability to dynamically move threads from one core to the next without stalling the thread is not really a big thing, and if it really was an issue with the FP scheduler, there'd be much more than just a 10% boost possible...sometimes it would be a doubling of speed.
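For the curious, the topology information the scheduler leans on for decisions like parking and migration is visible to any program. Here's a minimal Windows-only sketch (my own illustration, not part of the hotfix) that prints how the OS groups logical processors into cores:

```cpp
// Print which logical processors Windows groups together as one "core".
#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    DWORD len = 0;
    GetLogicalProcessorInformation(nullptr, &len);   // ask for the required buffer size
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &len))
        return 1;

    for (const auto& e : info)
        if (e.Relationship == RelationProcessorCore)
            std::printf("core: logical-processor mask 0x%llx\n",
                        (unsigned long long)e.ProcessorMask);
}
```

On a Bulldozer chip, how the two cores of a module show up in that grouping is the kind of detail the scheduling patch is expected to address.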
That said, no, I do not think there is any "saving grace" for BD in this. I really feel the L2 cache is too slow, and the numbers seem to agree. When someone can tell us why the L2 cache seems to be slow, it might be clearer why BD "sucks".
Price the 8150 @ $200 and it's killer. There's really nothing wrong with BD's design. The only thing that makes it look wrong is the pricing, and that's because everyone considers BD to compete with SB (rightly so).
The problem with multithreading, once again, won't be there.
The memory subsystem isn't really the problem, other than the L3, and the L3 problem only starts when more than one module is being used.
M$ may like AMD some because they are tired of Intel's monopoly, but ultimately they make more money thanks to Intel than not. Money talks. But the patch for BD was a given; there are features of it that aren't being taken advantage of right now, or at least not as well as they could be. M$ would do the same for Intel if they came out with a new tech.
They aren't directly in M$'s pocket. More likely they are in Intel's pocket, because without AMD, Intel faces antitrust.
Nobody cares about BD's multi-threaded performance. I'm not sure we're on the same topic here.
Single-threaded performance -> Dispatch
Scroll up; I said dispatch already.
Dispatch is shared between the two cores and the shared FP. It is divided into 4 dispatches per clock (2 macro-ops per unit: Core A, Core B, FPU x 2). If you disable a cluster, though, it becomes 2 dispatches per clock (Core A x 2, FPU x 2), i.e. 4 macro-ops to Core A and 4 macro-ops to the FPU, with the FPU only needing to service Core A, which makes the ~17-stage pipeline effectively a ~14-stage pipeline (each core only needs 2 macro-ops to complete core commands, the FPU only really needs 4 macro-ops, and the decoder can do 8 macro-ops).
On Cinebench anyway, not sure about anything else as I've not tested it.
But Cinebench should be a program that would highlight this, right?
1 - It lacks hand-tuned optimisation (somebody already mentioned this)
2 - The dispatch unit needs major tweaking
3 - The L1 and L2 cache need a speed boost. Something wrong there :confused: I've seen tests showing that a 4C4M setup beats out a 4C2M setup in almost every test, and the higher you scale the CPU clock, the better the 4C4M becomes versus the 4C2M. This sharing within the Bulldozer design needs some real fine tuning IMO.
However, if one program runs two threads that don't share data, or two programs run a thread each, then the scheduler needs to run one thread per module to optimise performance. None of this is presently being done by Windows correctly, hence lower performance and higher heat and watts, so a patch should reap rewards if it works right.
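As a rough illustration of that "one thread per module" idea, an application can already do it by hand with affinity masks. In the sketch below, the assumption that logical processors 0/1 form one module and 2/3 the next is just that, an assumption; check the actual grouping with GetLogicalProcessorInformation on your own machine.

```cpp
// Pin two independent worker threads onto (assumed) separate Bulldozer modules
// instead of letting Windows pack them onto one.
#include <windows.h>
#include <cstdio>

static DWORD WINAPI worker(LPVOID arg)
{
    // Stand-in for an integer-heavy workload that shares no data with the other thread.
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < 500000000ULL; ++i)
        x += i;
    std::printf("worker %d done\n", (int)(INT_PTR)arg);
    return 0;
}

int main()
{
    HANDLE t0 = CreateThread(nullptr, 0, worker, (LPVOID)0, CREATE_SUSPENDED, nullptr);
    HANDLE t1 = CreateThread(nullptr, 0, worker, (LPVOID)1, CREATE_SUSPENDED, nullptr);

    SetThreadAffinityMask(t0, 0x1);   // logical processor 0 (module 0, assumed mapping)
    SetThreadAffinityMask(t1, 0x4);   // logical processor 2 (module 1, assumed mapping)

    ResumeThread(t0);
    ResumeThread(t1);
    WaitForSingleObject(t0, INFINITE);
    WaitForSingleObject(t1, INFINITE);
    CloseHandle(t0);
    CloseHandle(t1);
}
```

If the commenter is right, running the two workers on masks 0x1 and 0x2 (the same assumed module) should, on this theory, be slower and hotter than 0x1 and 0x4.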