I recently decided to finally do something nice to my old Asus P6T Deluxe with Core i7 920, and started off down the six-core Xeon rabbit hole, so found my way here. I've spent many hours reading this thread - great stuff. My thanks to all the contributors.
I went with an X5675 initially, but the way it hit the locked 95W TDP limit regularly and wouldn't hold max turbo bugged me, so I decided to go nuts and give the thing a W3690.
I'd always wanted to play around with an unlocked CPU - never did particularly like the "brute force" BCLK approach, so was interested in trying a "softer" overclock, using the multipliers and turbo as designed, retaining the option to go slow. So I started experimenting with BCLK=133MHz.
Learnt lots of interesting things in this thread, but also picked up a number of other tips I think it would be worth sharing. Maybe a few years too late for the X58-specific bits, but hey... Most of this should be applicable to locked X58 T3500-type platforms, and I think some remains applicable to current multiplier-unlocked CPUs. Work done on Windows 7, but I don't think much has changed in Windows 10.
First point is that a BCLK overclock forces the core to go faster, all the time. Whereas if you're using Throttlestop and turbo multipliers, you have to kind of persuade your machine to enter the "turbo zone" to use your increased multipliers. All sorts of subtle settings become important.
One of the most important is the "energy-performance bias" MSR in the CPU, which I've seen very little discussion of, outside of some Linux folks. (I've been monitoring and messing with MSRs using the RW-Everything utility.)
Windows sets this energy-performance bias from the selected power plan. Balanced sets it to "5", at which point the W3690 seems very reluctant to enter turbo state given just a single-thread load. Even with Windows telling it to enter power state "27 T" (as Throttlestop would call it) via the PERF_CTL MSR, PERF_STATUS usually reports back at "26", and you see x26.0 in Throttlestop. No Turbo. It's possible that this is due in part to the way Windows smears a single thread across multiple cores - no single core is loaded 100% long enough to trip the "turbo" threshold, perhaps. OS keeps requesting 100%/turbo, CPU can't be bothered, despite there being a flat-out single-thread workload to process, hopping between the cores. An apparently horrible Windows-vs-Intel-Turbo interaction. I gather Linux works better.
When you set Windows to "High performance", it sets the "energy-performance bias" MSR to "0" - ie maximum performance. At that point the W3690 actually seems to ignore PERF_CTL, ie any request for lower processor state, and PERF_STATUS shows it always as "27 T". Multipliers only drop when idle. And Turbo is always attempted. Yay!
(It's important to know that the power plans have a lot of hidden settings, not just the ones you can modify by default, and the hidden ones differ between plans too. You can expose a lot of settings with powercfg commands, but some, like the CPU "energy-performance bias", seem to be totally hidden and hard-coded: "High performance"=0, "Balanced"=5, "Power saver"=7. Manually poking the register with RW-Everything confirmed that this is the key make-turbo-work difference for "High performance", not any of the other setting differences.)
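For anyone wanting to check their own register dumps, here's a rough Python sketch of how I read the value. Assumptions, loudly: as far as I understand Intel's documentation, the register is IA32_ENERGY_PERF_BIAS at MSR address 0x1B0, with the hint in the low 4 bits (0 = max performance, 15 = max power saving). The plan-to-value table is just what I observed above, and the function names are made up for illustration. This doesn't read the MSR itself; you'd paste in the raw value from RW-Everything or similar.

```python
# Assumed from Intel's documentation: MSR address and 4-bit hint field.
IA32_ENERGY_PERF_BIAS = 0x1B0
EPB_MASK = 0xF  # hint lives in bits 3:0

# Values observed on this machine under Windows 7 (from the post above).
PLAN_EPB = {"High performance": 0, "Balanced": 5, "Power saver": 7}


def epb_hint(raw_msr_value):
    """Extract the 0..15 energy-performance hint from a raw MSR value."""
    return raw_msr_value & EPB_MASK


def plan_for(raw_msr_value):
    """Return the Windows plan whose observed hint matches, or None."""
    hint = epb_hint(raw_msr_value)
    for plan, value in PLAN_EPB.items():
        if value == hint:
            return plan
    return None


if __name__ == "__main__":
    for raw in (0x0, 0x5, 0x7, 0x3):
        print(hex(IA32_ENERGY_PERF_BIAS), hex(raw), epb_hint(raw), plan_for(raw))
```

A value that doesn't match any of the three plans (like 3 here) just comes back as None, which is what you'd expect after poking the register by hand.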
The next point I found important is to actually use the per-core multiplier settings. You've got an unlocked core, so do what Intel do by default, and run fewer cores faster. You don't have to limit yourself to your 12-thread stable speed.
In my tests (with auto voltage and standard Vdroop), I established that I was linx/prime95 stable at x29 (3.87GHz), but with immediate linx death at x30 (4GHz). I was then able to get quite a bit better on fewer cores; apparently stable stock voltage multipliers were 29/29/30/30/31/32. I can single-thread linx or prime95 happily at 4.27GHz, stock voltage. Similarly, overvolting (Vcore 1.4125V, standard droop, CPU PLL 1.86V), I got stable with 32/32/33/33/34/35, so 6-core 4.27GHz, 1-core 4.67GHz.
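If you want to turn those multiplier lists into clock speeds, the arithmetic is just multiplier x BCLK. A quick Python sketch, assuming a nominal BCLK of exactly 133.33MHz (400/3) and that the lists above run from 6 active cores down to 1; the function names are mine, purely for illustration:

```python
BCLK_MHZ = 400 / 3  # nominal 133.33 MHz base clock (assumed exact)


def ghz(multiplier, bclk_mhz=BCLK_MHZ):
    """Core frequency in GHz for a given multiplier."""
    return multiplier * bclk_mhz / 1000


def speed_table(mults_6_to_1):
    """Map active-core count -> GHz, for multipliers listed 6 cores down to 1."""
    return {cores: round(ghz(m), 2)
            for cores, m in zip(range(6, 0, -1), mults_6_to_1)}


stock = speed_table([29, 29, 30, 30, 31, 32])        # stock voltage
overvolted = speed_table([32, 32, 33, 33, 34, 35])   # Vcore 1.4125V

if __name__ == "__main__":
    for cores in range(6, 0, -1):
        print(f"{cores} active: stock {stock[cores]} GHz, "
              f"overvolted {overvolted[cores]} GHz")
```

The 6-core and 1-core rows match the figures quoted above: 3.87/4.27GHz at stock voltage and 4.27/4.67GHz overvolted.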
I believe to some extent this arises naturally from the Vdroop. As the number of active cores decreases, current goes down, voltage goes up, so maximum speed increases. And obviously 1-core 4.67GHz is much less of an overheating problem than 6-core would be...
(Stability testing to establish fewer-core multipliers was using 1-4 threads in linx and prime95. I guess for full robustness I should shut off physical cores in the BIOS or msconfig to let me load 2 threads per core with fewer cores, but in practice the Windows scheduler should never be loading a second thread on a core until all 6 cores are active anyway. If you've activated a fewer-than-6-core multiplier, you're basically not hyperthreading. I've also not done corresponding testing with LLC compensation - my guess is that there is less to be gained.)
Next, with a 400MHz difference between 6 core and 1 core, you need to actually get into the fewer-core states as much as possible to keep the multiplier up. Again, the Windows scheduler doesn't seem to help - its tendency to spread tasks across all cores tends to push you towards lower multipliers with too many cores active. One tip I found for Windows 7 (maybe not needed for Windows 8+) was to mess with powercfg to unlock the "Processor performance core parking core override" setting, and disable it. When enabled (as by default) Windows 7 always keeps 1 thread per physical core unparked, which encourages the scheduler to spread work across all 6 cores, reducing your turbo multiplier. Once disabled, it will fully park some cores, reducing the number of cores that low-thread loads are spread across.
This seems to work in theory - at least it makes my Throttlestop average multipliers more like "34.6" rather than "34.1", for example - but whether it helps in real life scenarios is harder to say. Should we squeeze extra work into the 4th core, competing with its other users, or do we slow all 4 cores a bit and bring a 5th online for the new work? And I understand there can be latency effects of the parking/unparking, and many like to disable it.
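To show why time spent in fewer-core states matters for that average, here's a toy Python model. It assumes the average multiplier is simply the time-weighted mean of the turbo multiplier for however many cores are active, which is my guess at roughly what Throttlestop reports. The residency percentages are invented for illustration, not measurements.

```python
# Overvolted turbo multipliers from above, keyed by number of active cores.
TURBO = {6: 32, 5: 32, 4: 33, 3: 33, 2: 34, 1: 35}


def average_multiplier(residency):
    """Time-weighted mean multiplier.

    residency maps active-core count -> share of time (any unit,
    e.g. percent); shares are normalised by their total.
    """
    total = sum(residency.values())
    return sum(TURBO[cores] * share for cores, share in residency.items()) / total


# Hypothetical light load spread across more cores (override enabled).
spread = {1: 20, 2: 50, 3: 20, 4: 10}
# Hypothetical same load with cores fully parked (override disabled).
parked = {1: 70, 2: 20, 3: 10}

if __name__ == "__main__":
    print("spread:", round(average_multiplier(spread), 1))
    print("parked:", round(average_multiplier(parked), 1))
```

With these made-up numbers the parked case averages higher, simply because more time is spent in the 1-core state. It says nothing about the latency cost of parking and unparking.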
This little chart of all my stable speeds nicely illustrates the difference between a BCLK and a turbo multiplier overclock. The 920 and X5675 were BCLK overclocked, and their +2 or +3 turbo boost is fairly incidental, if you even bother to leave turbo on. But if you're pushing an unlocked chip without raising BCLK, like the W3690 here, the difference between base and turbo frequency becomes huge, and you really have to make sure you don't end up pootling at 3.5GHz.
Given the above, the W3690 does have one minor advantage over the cheaper W3680: its base multiplier is one step higher. So if for whatever reason the turbo does fail to engage, you're left on a faster base clock. That means you may often get better performance in "Balanced" or "Power saver" modes, and there may be a slight responsiveness benefit there. But once you've got the turbo engaged in "High performance" with overclocked multipliers, there's no real difference.
Another tip is to set a sensible TDP limit. You don't have to set it to "high enough to never lower the multiplier". Set it to "high enough to never limit in a real interactive application". Letting maniacs like linx or prime95 get power-limited seems reasonable, if it's going to stop your CPU frying and your fans going berserk. My tests showed that 150W was sufficient for sustained 4.27GHz in Cinebench, and stopped linx getting close to 90C, so I've left it there. (Just raising it when I want to stress it.)
Given all of this, a BCLK overclock is certainly more consistent for peak performance, but the multiplier overclock turns out to be great fun with just as many knobs. Maybe can't quite reach the heights of a BCLK overclock, but perhaps a better daily drive.