Monday, January 22nd 2024
Intel 15th-Generation Arrow Lake-S Could Abandon Hyper-Threading Technology
Leaked Intel documentation we reported on a few days ago covered the Arrow Lake-S platform and some implementation details. However, there was an interesting catch in the file. The leaked document indicates that the upcoming 15th-Generation Arrow Lake desktop CPUs could lack Hyper-Threading (HT) support. The technical memo lists Arrow Lake's expected eight performance cores without any additional threads enabled via SMT. This aligns with previous rumors of Hyper-Threading removal. Losing Hyper-Threading could significantly impact Arrow Lake's multi-threaded application performance versus its Raptor Lake predecessors. Estimates suggest HT provides a 10-15% speedup across heavily-threaded workloads by enabling logical cores. However, for gaming, disabling HT has negligible impact and can even boost FPS in some titles. So Arrow Lake may still hit Intel's rumored 30% gaming performance targets through architectural improvements alone.
However, a replacement for the traditional HT is likely to come in the form of Rentable Units. This new approach is a response to the adoption of a hybrid core architecture, which has seen an increase in applications leveraging low-power E-cores for enhanced performance and efficiency. Rentable Units are a more efficient pseudo-multi-threaded solution that splits the first thread of incoming instructions into two partitions, assigning them to different cores based on complexity. Rentable Units will use timers and counters to measure P/E core utilization and send parts of the thread to each core for processing. This inherently requires larger cache sizes, and Arrow Lake is rumored to have 3 MB of L2 cache per core. Arrow Lake is also noted to support faster DDR5-6400 memory. But between higher clocks, more E-cores, and various core architecture updates, raw throughput metrics may not change much without Hyper-Threading.
Source:
3DCenter.org
100 Comments on Intel 15th-Generation Arrow Lake-S Could Abandon Hyper-Threading Technology
Editing raw/lossless video is the only time when it should be acceptable to give up CPU resource on video playback.
Given that software is open-source, and that schedulers for big.LITTLE remain poor in practice (even if they theoretically can be fixed), there's a lot of sense in waiting for the future algorithms to be implemented, rather than creating a big.LITTLE chip prematurely.
-----------
It's not even clear if big.LITTLE will be a better plan than SMT (2 threads per core, or even IBM-style 4 threads per core or 8 threads per core). SMT, as you point out, is symmetrical. All cores can be treated as the same and equivalent, which greatly eases any scheduling algorithm.
big.LITTLE remains a big deal however, because the nature of multithreading vs single-threaded applications naturally lines up to big cores vs little-cores. Long-running background tasks tend to need to be low-latency, and a full core on very low power like a LITTLE-core, is the ideal processor. Short high-speed video game / number-crunching user-interface spiffyness relies upon big-cores however. Apple M1 and Android UIs need surprising amounts of compute to remain responsive in all scenarios. But how do we write an operating-system scheduler that can automatically determine which threads or processes should go on which cores?
OTOH, there's an equally compelling argument that Intel should be providing a "scheduling driver" for their CPUs, that replaces the default one used by Windows. How much work it would be to update the NT kernel to support that level of flexibility, is an unknown, but given how long that kernel has been around and how much tech debt it's built up, I'd say "a lot". Which leads back to my first point about Microsoft not caring anymore.
Programmers in Linux and Windows land both have access to "Core Affinity" flags. (learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setprocessaffinitymask) and (man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html). Modern programmers likely will have to pay attention to affinity more often. Those without affinity will default to LITTLE cores (which are more plentiful, and will encourage programmers to pick big-cores when they need it).
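For the Linux side of those affinity calls, here is a minimal sketch of pinning the calling thread to one logical CPU. It uses sched_getaffinity / sched_setaffinity with pid 0 (the calling thread), which are the system calls that pthread_setaffinity_np is built on according to its man page. The helper names first_allowed_cpu and pin_current_thread are made up for this example, and nothing here tells big cores from little cores; it only shows the mechanics.

```cpp
// Sketch only: Linux/glibc thread pinning. Helper names are hypothetical.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

// Returns the lowest-numbered logical CPU the calling thread may run on,
// or -1 if the affinity mask could not be read.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            return cpu;
        }
    }
    return -1;
}

// Restricts the calling thread to the single logical CPU `cpu`.
// Returns true on success, false on a bad index or a kernel refusal.
bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        return false;
    }
    cpu_set_t target;
    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    return sched_setaffinity(0, sizeof(target), &target) == 0;
}
```

A typical call would be pin_current_thread(first_allowed_cpu()). Starting from the allowed set rather than hard-coding CPU 0 matters inside containers or cgroups, where CPU 0 may not be available to the process.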
Or... something like that. On the other hand, maybe this would just cause a bunch of programmers to copy/paste code to always use big-cores and that's also counter-productive. Hmmmmmmmmm. It's almost an economics game: the OS-writers don't really control the application-programmers. (Ex: Windows developers are unable to control video-game programmers from Call of Duty or whatever). But a system needs to be set up so that the resources of the computer are optimized.
-------
Given how things are right now, a scheduler is needed. But I'm not 100% convinced that a scheduler is the right solution overall.
>>...
>>... Estimates suggest HT provides a 10-15% speedup across heavily-threaded workloads by enabling logical cores...
No and No for both cases. Period.
I regret to see that speculation regarding the performance of HTT-enabled processing continues. Unfortunately, there is still a misunderstanding of how HTT actually works!
Please take a look at a Video Technical Report:
Intel Hyper Threading Technology and Linpack Benchmark ( VTR-015 )
which I've published in May 2019. Take a look at Slide 19 and Slide 20 ( performance data and graphs for LINPACK tests ).
There are also performance data for matrix multiplication algorithms, Intel MKL vs. Strassen for Single-precision ( 24-bit ) and Double-precision ( 53-bit ).
I'd like to repeat that Peak Processing Power of HTT-enabled applications is achieved when only one, and Only One, of the two Logical Processors of a Physical Core is used.
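To make the "one Logical Processor per Physical Core" rule concrete, here is a small sketch that builds an affinity mask selecting only the first hardware thread of every core. It ASSUMES that SMT siblings have adjacent logical indices (0/1, 2/3, ...), which is not guaranteed; on Windows the real mapping can be read with GetLogicalProcessorInformation, and on Linux siblings are often numbered differently. The function name is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Builds a 64-bit affinity mask with exactly one logical CPU per physical core.
// ASSUMPTION: the hardware threads of one core occupy adjacent logical indices,
// so core k owns logical CPUs [k * threads_per_core, (k + 1) * threads_per_core).
std::uint64_t one_thread_per_core_mask(unsigned logical_cpus,
                                       unsigned threads_per_core) {
    assert(threads_per_core >= 1);
    assert(logical_cpus <= 64);  // a single 64-bit mask covers one processor group
    std::uint64_t mask = 0;
    for (unsigned cpu = 0; cpu < logical_cpus; cpu += threads_per_core) {
        mask |= (std::uint64_t{1} << cpu);  // keep only the first sibling
    }
    return mask;
}
```

For a quad-core CPU with two threads per core, one_thread_per_core_mask(8, 2) gives 0x55, i.e. logical CPUs 0, 2, 4 and 6.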
As CPU frontends have become vastly more efficient, the relative potential has decreased. Along with ever more complex and wide CPU designs, security concerns and added complexity have made HT ever more costly to implement and maintain. It has long been overdue for a replacement, or just dropping it outright. These are development resources and die space which could be better spent. Only indirectly, in terms of how many threads are spawned, etc. Depends a lot on the workloads. SMT (HT) does wonders for some, not for others, and can sometimes introduce a lot of latency too.
We have to remember that a new design without HT wouldn't be the same as turning HT off on an old design. This would mean Intel could have prioritized a lot of resources on other features, either a replacement or other design considerations. So unless they screwed up*, there will very likely be new benefits from dropping HT.
*) With large overhauls there is a higher risk of unforeseen problems that lead to delays, or even to disabled features. Assuming Arrow Lake launches late this summer or fall, then the entire design was completed by summer 2023 (tape out), and the main design long before that.
So I think it's safe to assume that Intel and their trusted partners know which features are coming. ;) A little question on the side;
Do you have experience with the best approach to query each thread to find out various capabilities (SMT, big or little, and any future differences)? Is the CPUID instruction the preferred choice, or OS APIs?
I know that there's various NUMA-code to query the capabilities of sets-of-cores. learn.microsoft.com/en-us/windows/win32/procthread/numa-support . I expect Linux to have similar APIs though named differently of course. The main issue is that NUMA is about memory differences, not core-differences. So the focus of NUMA APIs is closer to malloc/free. (Yes, some bits of memory are on Core#1 or Core#50... but NUMA has a focus on memory). With big.LITTLE, there's newer APIs that handle the different cores but I don't know them quite as well.
For HPC, a common pattern I've seen is to have a startup-benchmark routine, where you attempt different strategies and perform a bit of self-optimizing / self-parameterization. So in practice, even the highest-performance programs ignore a lot of this, just collect practical data, and then self-tune around the problem.
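A rough sketch of that startup-benchmark pattern: time each candidate strategy a few times at launch and keep the one with the best observed run. The function name pick_fastest and the idea of passing strategies as callables are choices made for this illustration, not something taken from a particular HPC code base.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs every candidate `repeats` times and returns the index of the candidate
// with the lowest single-run wall time. Returns candidates.size() (i.e. 0)
// when the list is empty.
std::size_t pick_fastest(const std::vector<std::function<void()>>& candidates,
                         int repeats = 3) {
    using clock = std::chrono::steady_clock;
    std::size_t best_index = candidates.size();
    clock::duration best_time = clock::duration::max();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        for (int r = 0; r < repeats; ++r) {
            const auto start = clock::now();
            candidates[i]();  // e.g. the same kernel with a different thread count
            const auto elapsed = clock::now() - start;
            if (elapsed < best_time) {
                best_time = elapsed;
                best_index = i;
            }
        }
    }
    return best_index;
}
```

Taking the minimum over a few repeats rather than the average is a deliberate choice here: it is less sensitive to one-off interference from other processes, which fits the "just collect practical data" spirit of the approach.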
Here is a piece of test-code I used for Windows ( any version that supports the Get/Set ThreadAffinity API ) in order to control thread affinity:
...
SYSTEM_INFO si = { 0 };
::GetSystemInfo( &si );

RTBOOL bRc = RTFALSE;
RThandle hProcess = RTnull;
RThandle hThread = RTnull;
RTulong dwProcessMask = 0;
RTulong dwSystemMask = 0;
RTulong dwThreadAM = 0;
RTulong dwThreadAMPrev = 0;
RTulong dwThread1PrefferedCPU = 0;
DWORD dwErrorCode = 0;

hProcess = SysGetCurrentProcess();
hThread = SysGetCurrentThread();
bRc = ::GetProcessAffinityMask( hProcess, ( PDWORD_PTR )&dwProcessMask, ( PDWORD_PTR )&dwSystemMask );

RTint iCpuNum = ( 8 - 1 );
RTint iThreadAffinityMask = _RUN_ON_CPU_08;    // Default Logical CPU 07 at the beginning of Verification
                                               // Take into account that Logical CPUs are numbered from 0

dwThreadAMPrev = ::SetThreadAffinityMask( hThread, iThreadAffinityMask );
SysSleep( 0 );
dwErrorCode = SysGetLastError();
CrtPrintf( RTU("\t\tSwitched to Logical CPU%d - Previous Thread AM: %3d - Error Code: %3d\n"),
           iCpuNum,
           dwThreadAMPrev, dwErrorCode );

// Busy loop to keep the selected Logical CPU loaded
for( RTuint i = 0; i < ( ( RTuint )16777216 * 128 ); i += 1 )
{
    volatile RTfloat fX = 32.0f;
    fX = ( RTfloat )i * ( fX * 2 ) * ( fX * 4 ) * ( fX * 8 );
}
SysSleep( 5000 );

// Walk through Logical CPUs 0..7, one affinity bit at a time
iCpuNum = 0;
for( iThreadAffinityMask = 1; iThreadAffinityMask < 256; iThreadAffinityMask *= 2 )
{
    iCpuNum++;
    dwThreadAMPrev = ::SetThreadAffinityMask( hThread, iThreadAffinityMask );
    SysSleep( 0 );
    dwErrorCode = SysGetLastError();
    CrtPrintf( RTU("\t\tSwitched to Logical CPU%d - Previous Thread AM: %3d - Error Code: %3d - Thread Affinity: %3d\n"),
               ( iCpuNum - 1 ),
               dwThreadAMPrev, dwErrorCode, iThreadAffinityMask );

    for( RTuint i = 0; i < ( ( RTuint )16777216 * 128 ); i += 1 )
    {
        volatile RTfloat fX = 32.0f;
        fX = ( RTfloat )i * ( fX * 2 ) * ( fX * 4 ) * ( fX * 8 );
    }
    SysSleep( 5000 );
}
...
It is very easy to get the Time Stamp Counter value for a Logical CPU using the RDTSC instruction:
...
// Test-Case 3 - Retrieving RDTSC values for Logical CPUs
{
    CrtPrintf( RTU("\n\tTest-Case 3 - Retrieving RDTSC values for Logical CPUs - 1\n") );

    RTBOOL bRc = RTFALSE;
    RThandle hProcess = RTnull;
    RThandle hThread = RTnull;
    RTulong dwProcessMask = 0;
    RTulong dwSystemMask = 0;
    RTulong dwThreadAM = 0;
    RTulong dwThreadAMPrev1 = 0;
    RTulong dwThreadAMPrev2 = 0;
    RTulong dwThread1PrefferedCPU = 0;
    ClockV cvRdtscCPU1 = { 0 };    // RDTSC Value for Logical CPU1
    ClockV cvRdtscCPU2 = { 0 };    // RDTSC Value for Logical CPU2

    while( RTtrue )                // Single pass; 'break' is used to leave early on errors
    {
        hProcess = SysGetCurrentProcess();
        hThread = SysGetCurrentThread();

        bRc = ::GetProcessAffinityMask( hProcess, ( PDWORD_PTR )&dwProcessMask, ( PDWORD_PTR )&dwSystemMask );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ GetProcessAffinityMask ] failed\n") );
            break;
        }

        bRc = SysSetPriorityClass( hProcess, REALTIME_PRIORITY_CLASS );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ SetPriorityClass ] failed\n") );
            break;
        }

        bRc = SysSetThreadPriority( hThread, THREAD_PRIORITY_TIME_CRITICAL );
        if( bRc == RTFALSE )
        {
            CrtPrintf( RTU("\t\tError: [ SetThreadPriority ] failed\n") );
            break;
        }

        // Read the TSC while pinned to the first Logical CPU
        dwThreadAMPrev1 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_01 );
        SysSleep( 0 );
        cvRdtscCPU1.uiClockV = __rdtsc();

        // Switch to the second Logical CPU and read the TSC again
        dwThreadAMPrev2 = ::SetThreadAffinityMask( hThread, _RUN_ON_CPU_02 );
        SysSleep( 0 );
        cvRdtscCPU2.uiClockV = __rdtsc();

        // Restore normal priorities before printing
        SysSetPriorityClass( hProcess, NORMAL_PRIORITY_CLASS );
        SysSetThreadPriority( hThread, THREAD_PRIORITY_NORMAL );

        CrtPrintf( RTU("\t\tRDTSC for Logical CPU1 : %.0f\n"), ( RTfloat )cvRdtscCPU1.uiClockV );
        CrtPrintf( RTU("\t\tRDTSC for Logical CPU2 : %.0f\n"), ( RTfloat )cvRdtscCPU2.uiClockV );
        CrtPrintf( RTU("\t\tRDTSC Difference: %.0f ( RDTSC2 - RDTSC1 )\n"),
                   ( RTfloat )( cvRdtscCPU2.uiClockV - cvRdtscCPU1.uiClockV ) );
        CrtPrintf( RTU("\t\tdwThreadAMPrev1 : %3d ( Processing Error if 0 )\n"), dwThreadAMPrev1 );
        CrtPrintf( RTU("\t\tdwThreadAMPrev2 : %3d ( Processing Error if 0 )\n"), dwThreadAMPrev2 );
        break;
    }
}
...
Sleep( 5000 ); // 5 seconds
...
It is used to see a Logical Processor switch in Windows Task Manager. Please take a look at:
Time Stamp Counters of Logical CPUs on a Multi-Core Computer System with Windows 7 ( VTR-184 )
in order to see how Logical Processors are switched during real time test processing.
I'd like to mention one more thing: it is Very Important to call Sleep( 0 ) after the switch is done, because a couple of hundred nanoseconds are needed to make a real physical switch, for example, from Logical Processor 1 to Logical Processor 2.
If CPUs continue to become more and more diverse, having generic and reliable ways to determine capabilities would be necessary, to scale well across generations as well as anything from low-end CPUs to high-end or multi-CPUs. Certain software could be very sensitive, and this could ultimately impact end user's purchasing choices.
For HPC users running configurable or even custom software, I can imagine doing even manual calibration would be desirable, they usually don't run on 1000 different hardware configurations. :) Thanks. I've saved that for later. :)
I haven't had time to dive into handling "hybrid" CPU designs yet, but probably I have to eventually.
P.S. you might want to throw a spoiler tag around that code. ;)
Imagine you have 2 threads munching around on a hyperthreaded CPU. It is imperative that you put them onto different real cores. That scheme breaks down if you have more than one application on the machine, even if you just have multiple instances of the same program. You can't have each instance grab all the premium CPU resources.
When there are conflicting workloads there needs to be a separate authority to distribute compute resources, including access to premium cores. That is the OS scheduler.
I have personally experimented with deliberate core placement (in a server-class application) and I always gave up because the OS scheduler did a better job as far as overall system throughput is concerned.
But the register files, reorder-buffer, L1 / L2 data-cache, vector-units, pipelines, and more are all shared / recycled. That's more than 80% of the core. The remaining 20% (decoder, branch prediction, L1-code cache) is still needed even for single-thread-per-core applications, but obviously needs to be bigger for a Hyperthread.
Part of the problem here is that Windows, which supposedly has the most advanced integration with Intel's "thread director", is not open source. We can see the Linux scheduler, but they are fiddling so much with it just in the last 2 months that it is hard to see what is going on. Yeah, but how much bigger is a current P-core than a current E-core? I thought it is a factor of 10. So the math might still work out in favor of E-cores instead of HT - die-space wise.
I'd estimate 5x E-cores vs 1x P-Core, just spitballing by looking at this image. A lot of this is because 768kB of L2 cache per E-core (3MB shared between 4x E-cores) is just naturally going to be smaller than 2MB per P-core
You're right that this is a magnitude smaller than I thought, though not quite at the 1-to-10 odds like you initially assumed.
That's why I think it's mainly useful to know the amount of different classes of resources; like P-cores and E-cores, and whether these have SMT or not. And for non-x86, whether SMT is 2-way, 4-way or 8-way, or if there are more exotic core configurations (aren't there ARM designs with three different cores?). Assuming all "threads" are equal can result in sub-optimal performance in synchronous workloads.
As we all know, no piece of code will scale perfectly under all circumstances, but at least it will be useful to have some kind of feature detection so an application/game doesn't completely "sabotage" itself if Intel or AMD releases a new "unusual" P-core/E-core mix. :)
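As a sketch of what such feature detection can look like on x86 via CPUID: Intel documents a Hybrid flag in CPUID leaf 07H (sub-leaf 0, EDX bit 15) and a core-type field in leaf 1AH (EAX bits 31..24, 20H for Atom/E-core, 40H for Core/P-core). The decoding below is kept separate from the actual CPUID read so it compiles anywhere; the read itself uses __get_cpuid_count from GCC/Clang's cpuid.h and is guarded accordingly. The enum and function names are made up for this example, and an OS API is likely the better choice when one is available.

```cpp
#include <cstdint>

enum class CoreType { Unknown, Efficiency, Performance };

// CPUID.(EAX=07H, ECX=0):EDX bit 15 is the Hybrid flag.
inline bool is_hybrid(std::uint32_t leaf7_edx) {
    return ((leaf7_edx >> 15) & 1u) != 0;
}

// CPUID.1AH:EAX bits 31..24 hold the core type of the current logical CPU.
inline CoreType decode_core_type(std::uint32_t leaf1a_eax) {
    switch (leaf1a_eax >> 24) {
        case 0x20: return CoreType::Efficiency;   // Intel Atom (E-core)
        case 0x40: return CoreType::Performance;  // Intel Core (P-core)
        default:   return CoreType::Unknown;
    }
}

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
// Reports the core type of whichever logical CPU the calling thread is on
// right now. Pin the thread first, otherwise the scheduler may move it and the
// answer describes a different core. Returns Unknown on non-hybrid CPUs,
// where all cores are of the same kind.
inline CoreType current_core_type() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || !is_hybrid(edx)) {
        return CoreType::Unknown;
    }
    if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx)) {
        return CoreType::Unknown;
    }
    return decode_core_type(eax);
}
#endif
```

Because CPUID describes only the core it executes on, a full inventory means pinning a thread to each logical CPU in turn and calling current_core_type() there.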
It means that, for HPC and Floating-Point-arithmetic-based processing, the Floating Point Unit ( FPU ) needs to be used by just one thread (!). This is because there is just one FPU in a core and it is shared between Logical Processors.
For example, for Intel Xeon Phi processors with 64 cores and 4 hardware threads per core ( 256 Logical Processors ) only one thread per core needs to be used to achieve Peak Processing Power. I've verified that rule on an Intel Xeon Phi Processor 7210 and here are its specs:
ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name: Intel Xeon Phi 7210
Packages ( sockets ): 1
Cores: 64
Processors ( CPUs ): 256
Cores per package: 64
Threads per core: 4
Peak Processing Power: 2.662 TFLOPs
Calculated as follows: 1.30 * 64 * ( 512-bit / 32-bit ) * 2
Note: Single-Precision ( 23-bit ) data type
For a Quad-core processor with two hardware threads per core, a bar in the Windows Task Manager usually reaches ~98%-99% when one hardware thread per core is used.
When thread affinity control is Not used, the total sum may Not add up to 100% because of the Non-Deterministic nature of Non-Real-Time Operating Systems.
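The Peak Processing Power figure quoted above for the Xeon Phi 7210 can be reproduced with a few lines of arithmetic. The sketch below simply encodes the formula as posted; the post does not say what the trailing factor of 2 stands for, so the parameter is named generically, and the function name is made up for this example.

```cpp
// Theoretical peak in GFLOP/s, following the formula as posted:
//   clock (GHz) * cores * (vector width / element width) * ops_factor
// `ops_factor` is the unexplained trailing "* 2" from the post.
constexpr double peak_gflops(double clock_ghz, int cores, int vector_bits,
                             int element_bits, int ops_factor) {
    return clock_ghz * cores * (vector_bits / element_bits) * ops_factor;
}

// Values from the post: 1.30 GHz, 64 cores, 512-bit vectors, 32-bit elements.
constexpr double kXeonPhi7210Gflops = peak_gflops(1.30, 64, 512, 32, 2);
```

With those inputs the result is 2662.4 GFLOP/s, which matches the 2.662 TFLOPs in the spec list. Note that the thread count per core does not appear anywhere in the formula.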
Edit: skip to 5 min for the full facepalm. Dude renamed the video after folks started correcting him in the comments.
Microsoft sits at something like 99% market share in the enterprise sector.
Linux dominates servers, yeah, but on the desktop Linux is at a sub-1% market share; even macOS is much higher, but Windows dominates with 90-95% or so.
I use Arch Linux and Debian for my own servers, but for desktop I'd not be touching it for sure.