Thursday, January 31st 2019
Intel Officially Sinks the Itanic, Future of IA-64 Architecture Uncertain
Intel has unceremoniously, through a product change notification (PCN), discontinued the Itanium family of microprocessors. The Itanium 9700 "Kittson," released in 2017, is the final generation of Itanium, and according to the PCN its sales to new customers have stopped. The series has been marked "end of life" (EOL). Existing Itanium customers who have built their IT infrastructure around the Itanium 9700 series have an opportunity to determine their remaining demand for these processors and place their "last product discontinuance" (LPD) orders with Intel. The final LPD shipments would go out in mid-2021.
With this move, Intel has cast uncertainty over the future of the IA-64 instruction set architecture. IA-64 was originally conceived by Intel to replace 32-bit x86 at the turn of the century as an industry-standard 64-bit processor architecture. AMD laid the foundation for its rival standard AMD64, which went on to become x86-64. AMD64 won the battle for popularity over IA-64, as it maintained complete backwards compatibility with x86 and could seamlessly run 32-bit software, saving enterprises and clients billions in transition costs. Intel cross-licensed it as EM64T (Extended Memory 64 Technology), before standardizing on the name x86-64. Itanium dragged on for close to two decades serving certain enterprise and HPC customers.
Source:
Intel (PDF document)
54 Comments on Intel Officially Sinks the Itanic, Future of IA-64 Architecture Uncertain
You bought Compaq and killed Alpha :P
Now your Itanic has sunk.
Their execs were all having too many drug parties at Intel, apparently. You'd have to be higher than a weather balloon to invest in Itanium.
Killed PA-RISC and DEC Alpha to bring out the Itanic with Intel…
Even Microsoft saw a sinking ship and pulled out years ago.
Can't remember how many billions the Itanic cost.
And what percentage of the server and even workstation market they were meant to get.
Even before AMD64 they were behind target; then AMD64 came out and it was pretty much game over. Either way, Intel and HP created the Itanic…
Xeon: 20.15
Itanium: 8.82
Opteron: 11.52
Itanium: 13.10
That's per core. Itanium 2 is nothing to scoff at.
8-Core Itanium Poulson: 3.1 billion transistors
8-Core Xeon Nehalem-EX: 2.3 billion transistors
Interesting article about Poulson (newest Itanium architecture): www.realworldtech.com/poulson/
Itanium had 20% of the TOP500 supercomputers back in 2004. IA-64 gained traction because x86 lacked memory address space. x86-64 reversed that pattern because of backwards compatibility and not having to find Itanium software developers.
12 instructions per clock, 8 cores, and 16 threads at the end of 2012. It was a monster.
Yes, the actual address space support today is more like 40-bit (2^40 ~ 1 TB) or 52-bit (2^52 ~ 4.5 PB ~ 4500 TB) for physical and 48-bit for virtual (2^48 ~ 280 TB) not the full 64-bit, but moving that up is a fairly minor change in terms of architecture and it'll take a while until we exhaust the 64-bit address space (2^64 = 16 EB ~ 16.7 million TB).
64-bit needs the data path, ALUs (integer, which are used for address calculations), registers, and address and data buses to be 64-bit, which doubled almost everything in a CPU or a CPU core compared to 32-bit CPUs. Doubling all that again to 128-bit does not sound like something CPUs would benefit from - today and in general use. For an example of that, see what happened to Intel's FP units in terms of size, power and heat when they doubled from 128-bit to 256-bit for AVX2 in Haswell.
64-bit computing has nothing to do with 64-bit address width.
Physical Address Extension (PAE), for addressing beyond 4 GB, has been supported since the Pentium Pro (1995).
PAE is a workaround. It is often not enough and has downsides, not least of which is enabling support for it on every level.
PAE was supported on Windows, macOS (x86), Linux and all the major BSDs. 32-bit Windows 8 and 10 actually require PAE mode to run, so it's used much more than you think.
The reason why PAE is unknown to most people is that they switched to 64-bit OSes and hardware long before they hit the 4 GB limit.
I remember PAE very well. It needed support from the motherboard, BIOS, operating system and, depending on circumstances, the application. That was a lot of fun :)
Are you sure about 32-bit Windows 8 and 10 requiring PAE? They do support it and can benefit from it but I remember trying to turn on PAE manually on Windows 8 (and failing due to stupid hardware).
Just saying. IA-64 is good on paper, but performance relies completely on the implementation of both the software and the compiler, and the compiler is actually responsible for a lot more than one for x86-64 is.
Having to do multiple operations to access memory is of course a disadvantage, but not a huge one. I remember most recompiled software got like a 5-10% improvement, due to easier memory access and faster integer math combined. I haven't run any Windows in 32-bit since XP, but from what I've read, the NX bit requires it, and that's enabled on all modern operating systems for security reasons.
Nevertheless, I was one of the early adopters of 64-bit OSes, not because of memory, but because I wanted that extra 5-10% performance. Linux did have an extra advantage here, since the entire software libraries were made available in 64-bit almost immediately. And it was a larger uplift than many might think. Most 32-bit software (even on Windows) was compiled for the i386 ISA; yes, that means 80386-compatible features only. Some heavier applications were of course compiled with later ISA versions, but most software was not. Linux software also usually assumed SSE2 support along with "AMD64", so the difference could be quite substantial in edge cases. The problem from the compiler side is that the code, regardless of language, needs to be structured in a way that lets the compiler basically saturate these resources.
If you write even C/C++ code without such considerations, not even the best compiler imaginable can restructure the overall code to be efficient on Itanium.
This is basically the same problem we have with writing for AVX, and the reason why all efficient AVX code uses intrinsics, which is "almost" assembly. In theory, having many registers is beneficial. At the machine code level, x86 code does a lot of moving around between registers (which is usually completely wasted cycles; ARM does a lot more…). So having more registers (even if only at the ISA level) can eliminate operations and therefore be beneficial; I have no issues so far.
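To make "intrinsics" concrete, here is a minimal hypothetical C++ sketch (the function name is made up; it assumes an AVX-capable CPU and a compiler flag like -mavx):

#include <immintrin.h>

// Adds two float arrays eight elements at a time using AVX intrinsics.
// Assumes n is a multiple of 8 and both arrays are 32-byte aligned.
void add_avx(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);               // load 8 floats
        __m256 vb = _mm256_load_ps(b + i);               // load 8 floats
        _mm256_store_ps(out + i, _mm256_add_ps(va, vb)); // 8 adds in one instruction
    }
}

Each intrinsic maps more or less one-to-one to an AVX instruction, which is why this style is "almost" assembly.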
But keep in mind that ISA and microarchitecture are two different things. x86 at the ISA level is not superscalar, while every modern microarchitecture is. What registers a CPU has available on an execution port depends entirely on its microarchitecture, and it varies. And this is the sort of thing that a CPU is actually able to optimize within the instruction window.
Having too many general purpose registers on the microarchitecture level will get challenging, because it complicates the pipeline and is likely to introduce latency.
So to sum up, I'm all for having more registers on the ISA level, but on the microarchitecture level it should be entirely up to the designer. Current x86 designs have 4 (Skylake) / 4+2? (Zen) execution ports for ALUs etc., plus vector units. As this increases in the future, I would expect improvements on the ISA level to help simplify the work for the front-end in the CPUs.
For example, you know what happens with x86/x86-64 when it tries to speed up code execution? It tries to predict whether a code path will be taken and executes it ahead of time using idle resources. The problem is that if the prediction turns out to be wrong, the pipeline has to be flushed and new instructions brought in. You know what Itanium does/did? It doesn't try to predict anything; it executes both branches of a conditional statement and picks whichever is needed when the time comes.
Intel wasn't nuts in coming up with Itanium. It's just that everybody chose x86-64 instead.
One of the fundamental problems for CPUs is that the CPU has less context than the author of the code does. You might, for example, write a function where only the value of a variable is determined by one or more conditionals while the remaining control flow is unchanged, so you know those branches are harmless. But by the time the code is turned into machine code this information is usually lost; all the CPU sees is calculations, conditionals, access patterns etc. within just a tiny instruction window. There are a few cases where a compiler can optimize certain code into operations like a conditional move, which eliminates what I call "false branching", but usually compiler optimizations like this require the code to be in a very specific form to be detected correctly, unless the coder uses intrinsics. This is an area where x86 could improve a lot, with of course some changes in compilers and coding practices to go along with it.
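As a small illustration of "false branching" versus a conditional move, a hypothetical C++ sketch (function names are made up):

// Branchy version: the predictor must guess the comparison for every
// element; on random data it mispredicts about half the time.
long sum_positive_branchy(const int* v, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
        if (v[i] > 0) sum += v[i];
    return sum;
}

// Branch-free version: the condition becomes data instead of control flow.
// Compilers typically turn this into a conditional move (cmov), so there
// is nothing to mispredict.
long sum_positive_branchless(const int* v, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += v[i] > 0 ? v[i] : 0;
    return sum;
}

Both compute the same result; the difference is only in how the conditional reaches the CPU.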
Ultimately, code execution comes down to cache misses and branching, and dealing with these in a sensible manner will determine the speed of the code, regardless of programming language. There is not going to be a wonderful new CPU ISA which solves this automatically. Unfortunately, most code today consists of more conditionals, function calls and random access patterns than code which actually does something, and code like this will never be fast.
Bit like tessellation in R9800 Pro. :)
It's a done deal though; how it happened only matters to historians and to future business decisions.
Also keep in mind that the CPU can't see beyond a memory access until it's dereferenced, and the same goes for any memory access with a data dependency, like the pointer chain sketched below.
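A minimal hypothetical C++ sketch of such a chain (the type and names are made up):

struct Node { int value; Node* next; };

// Each iteration depends on the previous load: the CPU cannot even begin
// fetching node->next until node itself has arrived from memory, so every
// hop down the list can be a full cache miss.
long sum_list(const Node* node) {
    long sum = 0;
    while (node != nullptr) {
        sum += node->value;  // usually free: same cache line as the pointer
        node = node->next;   // data dependency: next address unknown until now
    }
    return sum;
}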
The CPU will try to execute these out-of-order as early as possible, but that only saves a few clock cycles of idling. The cost of a cache miss is up to ~400 clocks on Skylake, and a misprediction costs up to 19 cycles for the flush plus any delays from fetching the new instructions, which can even be an instruction cache miss if it's a long jump! The instruction window for Skylake is 224 entries, and I believe it can decode around 6 instructions per cycle, so it doesn't take long before it's virtually "walking blindly". And as you can see, even a single random memory access can't be found in time to prefetch it, and often there are multiple data dependencies in a chain, leaving the CPU stalled most of the time. The only memory accesses it can do ahead of time without a stall are linear accesses, where it guesses beyond the instruction window.
Something as simple as a function call or a pointer dereference will in most cases cause a cache miss. The same goes for innocent-looking conditionals, like the sketch below: put something like that inside a loop and you'll kill performance very quickly. Even worse are function calls with inheritance in OOP; while it might suit your coding desires, doing it in a critical part of the code can easily make a performance difference of >100×. ;)
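For instance, an innocent-looking conditional of this kind (hypothetical C++ sketch):

#include <cstddef>

// If the input is effectively random, this branch mispredicts roughly every
// other iteration, paying the pipeline flush penalty each time through the loop.
std::size_t count_odd(const int* data, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] & 1)  // unpredictable, data-dependent branch
            ++count;
    }
    return count;
}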