Tuesday, January 5th 2021
AMD Applies for CPU Design Patent Featuring Core-Integrated FPGA Elements
AMD has applied for a United States Patent that describes a CPU design with FPGA (Field-Programmable Gate Array) elements integrated into its core design. Titled "Method and Apparatus for Efficient Programmable Instructions in Computer Systems", the patent application describes a CPU with FPGA elements inscribed into its very core design, where the FPGA elements actually share CPU resources such as registers for floating-point and integer execution units. This patent undoubtedly comes in the wake of AMD's announced Xilinx acquisition plans, and brings FPGA and CPU marriages to a whole other level. FPGA,as the name implies, are hardware constructions which can reconfigure themselves according to predetermined tables (which can also be updated) to execute desired and specific functions.
Intel have themselves already shipped a CPU + FPGA combo in the same package; the company's Xeon 6138P, for example, includes an Arria 10 GX 1150 FPGA on-package, offering 1,150,000 logic elements. However, this is simply a CPU + FPGA combo on the same substrate; not a native, core-integrated FPGA design. Intel's product has severe performance and latency penalties due to the fact that complex operations performed in the FPGA have to be brought out of the CPU, processed in the FPGA, and then its results have to be returned to the CPU. AMD's design effectively ditches that particular roundabout, and should thus allow for much higher performance.Some of the more interesting claims in the patent application are listed below:
This would also allow AMD to trim the CPU of the "dark silicon" that is currently present - essentially, highly specialized hardware acceleration blocks that sit idly, as a waste of die space, when not in use. The bottom line is this: CPUs with lower die space reserved for highly specialized operations, thus with more die area available for other resources (such as more cores), and with integrated, per-core FPGA elements that would on-the-fly reconfigure themselves according to processing needs. And if there are no exotic operations required (such as AI inferencing and acceleration, AVX (for example), video hardware acceleration, or other workloads, then the FPGA elements can just be reconfigured to "turbo" the CPU's own floating point and integer units, increasing available resources. An interesting patent application, for sure.
Sources:
Free Patents Online, Reddit user @ Marakeshmode, Hot Hardware
Intel have themselves already shipped a CPU + FPGA combo in the same package; the company's Xeon 6138P, for example, includes an Arria 10 GX 1150 FPGA on-package, offering 1,150,000 logic elements. However, this is simply a CPU + FPGA combo on the same substrate; not a native, core-integrated FPGA design. Intel's product has severe performance and latency penalties due to the fact that complex operations performed in the FPGA have to be brought out of the CPU, processed in the FPGA, and then its results have to be returned to the CPU. AMD's design effectively ditches that particular roundabout, and should thus allow for much higher performance.Some of the more interesting claims in the patent application are listed below:
- Processor includes one or more reprogrammable execution units which can be programmed to execute different types of customized instructions
- When a processor loads a program, it also loads a bitfile associated with the program which programs the PEU to execute the customized instruction
- Decode and dispatch unit of the CPU automatically dispatches the specialized instructions to the proper PEUs
- PEU shares registers with the FP and Int EUs.
- PEU can accelerate Int or FP workloads as well if speedup is desired
- PEU can be virtualized while still using system security features
- Each PEU can be programmed differently from other PEUs in the system
- PEUs can operate on data formats that are not typical FP32/FP64 (e.g. Bfloat16, FP16, Sparse FP16, whatever else they want to come up with) to accelerate machine learning, without needing to wait for new silicon to be made to process those data types.
- PEUs can be reprogrammed on-the-fly (during runtime)
- PEUs can be tuned to maximize performance based on the workload
- PEUs can massively increase IPC by doing more complex work in a single cycle
This would also allow AMD to trim the CPU of the "dark silicon" that is currently present - essentially, highly specialized hardware acceleration blocks that sit idly, as a waste of die space, when not in use. The bottom line is this: CPUs with lower die space reserved for highly specialized operations, thus with more die area available for other resources (such as more cores), and with integrated, per-core FPGA elements that would on-the-fly reconfigure themselves according to processing needs. And if there are no exotic operations required (such as AI inferencing and acceleration, AVX (for example), video hardware acceleration, or other workloads, then the FPGA elements can just be reconfigured to "turbo" the CPU's own floating point and integer units, increasing available resources. An interesting patent application, for sure.
24 Comments on AMD Applies for CPU Design Patent Featuring Core-Integrated FPGA Elements
This is the future, Full CPU with on the fly reprogramable units. They "morph" in the shape that is the most efficient for the calculation that needs to be done. Its as if you have a billion cpus into one. This is the future for sure.
Imagine having a 512 or 1024 core on 0.5 nanometer cpu like this, would blast curent high end cpus like the 5950x into oblivion like they were nothing. In all kind of workloads.
Can those "FPGA parts" emulate old stuff? Like 32bit code? I mean, that could probably help AMD clean up all the old stuff in their designs, that is there for compatibility perposes and help streamline their future cores?
In summary FPGAs are much more flexible than dedicated hardware while generally faster than performing a function in software. They're sort of a middle ground.
software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html
software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions/intrinsics-for-intel-advanced-matrix-extensions-amx-tile-instructions.html
software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions/intrinsic-for-intel-advanced-matrix-extensions-amx-bf16-instructions.html
software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions/intrinsics-for-intel-advanced-matrix-extensions-amx-int8-instructions.html
Also, perhaps we could actually see improved performance on tasks performed by fixed-function hardware. I suppose in theory, if one can shave 3x 20mm2 (pulling this out of my proverbial, as an example) fixed function hardware for three specific tasks, and replace those with 60 mm2 FPGA, perhaps those 60 mm2 FPGA resources will be faster at executing one of those tasks than their previous 20mm2 fixed-function hardware?
* Manufacturing tech. For example, the process for making CPUs is significantly different from those for DRAM and NAND - two of those can't be combined on the same die efficiently. So, is the process that produces the best CPU logic also the best for FPGA logic?
* Can a CPU really make full advantage of flexible execution units if all other parts remain fixed, like decode logic, out-of-order logic, etc.? For 32-bit emulation, I believe that programmable decoders would be the key, not programmable EUs.
* Context switching might require reprogramming the FPGA logic every time, or often, depending on the load. How much time does it take? Not very good if it's many microseconds.
* The poor guy that writes highly optimized C/C++ code, will he (or she) have to become a VHDL expert too? (or will that be left up to the other poor guy, the on that maintains the optimizing compiler?)
If all this ever sees the light of day, I imagine AMD will make various FPGA functions available as downloads, for a fee of course. Maybe they won't let just anyone develop or sell new ones.
CPU: you could help me
FPGA: eh?
CPU: Lazy bum!
FPGA: Ok fine here, Haz some Red Bull, it'll make you run faster.
CPU: I dont have legs!
FPGA: better I stick a Falcon 9 up your rear?
CPU: Zoom! ZOOM!
is this what it seems like, with AMD making their CPU's hardware functions reprogammable so they can just change the damn architecture and feature set on a whim?
Of these 3 part of a CPU, they want to be able to use these programable transitors to replace 2 of them.
The legacy support (to support old x86 apps that use instruction that modern apps no longer use.).
Accelerators like AI, Image processing, video decoding/encoding, Encryption/Decryption, etc.
The core of the CPU will remain normal static transitors.
But the goal would be to use the space saved as they won't require as much space and transitors for many legacy instruction/code or accelerators for something that might provide better performance. Like more core, larger cores, more cache, etc...
I guess this is how AMD is seeing things like AVX512, AI and other stuff, just put an FPGA there. Developers can just program it and do their magic. But I don't know how much it can do over specialised ASICS like how Intel is doing with AVX512, and how the FPGA will work with multiple applications each trying to do its own thing. The patent I saw was like each x86 core has a small FPGA beside it and both share resources (like how x86 core has integer and FP units, now we will have an FPGA unit as well). So each core can have their own FPGA and each core can program it's FPGA to do specific task (or combine more than one core with their FPGA to have more FPGA power).
Maybe int he future, a single x86 core na have multiple FPGA execution units like how they do with integer and FP units. And maybe AMD can differentiate Server and consumer Zen Core dies buy how many FPGA units per core/die as I don't think consumers will need that much in the near future.
Far as the FPGA matter is concerned I feel AMD full well intends to expand FPGA power over time with refinement and also use it to help redesign maximize what's ideal in terms of fixed function instruction sets to keep in place and which could be shifted away from fixed function instructions per core to FPGA silicone die space instead of lesser importance instruction set algorithms and other chip aspects. It could lead to something like a chiplet with 8 cores within it and each potentially could have it's own unique fixed instruction set the saved space on the rest used to FPGA space that's programmable. It could lead to a chip where the first core has some FPGA parts and all instruction sets you'd want and the next core drops a instruction down the line per additional core and replaces that instruction set space for additional FPGA space. That would enable a fair degree of programmable micro adjustments a lot like precision boost with voltages. How they work out which ways Windows Task Scheduler handles it is another matter, but it'll work itself out over time I'm sure.
To touch on what I said a few months back "I think bigLITTLE is something to think about and perhaps some FPGA tech being applied to designs. I wonder if perhaps the MB chipset will be turned into a FPGA or incorporate some of that tech same with CPU/GPU just re-route some new designs and/or re-configure them a bit depending on need they are wonderfully flexible in a great way perfect no, but they'll certainly improve and be even more useful. Unused USB/PCI-E/M.2 slots cool I'll be reusing that for X or Y. I think eventually it could get to that point perhaps hopefully and if it can be and efficiently that cool as hell." That's something I feel is another aspect of FPGA's being integrated and fused with CPU's. The CPU's these days have fixed hardware to handle a lot of stuff even things like direct CPU based PCIE connections. What happens with fixed function hardware is if that hardware isn't being utilized fully it's effectively wasted die space is it not!!?
Now with FPGA's handling some of those things and depending on how much extra space is required to do so you can actually bypass that downfall to fixed function hardware design not utilizing something repurpose the die space for something you're trying to do like sorcery. The traditional chipset could be eliminated in the future entirely replaced by a FPGA potentially or at least more of a twin socket CPU and no chipset with a infinity cache and infinity fabric connection between them both. They could even behave more like the human brain across a motherboard left one handle memory channels on the left side and other on the right side along with peripherals like PCIE lanes with shorter traces by making the PCIE distance between the CPU socket more symmetric. That could be part of the issue with mGPU as well the PCIE traces differ a fair amount because of the slot location nearer and further away in relation to the CPU. That's certainly a area that could be improved in practice.
Another part to touch on few months back I mentioned and feel is quite true. Eventually we need even more fixed function ASIC functions integrated into chip dies or FPGA's because the low hanging fruit on node shrinks is eroding. Where FPGA's come into the equation is die space is limited, but their programmability isn't though the extent it certainly is. That said you can't infinity put new ASIC fixed instructions into a chip die with the laws of physics diminishing the prospects of node shrinks at some stage or another. We either need a cost effective quantum computer break-thru of some sort or FPGA's cleverly being purposed and a delicate balance of critical fixed function instruction sets.
"I still really feel FPGA's could be the best all around solution outside of combining a variety of ASIC's to really specifically maximize and prioritize a handful of the individuals use cases. Eventually this will be one of the few low hanging fruits left to leverage so it has to happen eventually for both Intel and AMD not to mention Nvidia on the GPU side this is how it is going to be moving forward one way or another w/o a break thru on the manufacturing side or quantum computers really taking a foothold."
From like 3 years ago...the tides have turned the future is now!
"I hate to say it because I wish AMD luck in the future as they are a great company, but I strongly feel that Intel's FPGA is a enormous sleeping giant. FPGA's in general have so much potential to me as they can be configured appropriately to specific needs. I'm not sure why we don't have a CPU FPGA swamped with like 8 FPGA's around it that interconnect with it. You'd have tons of surface area for cooling and enormous amounts of reconfigurable power at hand especially if you had that with something like very potent APU at it's center with lots of AI machine learning capability to adapt to a users usage and needs."
In short, I don't like NVIDIA RT solution at all - it's wasteful (and consumers pays for it), it's incomplete and doomed to become obsolete in time.
If part of cores are capable to do a quick change from rasterization to (any) RT or vice-versa, it's a very flexible and elegant solution.
Goes to various other stuff, say anti-aliasing - some games clearly don't need it at all (though they don't usually need high computing power too, but lets think small stuff, like APUs). And a number of others. Not needed - used for something else. Especially in APUs and low-end.
Also - reasonably future-proof. As bloody TFOPs might actually get some meaning... Stuff just work, until there are fundamental changes or compute power just becomes too small.
Will be interesting what will NVIDIA do, because if AMD get Xilinx and Intel already has Altera and it's 85% of the FPGA-world, and since NVIDIA is 'coexisting peacefully' with both...
Perhaps it's worth mentioning that Intel-guys aren't probably just sitting on their collective ass, they are using interposers, have announced big.little design (could it be that it's FPGA-related? this big.little stuff puzzles me greatly since announced), they could be more ready than we think they are...
One last thing - probably just a feeeeeeling, but whole stuff smells like DC spirit, anyhow something we won't see in quite a while (except those who works with datacenters and web-servers and whathaveyou)
The stuff that allows performance jumps like this:
Having a fpga chiplet to handle interference would allow normal calculations at the same time as specialized. Intel has Bfloat 16 support as one of their cpu extensions, VNNI
software.intel.com/content/www/us/en/develop/articles/introduction-to-intel-deep-learning-boost-on-second-generation-intel-xeon-scalable.html
It clearly shows a parallel - series chip that incorporates a FPGA. Later patents include ASICS, GPU just like the Ryzen chip.
This patent was shown to Microsoft and there was a 3 month engagement concerning its use in their project Olympus.
They took the split motherboard all accelerated design details outlined in the patent family and used it verbatim in their Xseries Xbox.
Jonathan Glickman is the real inventor and architect of the parallel - series Adaptable Computing Machine that can dynamically reconfigure itself to suit any compute job.
Terms like pipe-lining, cascading. child processing, streaming and so only all can be done with conventional computers which pass results from one node to another.
A parallel series Adaptable Cluster prefetches data inside an application embedded accelerated controller ( ala Netezza style ) and then passes the results to another application embedded accelerated controller be it a storage, network, Memory or GPU. That is exactly what the new Xseries Xbox is doing its really a super computer like no other, its not a gaming machine rather a gaming appliance in the same lineage as Netezza. Soon all PCs will utilize this technology.
patents.google.com/patent/US20140330999A1/en
This was shown to Microsoft, see if any of you can guess what happened next
When one accelerated component passes results to another accelerated component this is dubbed a series
connection which differs from pipelining, cascading, streaming... which are all done with conventional computers for ages now.
The advantage of this Parallel-Series all accelerated architecture is that is doesn't need to load balance or be elastic since it can balance within by reconfigure itself dynamically
Foxconn tried to design around and failed and guess who showed it to Microsoft
Also hardware acceleration can be done by embedding GPU, ASICs and other components.
Really any technology that allows for an application to be embedded into a device controller will work