Tuesday, July 9th 2024
AMD "Strix Halo" a Large Rectangular BGA Package the Size of an LGA1700 Processor
Apparently the AMD "Strix Halo" processor is real, and it's large. The chip is designed to square off against the likes of the Apple M3 Pro and M3 Max, in letting ultraportable notebooks have powerful graphics performance. A chiplet-based processor, not unlike the desktop socketed "Raphael," and mobile BGA "Dragon Range," the "Strix Halo" processor consists of one or two CCDs containing CPU cores, wired to a large die, that's technically the cIOD (client I/O die), but containing an oversized iGPU, and an NPU. The point behind "Strix Halo" is to eliminate the need for a performance-segment discrete GPU, and conserve its PCB footprint.
According to leaks by Harukaze5719, a reliable source with AMD leaks, "Strix Halo" comes in a BGA package dubbed FP11, measuring 37.5 mm x 45 mm, which is significantly larger than the 25 mm x 40 mm size of the FP8 BGA package that the regular "Strix Point," "Hawk Point," and "Phoenix" mobile processors are built on. It is larger in area than the 40 mm x 40 mm FL1 BGA package of "Dragon Range" and upcoming "Fire Range" gaming notebook processors. "Strix Halo" features one or two of the same 4 nm "Zen 5" CCDs featured on the "Granite Ridge" desktop and "Fire Range" mobile processors, but connected to a much larger I/O die, as we mentioned.At this point, the foundry node of the I/O die of "Strix Halo" is not known, but it's unlikely to be the same 6 nm node as the cIOD that AMD has been using on its other client processors based on "Zen 4" and "Zen 5." It wouldn't surprise us if AMD is using the same 4 nm node as it did for "Phoenix," for this I/O die. The main reason an advanced node is warranted, is because of the oversized iGPU, which features a whopping 20 workgroup processors (WGPs), or 40 compute units (CU), worth 2,560 stream processors, 80 AI accelerators, and 40 Ray accelerators. This iGPU is based on the latest RDNA 3.5 graphics architecture.
For perspective, the iGPU of the regular 4 nm "Strix Point" processor has 8 WGPs (16 CU, 1,024 stream processors). Then there's the NPU. AMD is expected to carry over the same 50 TOPS-capable XDNA 2 NPU it uses on the regular "Strix Point," on the I/O die of "Strix Halo," giving the processor Microsoft Copilot+ capabilities.
The memory interface of "Strix Halo" has for long been a mystery. Logic dictates that it's a terrible idea to have 16 "Zen 5" CPU cores and a 40-Compute Unit GPU share even a regular dual-channel DDR5 memory interface at the highest possible speeds, as both the CPU and iGPU would be severely bandwidth-starved. Then there's also the NPU to consider, as AI inferencing is a memory-sensitive application.
We have a theory, that besides an LPDDR5X interface for the CPU cores, the "Strix Halo" package has wiring for discrete GDDR6 memory. Even a relatively narrow 128-bit GDDR6 memory interface running at 20 Gbps would give the iGPU 320 GB/s of memory bandwidth, which is plenty for performance-segment graphics. This would mean that besides LPDDR5X chips, there would be four GDDR6 chips on the PCB. The iGPU even has 32 MB of on-die Infinity Cache memory, which seems to agree with our theory of a 128-bit GDDR6 interface exclusively for the iGPU.
Sources:
Harukaze_5719 (Twitter), Olrak29 (Twitter), VideoCardz
According to leaks by Harukaze5719, a reliable source with AMD leaks, "Strix Halo" comes in a BGA package dubbed FP11, measuring 37.5 mm x 45 mm, which is significantly larger than the 25 mm x 40 mm size of the FP8 BGA package that the regular "Strix Point," "Hawk Point," and "Phoenix" mobile processors are built on. It is larger in area than the 40 mm x 40 mm FL1 BGA package of "Dragon Range" and upcoming "Fire Range" gaming notebook processors. "Strix Halo" features one or two of the same 4 nm "Zen 5" CCDs featured on the "Granite Ridge" desktop and "Fire Range" mobile processors, but connected to a much larger I/O die, as we mentioned.At this point, the foundry node of the I/O die of "Strix Halo" is not known, but it's unlikely to be the same 6 nm node as the cIOD that AMD has been using on its other client processors based on "Zen 4" and "Zen 5." It wouldn't surprise us if AMD is using the same 4 nm node as it did for "Phoenix," for this I/O die. The main reason an advanced node is warranted, is because of the oversized iGPU, which features a whopping 20 workgroup processors (WGPs), or 40 compute units (CU), worth 2,560 stream processors, 80 AI accelerators, and 40 Ray accelerators. This iGPU is based on the latest RDNA 3.5 graphics architecture.
For perspective, the iGPU of the regular 4 nm "Strix Point" processor has 8 WGPs (16 CU, 1,024 stream processors). Then there's the NPU. AMD is expected to carry over the same 50 TOPS-capable XDNA 2 NPU it uses on the regular "Strix Point," on the I/O die of "Strix Halo," giving the processor Microsoft Copilot+ capabilities.
The memory interface of "Strix Halo" has for long been a mystery. Logic dictates that it's a terrible idea to have 16 "Zen 5" CPU cores and a 40-Compute Unit GPU share even a regular dual-channel DDR5 memory interface at the highest possible speeds, as both the CPU and iGPU would be severely bandwidth-starved. Then there's also the NPU to consider, as AI inferencing is a memory-sensitive application.
We have a theory, that besides an LPDDR5X interface for the CPU cores, the "Strix Halo" package has wiring for discrete GDDR6 memory. Even a relatively narrow 128-bit GDDR6 memory interface running at 20 Gbps would give the iGPU 320 GB/s of memory bandwidth, which is plenty for performance-segment graphics. This would mean that besides LPDDR5X chips, there would be four GDDR6 chips on the PCB. The iGPU even has 32 MB of on-die Infinity Cache memory, which seems to agree with our theory of a 128-bit GDDR6 interface exclusively for the iGPU.
40 Comments on AMD "Strix Halo" a Large Rectangular BGA Package the Size of an LGA1700 Processor
Also earlier rumors said the memory was DDR5 256bit.
And since Ngreedia is getting away with murder (pricing wise) with the halo 4090, I wonder if AMD is going for the same “motif” given the Halo word in the name?
I hope is not strictly a gaming device and instead also used in a more professional segments/devices. Or at the very least, in a mini pc, from Minisforums, for example.
Edit: Oh the article is saying 128-bit GDDR6 not LPDDRX. Now that would be cool.
That's still a very CPU-weighted config for gamers who honestly won't want to overpay for 8-10 cores they'll never use, and most likely the only configuration that will include the full-fat 40CU GPU component.
The sensible Ryzen7 or Ryzen5 variants that have 6-8 cores games need will likely come with crippled 32CU or 28CU GPUs in them which is acceptable, but not worth much - 6600S (28CU RDNA2 dGPU) laptops were occupying the sub-$1000 entry-level bargain bin 18 months ago. They're fine for casual gaming and esports but hardly what I'd call bleeding edge and already struggling in plenty of modern titles at 1080p.
Non-gamers likely aren't interested because no matter how good AMD's GPU compute performance is, they don't use CUDA which is a massive gatekeeper for the entire productivity industry, and 4060 laptops are cheap, even available in thin-and-lights that pull a mere 120W from the wall outlet. I don't see Strix Halo competing well with those, especially since the slides here indicate a potential 175W power draw so that's definitely not going to be a thin-and-light laptop.
Let's see how Strix Halo performs and which products are built around it. I have my own ideas of how a fat CPU/GPU SoC can be used but the market is way more creative.
With more than double the CU's and TDP headroom it's easily a 4050-4060-5050 competitor.
Strix Point is more for thin laptop/mini PC. Strix Halo would suit desktop a lot more. Within one day you would go shopping for the biggest cooler.
There's no GDDR6 there, but there's 32MB Infinity Cache for the iGPU. Why? It'll probably be cheaper and faster to get a 12/16-core Ryzen 9 with a discrete GPU. Especially if you wait for RDNA4. It's expected to have RTX 4070 Laptop performance (desktop RTX 4060 Ti chip) but without the 8GB VRAM limitation.
In fact, Strix Halo is probably only going to appear in expensive laptops with 32GB LPDDR5X or more, as there's been shipping manifestos with Strix Halo test units carrying 128GB RAM.
/s
I simply dont understand the automatic hate that all AMD products get, accompanied by false statements.
If you put all of this together you soon find out why AMD ended up with the design they did that isn't going to be insanely expensive, so will actually end up with mass market appeal whilst doing a solid job of making a product that is designed specifically to not have a discrete GPU alongside it, thus eliminating the sale of a nVidia GPU, and making a product that beats anything intel has to offer.
This is looking like a good product to me, and much as I had assumed already, leaks are suggesting more and more that this is the first in a whole line of products.! The 265-bit bus laptop/desktop CPU/APU's are just over the horizon and IMHO, this is why Strix Halo is the product I am most interested in seeing this year, not least because of some very interesting use cases that look very promising, what OEM's will do with it, what mini-desktop PC's will look like on the inside, what the public want to see from version 2, and ultimately where AMD decides to take version 2, 3, etc.
I almost forgot to say that there will be variants that have fewer than 16 cores and fewer than 40CU's of GPU performance and yes there are already lots of people calling for a single CCD version ideally with 3D V-Cache if that is possible and a fully enabled 40CU GPU because that would be fantastic for gaming, but others are looking for a 256GB Strix Halo laptop fully enabled (16c 40CU) because they simply need it.
Also remember that this is essentially a new market and AMD has some choices to make, no doubt they are even reading comments like this in forums to get an idea of what people want and what people expect. As much as people (myself included) often laugh at marketing, it is important to launch the right product in the right segment at the right price, to keep customers happy and buying by providing the products that people actually want which is rarely the top models.
The number of things that use OpenCL to any success these days is dwindling by the day, and ROCm support is a noble attempt but its influence so far on the software market is somewhere between "absolute zero" and "too small to measure". It's why Nvidia is now the most valuable company on earth, bar none. I certainly don't like that fact, but it's the undeniable truth.
I kept wondering how this product makes any sense, given the bw is so low and the cache does not appear to have changed versus RDNA3.
FWIW, I really like the sporadic "TPU has a theory" last paragraph inclusion that has been included in some news articles as of late.
If it were me, I too would include a last paragraph with an editorial (perhaps italicized and/or with an asterisk). I think this is very good writing that encompasses both available info and what we know it needs.
Thanks for the personal insight (along with the news).
Don't be afraid to keep writing the correctly-compartmentalized editorials! This is, ofc, (a very large part of) what makes TPU special.
TLDR: Keep up the good work, btarunr (in both regards to news and analysis), and don't be afraid to continue to show us that our News Guy actually has an innate understanding of the stuff they are reporting.
:lovetpu:
Edit: was using 7600 math (2048sp) in former calculation (~2.8ghz), not 2560sp. Wasn't completely awake yet when I wrote that (or even as I write this, for that matter).
Erased that JIC anyone caught it. :laugh:
Still, the GDDR/LPDDR theoretical split for bw makes sense wrt VERY efficient clocks at perfect yield (2560sp) or higher-clocked (but still on the voltage/yield curve) using a lesser-enabled part (such as 2048sp).
I really should have had another cup of coffee before writing anything; apologies for my blunder. :oops:
One thing I have been curious about, with this path AMD has been going down. Why have GPU AI accelerators and a dedicated NPU? Is the space the GPU accelerators take up mostly insignificant? What kind of capability overlap do they have? What makes them unique?
RDNA 3.5 is being used in packages like this and won't be offered on a dedicated discrete card where those AI accelerators make sense, as you'd likely not have an NPU. It seems like if your package has an NPU, you could have designed RDNA 3.5 to not have those AI accelerators at all. But AMD chose to leave them there for a reason. I wonder what that reasoning is.
Just forget I said anything.
I'm feeling pretty foolish at the moment (outside the commending the observation of possible GDDR6).
I usually think about things a lot before I post; I don't know what I was thinking with that one. I would just delete it, but won't hide that everyone makes mistakes, myself included..
AMD would probably want to minimize the die size & having multiple memory interfaces supported will do the opposite, the only way GDDR6 support is plausible is if any of these chips go into a console or something!
Certainly the difference between playing a game at 1080p60 or not.
Also, if 4 (16Gb/2GB) chips...8GB. That's what, like ~8-12W? I would take that trade-off 100% of the time, personally.
273gbps with just LPDDR5x; that's not even enough bandwidth for a desktop 7600 (given the similar cache) alone, ntm 7600 was always borderline-acceptable as a contemporary GPU; less-so moving forward.
By all accounts the mobile N33 was a failure (for not meeting even close to that standard); hence why we got the 7600XT (16GB) on desktop; why would they settle for trying to attract the same failed market?
I have to imagine this thing was created to keep pace with (at least) the PS5 (or in laptop GPU terms; at least a 4070 mobile), at least as an *option*.
They could run very low clocks w/o it and that's all fine and good wrt power or competing with Intel...but not when versus a contemporary discrete GPU; most generally target at least that metric.
Add to this...Navi 32 was never released as a mobile part AFAIK. I wonder why that could be...IMO probably because it would be a more-efficient (if-expensive) option than this.
That's why this part makes very little sense to me without some kind of additional bandwidth option....why create something so large if not to do battle with something like a 4070 mobile (and win)?
Besides the obvious reasons, a 'sideport' (yes, I know it's not exactly the same thing) of GDDR6 would make sense for a host of reasons, some that are in that article from 20 years ago.