Tuesday, July 9th 2024
AMD "Strix Halo" a Large Rectangular BGA Package the Size of an LGA1700 Processor
Apparently the AMD "Strix Halo" processor is real, and it's large. The chip is designed to square off against the likes of the Apple M3 Pro and M3 Max, letting ultraportable notebooks have powerful graphics performance. A chiplet-based processor, not unlike the desktop socketed "Raphael" and mobile BGA "Dragon Range," the "Strix Halo" processor consists of one or two CCDs containing CPU cores, wired to a large die that is technically the cIOD (client I/O die), but which contains an oversized iGPU and an NPU. The point behind "Strix Halo" is to eliminate the need for a performance-segment discrete GPU, and conserve the PCB footprint it would occupy.
According to leaks by Harukaze5719, a reliable source for AMD leaks, "Strix Halo" comes in a BGA package dubbed FP11, measuring 37.5 mm x 45 mm, which is significantly larger than the 25 mm x 40 mm FP8 BGA package that the regular "Strix Point," "Hawk Point," and "Phoenix" mobile processors are built on. It is even larger in area than the 40 mm x 40 mm FL1 BGA package of the "Dragon Range" and upcoming "Fire Range" gaming notebook processors. "Strix Halo" features one or two of the same 4 nm "Zen 5" CCDs featured on the "Granite Ridge" desktop and "Fire Range" mobile processors, but connected to a much larger I/O die, as we mentioned.

At this point, the foundry node of the "Strix Halo" I/O die is not known, but it's unlikely to be the same 6 nm node as the cIOD that AMD has been using on its other client processors based on "Zen 4" and "Zen 5." It wouldn't surprise us if AMD is using the same 4 nm node for this I/O die as it did for "Phoenix." The main reason an advanced node is warranted is the oversized iGPU, which features a whopping 20 workgroup processors (WGPs), or 40 compute units (CU), worth 2,560 stream processors, 80 AI accelerators, and 40 Ray accelerators. This iGPU is based on the latest RDNA 3.5 graphics architecture.
For perspective, the iGPU of the regular 4 nm "Strix Point" processor has 8 WGPs (16 CU, 1,024 stream processors). Then there's the NPU. AMD is expected to carry over the same 50 TOPS-capable XDNA 2 NPU it uses on the regular "Strix Point" to the I/O die of "Strix Halo," giving the processor Microsoft Copilot+ capabilities.
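To recap the shader math in the paragraphs above, here is a minimal sketch assuming the standard RDNA ratios of 2 CUs per WGP, with 64 stream processors, 2 AI accelerators, and 1 Ray accelerator per CU:

```python
# Minimal sketch of the RDNA CU math quoted above, assuming the usual
# ratios: 2 CUs per WGP; 64 stream processors, 2 AI accelerators, and
# 1 Ray accelerator per CU.
def rdna_igpu_counts(wgps: int) -> dict:
    cus = wgps * 2
    return {
        "CUs": cus,
        "stream_processors": cus * 64,
        "ai_accelerators": cus * 2,
        "ray_accelerators": cus,
    }

print("Strix Halo: ", rdna_igpu_counts(20))  # 40 CU, 2,560 SP, 80 AI, 40 RT
print("Strix Point:", rdna_igpu_counts(8))   # 16 CU, 1,024 SP, 32 AI, 16 RT
```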
The memory interface of "Strix Halo" has long been a mystery. Logic dictates that it's a terrible idea to have 16 "Zen 5" CPU cores and a 40-compute unit iGPU share even a regular dual-channel DDR5 memory interface at the highest possible speeds, as both the CPU and iGPU would be severely bandwidth-starved. Then there's also the NPU to consider, as AI inferencing is a memory-sensitive application.
We have a theory that, besides an LPDDR5X interface for the CPU cores, the "Strix Halo" package has wiring for discrete GDDR6 memory. Even a relatively narrow 128-bit GDDR6 memory interface running at 20 Gbps would give the iGPU 320 GB/s of memory bandwidth, which is plenty for performance-segment graphics. This would mean that besides the LPDDR5X chips, there would be four GDDR6 chips on the PCB. The iGPU even has 32 MB of on-die Infinity Cache memory, which seems to agree with our theory of a 128-bit GDDR6 interface exclusively for the iGPU.
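As a quick check of the arithmetic, peak DRAM bandwidth is simply bus width times per-pin data rate; a minimal sketch comparing the theorized 128-bit GDDR6 interface with a regular dual-channel (128-bit) DDR5-5600 setup:

```python
# Peak theoretical DRAM bandwidth: bus width (bits) x data rate (Gbps) / 8.
def bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    return bus_width_bits * data_rate_gbps / 8

# The theorized 128-bit GDDR6 interface at 20 Gbps:
print(bandwidth_gbs(128, 20.0))  # 320.0 GB/s, matching the figure above

# A regular dual-channel (128-bit) DDR5-5600 interface, for contrast:
print(bandwidth_gbs(128, 5.6))   # 89.6 GB/s
```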
Sources:
Harukaze_5719 (Twitter), Olrak29 (Twitter), VideoCardz
40 Comments on AMD "Strix Halo" a Large Rectangular BGA Package the Size of an LGA1700 Processor
The NPU (Neural Processing Unit, "AI engine" in Intel speak) is a different thing, a separate unit that may or may not be included in the die design. AMD seems to be making a beeline for all of their "client" (end user) chips to have a separate NPU. They have started with the mobile line because the NPU can save a lot of battery power, being able to do the same job as the GPU or CPU at much lower power consumption. This is not very critical on desktops, and many desktops also have more powerful dedicated GPUs and more powerful CPUs that can handle the AI workload (that MicroShaft etc. envisage), so they can wait a while to get dedicated NPUs in future desktop CPU versions (Zen 6 I expect).
As for why I'd rather have this? I have specific needs where a dGPU would be a crutch, and building a new AM5 machine would be significantly more efficient in power draw, thermals, noise, physical space used (I find the current trend for 3 slots + 35cm cards to be absolutely disgusting), peripheral support, and it wouldn't be that much worse in cost either since selling my current AM4 machine would cover a large chunk of the cost.
4070 mobile (really more of a low-clocked 4060 Ti Super) more or less can/does, depending on which model you buy and how you use it.
This could (and should) be roughly equal to all of those without GDDR6, and better than all of them with it, as perf is perf.
This could actually be a chip that hits that sweet spot of better than most mobile 4070's and much, much cheaper than a 4080 mobile (which is actually a cut-down 4070 desktop)...literally in the center and good enough for general laptop (or even general [1080p60] PC) gaming...but it needs the BW...which the current LPDDR5X/cache simply cannot provide.
As I say, they can go at it with low clocks and high efficiency, and that's fine (as there is a market for that), but sub-optimal GPU perf is still sub-optimal GPU perf regardless.
I'm saying there is a market for what they COULD do, which IMO is the only reason you specifically make this chip. I'm sure a vanilla option will still look nice wrt power/perf.
The thesis of adding GDDR6 clicked everything into place for me, as I had not even considered that possible.
In reality though, it makes perfect sense (for those willing to use a higher power envelope, just as those who would buy a discrete 80-120 W NVIDIA laptop GPU would do).
I'm simply saying before, it looked like they were attempting to kill some small birds (other CPUs/SoCs) with a very big stone, because it lacked the bandwidth to push it into competing with a discrete GPU; maybe take some market share from <4070 laptops...but those aren't for (most, non e-sport) gaming anyway.
Now it would appear they can kill multiple birds with one stone...and some GDDR6. They could/should be able to compete with, if not exceed, the performance of 4070 laptops in a tangible way for less money.
People could actually have a decent 1080p60 SoC laptop.

I assume you are using 1.35 V with your metric. Try using 1.1 V. I don't know if any current products use Samsung @ 1.1 V? I think most use Hynix @ 1.35 V (and sometimes substitute in Samsung at 1.35 V).
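For a rough sense of what that voltage difference means, dynamic power scales roughly with the square of voltage at a given clock; a back-of-envelope sketch (first order only, ignoring static power and I/O termination):

```python
# Back-of-envelope estimate: dynamic power scales roughly with V^2 at a
# fixed frequency, so 1.1 V GDDR6 vs 1.35 V GDDR6 is a sizeable saving.
v_standard = 1.35    # V, typical GDDR6 (e.g., Hynix parts)
v_low = 1.10         # V, low-voltage GDDR6 (e.g., some Samsung parts)

relative = (v_low / v_standard) ** 2
print(f"~{relative:.2f}x dynamic power")          # ~0.66x
print(f"~{(1 - relative) * 100:.0f}% reduction")  # ~34% less, first order
```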
I'm not saying it's MORE efficient, I'm saying it's (potentially) not nearly as bad as you're implying, and the power/performance trade-off would be worth it for someone that was deciding between a productivity machine and a budget gaming laptop in a similar price range; especially versus something like a laptop with a 4070/4080 mobile inside of it (which would be much more expensive and/or likely use even more power).
I guess we'll just see how it goes?
It will be interesting to see how such setups without (or conceivably with) GDDR6 match up against competing solutions (both in productivity and gaming; iGPU/7600/4060/4070 laptops). I guess I'm just more optimistic this will be a low-cost, good-enough option for many different kinds of people/markets vs their direct competition regardless of TDP configuration...although it will be interesting to see the power required to achieve parity with 4060/4070 mobile. That is indeed possible; unified pool or not, there's always conceivably a crossbar/HUMA.
I messed up though thinking it would be 128+128, not 256+128-bit. I don't know why I subtracted from the 'known' 256-bit LPDDR5x controller to add the possible GDDR6 controller. Whoops. :laugh:
Again, who knows...Like you say: leaks and rumors...maybe it could be 128+128 after all. My (perhaps wrong) thinking was that 256-bit was known, but it was unknown that a 128-bit interface could be wired out to GDDR6.
Just making conversation and attempting conceivable projections and their use-cases. Never trying to proclaim infallibility vs what might actually transpire.
The thing I never understood about a 256-bit LPDDR5X controller is...wouldn't that require four sticks of RAM? That's pretty weird. Not impossible; just unconventional.
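On the "four sticks" question: LPDDR5X is soldered down rather than socketed, and the bus is built from narrow 16-bit channels grouped into multi-chip packages, so no DIMMs are involved. A hedged sketch of one plausible composition (the x64 package width is an assumption):

```python
# One plausible composition of a 256-bit LPDDR5X interface. LPDDR5X uses
# 16-bit channels; grouping them into x64 packages is an assumption here.
TOTAL_BUS_BITS = 256
CHANNEL_BITS = 16    # LPDDR5X channel width
PACKAGE_BITS = 64    # hypothetical x64 multi-channel package

print(TOTAL_BUS_BITS // CHANNEL_BITS, "channels")   # 16 channels
print(TOTAL_BUS_BITS // PACKAGE_BITS, "packages")   # 4 soldered packages

# Peak bandwidth at LPDDR5X-8000 (8 Gbps per pin), for comparison with
# the 320 GB/s GDDR6 figure from the article:
print(TOTAL_BUS_BITS * 8.0 / 8, "GB/s")             # 256.0 GB/s
```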
Strix Halo is a premium chip for premium Windows laptops that will compete with the premium MacBook Pro models. It's above all a competitor to the M3 Max and probably the M4 Max.
It's a great all-around, no-cut-corners big SoC for laptops: a capable GPU for gaming and GPU-accelerated tasks like video/image editors, a whopping 16-core Zen5 CPU (no Zen5c BTW) for demanding multithreaded tasks like simulation and product development tools, a powerful ~50 TOPS NPU to run generative AI models, and access to a truckload of RAM thanks to its 256-bit width. It's everything at once.
In fact, there are more than 16 Zen5 cores in the solution. There are an additional 4 Zen5LP cores inside the I/O+GPU chip that consume very little power and clock very slowly, but take over the OS tasks while the system is idling. It's AMD's answer to Qualcomm's superior power efficiency on low-demand loads, so that these premium Windows laptops get the same 12-16 h battery life on light usage.
So don't count on the full Strix Halo to appear on anything that isn't a premium laptop above $2500. As much as even I'd love to see ~$1000 gaming handhelds with a cut-down version of Strix Halo, the chip is going into laptops competing against the $4000 MBP M3 Max.
I do not know what difference this would make, whether some of those many traces could be used for video output instead of PCIe, or whether it would require a whole new design. On that note, the current EPYC (and thus Threadripper) socket is physically capable of 12 DDR5 memory channels, restricted with Threadrippers to 8 and 6 channels. FYI, there is also the SP6 socket to look at: it (for now at least) uses the same physical IOD as EPYC and Threadripper chips, but sits on a smaller physical chip substrate with a smaller socket, fewer memory channels and PCIe lanes, and is (for now) restricted to 4x 16-core Zen 4c chiplets. This to me is a much closer basis for a new HEDT platform with reduced costs.
SP6 also shows that AMD is in a phase of serious expansion in all markets, specifically here "lower end servers" (and hopefully low-end Threadrippers) that are currently limited to 64x Zen 4c cores, "only" 6 channels of DDR5, and 96 PCIe 5 lanes. Whatever socket Strix Halo uses will be yet another new socket in a short period of time, and IMHO the start of a whole new family of products all using 256-bit RAM. As Strix Halo is going to be the first of a new line of products, its success, its pros and cons, etc. will all be scrutinised and no doubt tested to the nth degree. People will find some interesting niches for this product and potential future avenues to aim towards, whether that means tweaking the socket for the 2nd generation (RAM changes will typically do that), or whether Strix Halo spawns its own split in product lines: one towards an affordable HEDT/Threadripper, and the other as a high-performance SoC that does not require a dedicated GPU. Time will tell, and IMHO Strix Halo is my most anticipated product this year (even if it's delayed yet again and is released next year), specifically because it is essentially a whole new class of product. Otherwise, Zen 6 is going to bring a minor revolution at the technical level and highlight technologies to come and the direction of travel at the mass-market desktop level.