
Apple Patents Multi-Level Hybrid Memory Subsystem

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,729 (1.01/day)
Apple has today patented a new approach to how it uses memory in the System-on-Chip (SoC) subsystem. With the announcement of the M1 processor, Apple moved away from Intel-supplied chips to a fully custom SoC design called Apple Silicon. The new designs have to integrate every component, such as the Arm CPU and a custom GPU. Both of these processors need fast memory access, and Apple has been working on the problem of having the CPU and the GPU access the same pool of memory. The so-called UMA (unified memory architecture) can become a bottleneck because both processors share the bandwidth and the total memory capacity, which can leave one processor starved in some scenarios.
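As a rough illustration of the shared-bandwidth bottleneck described above, here is a minimal sketch in Python. The function name `split_bandwidth`, the proportional-sharing rule, and all numbers are assumptions made for illustration; nothing here is taken from the patent or from how Apple's memory controller actually arbitrates.

```python
def split_bandwidth(total_gbps, demands_gbps):
    """Share one memory pool's bandwidth among clients, proportionally to demand.

    Proportional sharing is an assumption for illustration only; a real
    memory controller may arbitrate quite differently.
    """
    requested = sum(demands_gbps.values())
    if requested <= total_gbps:
        return dict(demands_gbps)          # enough bandwidth: every request is met
    scale = total_gbps / requested         # oversubscribed: scale every client down
    return {name: demand * scale for name, demand in demands_gbps.items()}


# Hypothetical numbers: a 60 GB/s pool, CPU wants 30 GB/s, GPU wants 90 GB/s.
grants = split_bandwidth(60.0, {"cpu": 30.0, "gpu": 90.0})
print(grants)
```

With these made-up demands the pool is oversubscribed, so both clients receive less than they asked for, which is the "starving" scenario the article refers to.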

Apple has patented a design that aims to solve this problem by combining a high-bandwidth cache DRAM with a high-capacity main DRAM. "With two types of DRAM forming the memory system, one of which may be optimized for bandwidth and the other of which may be optimized for capacity, the goals of bandwidth increase and capacity increase may both be realized, in some embodiments," says the patent, "to implement energy efficiency improvements, which may provide a highly energy-efficient memory solution that is also high performance and high bandwidth." The patent was filed back in 2016, which means we could start seeing this technology in future Apple Silicon designs following the M1 chip.
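The general idea of a small, fast DRAM tier sitting in front of a large main DRAM can be sketched as a toy model. The class name `TwoTierMemory`, the LRU replacement policy, and the write-through behavior are all assumptions chosen for illustration; the patent text quoted above does not specify any of them.

```python
from collections import OrderedDict


class TwoTierMemory:
    """Toy model: a small 'cache DRAM' tier in front of a large 'main DRAM' tier."""

    def __init__(self, cache_lines):
        self.cache_lines = cache_lines      # capacity of the fast tier, in lines
        self.cache = OrderedDict()          # line address -> data, kept in LRU order
        self.main = {}                      # backing store, unbounded in this toy
        self.hits = 0
        self.misses = 0

    def write(self, addr, value):
        self.main[addr] = value             # write-through (assumed): main tier always updated
        if addr in self.cache:
            self.cache[addr] = value        # keep a cached copy consistent
            self.cache.move_to_end(addr)

    def read(self, addr):
        if addr in self.cache:
            self.hits += 1
            self.cache.move_to_end(addr)    # mark as most recently used
            return self.cache[addr]
        self.misses += 1
        value = self.main.get(addr, 0)      # unwritten lines read as 0 in this toy
        self.cache[addr] = value            # fill the fast tier on a miss
        if len(self.cache) > self.cache_lines:
            self.cache.popitem(last=False)  # evict the least recently used line (assumed policy)
        return value


mem = TwoTierMemory(cache_lines=2)
for addr in (0, 1, 2):
    mem.write(addr, addr * 10)
reads = [mem.read(a) for a in (0, 1, 0, 2, 1)]
print(reads, mem.hits, mem.misses)
```

Only the repeated read of address 0 is served from the fast tier here; the other reads fall through to the large tier, which is why the capacity and replacement policy of the fast tier matter so much in practice.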

Update 21:14 UTC: We have been contacted by Mr. Kerry Creeron, an attorney with the firm of Banner & Witcoff, who provided us with additional insights and his personal commentary about the patent. You can find Mr. Creeron's quote below.



Kerry Creeron, an attorney with the firm of Banner & Witcoff, said:
High-level, the patent covers a memory system having a cache DRAM that is coupled to a main DRAM. The cache DRAM is less dense and has lower energy consumption than the main DRAM. The cache DRAM may also have higher performance. A variety of different layouts are illustrated for connecting the main and cache DRAM ICs, e.g. in FIGS. 8-13. One interesting layout involves through-silicon vias (TSVs) that pass through a stack of main DRAM memory chips.

Theoretically, such layouts might be useful for adding additional slower DRAM to Apple's M1 chip architecture.

Finally, I note that the lead inventor, Biswas, was with PA Semi before Apple acquired it.

 

TheLostSwede

News Editor
Joined
Nov 11, 2004
Messages
18,013 (2.44/day)
Location
Sweden
System Name Overlord Mk MLI
Processor AMD Ryzen 7 7800X3D
Motherboard Gigabyte X670E Aorus Master
Cooling Noctua NH-D15 SE with offsets
Memory 32GB Team T-Create Expert DDR5 6000 MHz @ CL30-34-34-68
Video Card(s) Gainward GeForce RTX 4080 Phantom GS
Storage 1TB Solidigm P44 Pro, 2 TB Corsair MP600 Pro, 2TB Kingston KC3000
Display(s) Acer XV272K LVbmiipruzx 4K@160Hz
Case Fractal Design Torrent Compact
Audio Device(s) Corsair Virtuoso SE
Power Supply be quiet! Pure Power 12 M 850 W
Mouse Logitech G502 Lightspeed
Keyboard Corsair K70 Max
Software Windows 10 Pro
Benchmark Scores https://valid.x86.fr/yfsd9w
I thought the M1 already did something like this.
 
Joined
Jan 8, 2017
Messages
9,603 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Broadwell had basically the same thing, except the DRAM was integrated on the chip. I am not sure how worthwhile it is to have it on the same package instead. Not very good if you're trying to save power either, compared with having it integrated.

So it seems to me like this is just a worse version of what Intel did with their L4 cache.
 

bug

Joined
May 22, 2015
Messages
13,948 (3.95/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
So... Apple just invented cached RAM access?
 
Joined
May 30, 2015
Messages
1,958 (0.56/day)
Location
Seattle, WA
Broadwell had basically the same thing, except the DRAM was integrated on the chip. I am not sure how worthwhile it is to have it on the same package instead. Not very good if you're trying to save power either, compared with having it integrated.

So it seems to me like this is just a worse version of what Intel did with their L4 cache.

Haswell and Broadwell eDRAM L4 was also its own die wired through the substrate. They did this to reduce manufacturing complexity and offer two zones of clock control for performance binning. External tiers of cache are nothing new, and neither are stacked caches. This is something that goes back decades. What is new is the order of cache access, landing size, wire latency and bandwidth.

1611752782840.png
 
Joined
Mar 3, 2020
Messages
127 (0.07/day)
Location
Australia
System Name wasted talent
Processor i5-11400F
Motherboard Gigabyte B560M Aorussy Pro
Cooling Silverstone AR12
Memory Patriot Viper Steel 2X8 4400 @ 3600 C14,14,12,28
Video Card(s) Sapphire RX 6700 Pulse, Galax 1650 Super EX
Storage Kingston A2000 500GB
Display(s) Gigabyte M27Q
Case open mATX: zwzdiy.cc/M/Product/209574419.html
Audio Device(s) HiFiMan HE400SE
Power Supply Strix Gold 650W
Mouse Skoll Mini, G502 LightSpeed
Keyboard Akko 3084S
Software 1809 LTSC
Benchmark Scores 3968/540 CB R20 MT/ST
Haswell and Broadwell eDRAM L4 was also its own die wired through the substrate. They did this to reduce manufacturing complexity and offer two zones of clock control for performance binning. External tiers of cache are nothing new, and neither are stacked caches. This is something that goes back decades. What is new is the order of cache access, landing size, wire latency and bandwidth.

I just wonder how it compares to eDRAM. I'm not expecting this to be an order of magnitude faster, but putting it on the SoC die(?) instead of a separate eDRAM die should at least provide a worthwhile speed boost compared to eDRAM.
 
Joined
Jul 7, 2019
Messages
954 (0.47/day)
I wonder if AMD will eventually implement similar for their own APUs, but maybe with HBM instead, or maybe even on the board again like the old Phenom-era mobo cache; except where the HBM could be used either by the CPU or the GPU as-needed.
 
Joined
Feb 8, 2020
Messages
19 (0.01/day)
Location
Ottawa, Canada
System Name La Machina
Processor AMD Ryzen 2700
Motherboard ASUS B450 TUF mATX
Cooling EVO 212
Memory Corsair 3200MHz CL16
Video Card(s) RX 560
Storage Some SSD here, some old spinning stuff there
Display(s) 4k Samsung TV and an Asus Pro Art 231
Case Some microatx Antec
Audio Device(s) ASUS Essence STX
Power Supply Seasonic 600W maybe?
I wouldn't expect this to be a real performance boosting design (maybe between UMA and some wire reduced latency?), but rather a change up in the bill of materials and manufacturing approaches. This is like Intel's recent efforts in a lot of ways.

This is also in the context of Big-Small core setup. So, there may be some in-house special sauce that calls to optimizing that architecture setup against an on-package bit of DRAM.

Throw on some degree of MRAM tech or whatever to provide some quasi speedy permanent cache on the package... Really start veering off that "it's either L$, RAM or on the disk" dogma, and incorporate layers of software priority for OS and applications to help further optimize latency... That could get interesting for reducing the PCB foot print of these devices.
 
Joined
Feb 1, 2013
Messages
1,276 (0.29/day)
System Name Gentoo64 /w Cold Coffee
Processor 9900K 5.2GHz @1.312v
Motherboard MXI APEX
Cooling Raystorm Pro + 1260mm Super Nova
Memory 2x16GB TridentZ 4000-14-14-28-2T @1.6v
Video Card(s) RTX 4090 LiquidX Barrow 3015MHz @1.1v
Storage 660P 1TB, 860 QVO 2TB
Display(s) LG C1 + Predator XB1 QHD
Case Open Benchtable V2
Audio Device(s) SB X-Fi
Power Supply MSI A1000G
Mouse G502
Keyboard G815
Software Gentoo/Windows 10
Benchmark Scores Always only ever very fast
Tacking on eDRAM to a fast x86 processor is one thing, but adding it to a slow ARM chip is quite another.
 
Joined
Feb 11, 2009
Messages
5,625 (0.97/day)
System Name Cyberline
Processor Intel Core i7 2600k -> 12600k
Motherboard Asus P8P67 LE Rev 3.0 -> Gigabyte Z690 Auros Elite DDR4
Cooling Tuniq Tower 120 -> Custom Watercoolingloop
Memory Corsair (4x2) 8gb 1600mhz -> Crucial (8x2) 16gb 3600mhz
Video Card(s) AMD RX480 -> RX7800XT
Storage Samsung 750 Evo 250gb SSD + WD 1tb x 2 + WD 2tb -> 2tb MVMe SSD
Display(s) Philips 32inch LPF5605H (television) -> Dell S3220DGF
Case antec 600 -> Thermaltake Tenor HTCP case
Audio Device(s) Focusrite 2i4 (USB)
Power Supply Seasonic 620watt 80+ Platinum
Mouse Elecom EX-G
Keyboard Rapoo V700
Software Windows 10 Pro 64bit
Tacking on eDRAM to a fast x86 processor is one thing, but adding it to a slow ARM chip is quite another.

but what about adding it to a fast ARM chip? like is the case here?
 
Joined
Feb 1, 2013
Messages
1,276 (0.29/day)
System Name Gentoo64 /w Cold Coffee
Processor 9900K 5.2GHz @1.312v
Motherboard MXI APEX
Cooling Raystorm Pro + 1260mm Super Nova
Memory 2x16GB TridentZ 4000-14-14-28-2T @1.6v
Video Card(s) RTX 4090 LiquidX Barrow 3015MHz @1.1v
Storage 660P 1TB, 860 QVO 2TB
Display(s) LG C1 + Predator XB1 QHD
Case Open Benchtable V2
Audio Device(s) SB X-Fi
Power Supply MSI A1000G
Mouse G502
Keyboard G815
Software Gentoo/Windows 10
Benchmark Scores Always only ever very fast
but what about adding it to a fast ARM chip? like is the case here?
3.2GHz core isn't exactly fast by today's standards, but what is its actual non-core speed that feeds the cache? When Apple/Intel first introduced it on Broadwell, core and uncore were running well in excess of 4GHz.
 
Joined
Jul 10, 2017
Messages
2,671 (0.97/day)
I wonder if AMD will eventually implement similar for their own APUs, but maybe with HBM instead, or maybe even on the board again like the old Phenom-era mobo cache; except where the HBM could be used either by the CPU or the GPU as-needed.
I'm sure it never occurred to the engineers at AMD.
 
Joined
Nov 4, 2005
Messages
12,056 (1.72/day)
System Name Compy 386
Processor 7800X3D
Motherboard Asus
Cooling Air for now.....
Memory 64 GB DDR5 6400Mhz
Video Card(s) 7900XTX 310 Merc
Storage Samsung 990 2TB, 2 SP 2TB SSDs, 24TB Enterprise drives
Display(s) 55" Samsung 4K HDR
Audio Device(s) ATI HDMI
Mouse Logitech MX518
Keyboard Razer
Software A lot.
Benchmark Scores Its fast. Enough.
So, like Socket Super 7 in the late '90s that had on-board cache, Apple is now doing it?


I wonder who they are going to patent troll with this as well.....
 
Joined
Mar 21, 2016
Messages
2,512 (0.78/day)
So... Apple just invented cached RAM access?
They marginally modified what others invented to avoid lawsuits and patented that as their very own cutting-edge innovation that will cost a torso, because an arm and a leg is no longer enough. AMD should just replace the DRAM on the substrate with an FPGA and patent that. They can look into how or why to utilize it after. Hey look, we did a thing too.
 
Joined
Oct 15, 2011
Messages
2,543 (0.52/day)
Location
Springfield, Vermont
System Name KHR-1
Processor Ryzen 9 5900X
Motherboard ASRock B550 PG Velocita (UEFI-BIOS P3.40)
Memory 32 GB G.Skill RipJawsV F4-3200C16D-32GVR
Video Card(s) Sparkle Titan Arc A770 16 GB
Storage Western Digital Black SN850 1 TB NVMe SSD
Display(s) Alienware AW3423DWF OLED-ASRock PG27Q15R2A (backup)
Case Corsair 275R
Audio Device(s) Technics SA-EX140 receiver with Polk VT60 speakers
Power Supply eVGA Supernova G3 750W
Mouse Logitech G Pro (Hero)
Software Windows 11 Pro x64 23H2
I wonder if AMD will eventually implement similar for their own APUs, but maybe with HBM instead, or maybe even on the board again like the old Phenom-era mobo cache; except where the HBM could be used either by the CPU or the GPU as-needed.
FTR, socket 462 was their last FSB desktop platform! AMD is well known for moving stuff to on-die, before Intel. For Intel, OTOH, they still did FSB until the first-gen Core i-series.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,173 (2.78/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
This feels like just another level in the memory hierarchy. I had always felt that the CPU needed something between the last level of on-die CPU cache and system memory. HBM seemed like it would be the best option for this kind of thing because of its memory density and fairly low power usage and the on-package memory that Apple uses seems to be a lot like that. So it makes a boatload of sense that there would be a need for something between flash memory, which is already ungodly fast, and DRAM that's further away from the CPU/GPU/NPU/XYZPU. Imagine having 16GB of "really fast" memory, but having 128GB of "still pretty fast" memory to support it? If the caching strategy is sound, performance is good, and it's as good as having 144GB of memory, then that sounds pretty damn sexy if you ask me.

With that said though, as the owner of a top of the line MacBook Pro that's expensive as a kidney, I know that Apple will absolutely make you pay through the nose for it.

So, like Socket Super 7 in the late '90s that had on-board cache, Apple is now doing it?
The Apple G3 and G4 had an off die cache on the same package as the CPU. It's nothing new. It just makes sense to have cache closer to the actual cores using it for the sake of latency. It depends on where in the memory hierarchy you need improvement.

Edit: I mean, what does this look like? I used one of these when StarCraft was brand new. This isn't new tech. We just have better tech to use it with.
1611795654388.png
 
Joined
Dec 26, 2006
Messages
3,901 (0.59/day)
Location
Northern Ontario Canada
Processor Ryzen 5700x
Motherboard Gigabyte X570S Aero G R1.1 BiosF5g
Cooling Noctua NH-C12P SE14 w/ NF-A15 HS-PWM Fan 1500rpm
Memory Micron DDR4-3200 2x32GB D.S. D.R. (CT2K32G4DFD832A)
Video Card(s) AMD RX 6800 - Asus Tuf
Storage Kingston KC3000 1TB & 2TB & 4TB Corsair MP600 Pro LPX
Display(s) LG 27UL550-W (27" 4k)
Case Be Quiet Pure Base 600 (no window)
Audio Device(s) Realtek ALC1220-VB
Power Supply SuperFlower Leadex V Gold Pro 850W ATX Ver2.52
Mouse Mionix Naos Pro
Keyboard Corsair Strafe with browns
Software W10 22H2 Pro x64
Looks like Rambus is at it again with the patents...............oh wait ;)
 

bug

Joined
May 22, 2015
Messages
13,948 (3.95/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
This feels like just another level in the memory hierarchy. I had always felt that the CPU needed something between the last level of on-die CPU cache and system memory. HBM seemed like it would be the best option for this kind of thing because of its memory density and fairly low power usage and the on-package memory that Apple uses seems to be a lot like that. So it makes a boatload of sense that there would be a need for something between flash memory, which is already ungodly fast, and DRAM that's further away from the CPU/GPU/NPU/XYZPU. Imagine having 16GB of "really fast" memory, but having 128GB of "still pretty fast" memory to support it? If the caching strategy is sound, performance is good, and it's as good as having 144GB of memory, then that sounds pretty damn sexy if you ask me.

With that said though, as the owner of a top of the line MacBook Pro that's expensive as a kidney, I know that Apple will absolutely make you pay through the nose for it.
Caches need to be fast (low-latency) and they are power hungry because of that. HBM is kinda the opposite of that.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,173 (2.78/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Caches need to be fast (low-latency) and they are power hungry because of that. HBM is kinda the opposite of that.
Latency gets worse the further away from the compute cores you go when it comes to the memory hierarchy. I'm not suggesting another level of external cache because SRAM is expensive and runs hot. I'm suggesting that this is adding a level to the memory hierarchy between cache and system memory. It doesn't need to have latency better than cache, just better than system memory while maintaining high bandwidth.
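To put rough numbers on the "extra level between cache and system memory" idea, here is a small sketch of an average-latency calculation. The function name `average_latency_ns`, the hit rate, and both latencies are made-up placeholders, not measurements of any Apple or Intel part, and the model assumes each access is charged only the latency of the tier that finally serves it.

```python
def average_latency_ns(tiers):
    """tiers: list of (hit_rate, latency_ns), ordered from nearest to farthest.

    Each access is charged the latency of the tier that finally serves it
    (a simplifying assumption). The last tier must have hit_rate 1.0 so
    that every access is served somewhere.
    """
    remaining = 1.0   # fraction of accesses not yet served by a nearer tier
    total = 0.0
    for hit_rate, latency_ns in tiers:
        served = remaining * hit_rate
        total += served * latency_ns
        remaining -= served
    if remaining > 1e-9:
        raise ValueError("last tier must serve all remaining accesses")
    return total


# Hypothetical: on-package DRAM at 60 ns catching 90% of accesses,
# system memory at 100 ns serving the rest.
with_tier = average_latency_ns([(0.9, 60.0), (1.0, 100.0)])
without_tier = average_latency_ns([(1.0, 100.0)])
print(with_tier, without_tier)
```

Under these placeholder numbers the extra tier only has to beat system memory, not the on-die cache, to pull the average down, which is the point being made in the post above.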
 

bug

Joined
May 22, 2015
Messages
13,948 (3.95/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Latency gets worse the further away from the compute cores you go when it comes to the memory hierarchy. I'm not suggesting another level of external cache because SRAM is expensive and runs hot. I'm suggesting that this is adding a level to the memory hierarchy between cache and system memory. It doesn't need to have latency better than cache, just better than system memory while maintaining high bandwidth.
Of course, but I was addressing this:
HBM seemed like it would be the best option for this kind of thing because of its memory density and fairly low power usage and the on-package memory that Apple uses seems to be a lot like that.
 
Joined
Jan 3, 2021
Messages
3,719 (2.51/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
Caches need to be fast (low-latency) and they are power hungry because of that. HBM is kinda the opposite of that.
Caches need to be low-latency, but how does Apple's "cache DRAM" fit in here? What technology could it be based on if it's expected to offer considerably lower latency than the main DRAM?
 

bug

Joined
May 22, 2015
Messages
13,948 (3.95/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Caches need to be low-latency, but how does Apple's "cache DRAM" fit in here? What technology could it be based on if it's expected to offer considerably lower latency than the main DRAM?
If it's not mentioned in the patent itself, we may never know. It could be either a form of cheaper SRAM or some plain DDR that has lower latency just by virtue of being on the same die.

Edit: Lo and behold, the update says it's just DDR.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,173 (2.78/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Caches need to be low-latency, but how does Apple's "cache DRAM" fit in here? What technology could it be based on if it's expected to offer considerably lower latency than the main DRAM?
Remember that little thing called Crystal Well that Intel made for certain Broadwell chips; an eDRAM cache paired with the more beefy iGPUs? Think that, but with more capacity. While electricity travels pretty fast, a lot of latency is introduced by the length of the circuit. Having something like stacked DRAM really close to the CPU can offer better latency than that of DIMMs that are physically much further away. This is why I think something like HBM or stacked DRAM on the same package as the CPU could act as a level in the memory hierarchy that's between the last level of CPU cache and system memory.
 