Thursday, January 28th 2021

Apple Patents Multi-Level Hybrid Memory Subsystem

Apple has today been granted a patent on a new approach to memory in its System-on-Chip (SoC) designs. With the announcement of the M1 processor, Apple moved away from traditional Intel-supplied chips to a fully custom SoC design called Apple Silicon. The new designs integrate every major component, such as the Arm-based CPU and a custom GPU, on a single chip. Both of these processors need fast memory access, and Apple had to solve the problem of the CPU and the GPU accessing the same pool of memory. The so-called unified memory architecture (UMA) can become a bottleneck because both processors share the bandwidth and the total memory capacity, which could leave one processor starved in some scenarios.

Apple has patented a design that aims to solve this problem by combining high-bandwidth cache DRAM with high-capacity main DRAM. "With two types of DRAM forming the memory system, one of which may be optimized for bandwidth and the other of which may be optimized for capacity, the goals of bandwidth increase and capacity increase may both be realized, in some embodiments," says the patent, "to implement energy efficiency improvements, which may provide a highly energy-efficient memory solution that is also high performance and high bandwidth." The patent was filed back in 2016, which means we could start seeing this technology in future Apple Silicon designs following the M1 chip.
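Conceptually, the two tiers behave like a small, fast memory acting as a cache in front of a larger, slower one. A minimal Python sketch of that idea follows; the capacities, relative access costs, and names are illustrative assumptions, not figures from the patent.

# Toy model of a two-tier DRAM system: a small, fast "cache DRAM"
# in front of a large "main DRAM". All numbers are illustrative.

CACHE_LINES = 4                  # capacity of the bandwidth-optimised tier (in lines)
CACHE_COST, MAIN_COST = 1, 10    # assumed relative access costs

cache = {}                       # line address -> data in the cache DRAM tier
lru = []                         # least-recently-used order of cached addresses
main = {}                        # backing store (capacity-optimised tier)
total_cost = 0

def access(addr):
    """Read one line, filling the cache tier on a miss."""
    global total_cost
    if addr in cache:                      # hit in cache DRAM
        total_cost += CACHE_COST
        lru.remove(addr)
        lru.append(addr)
        return cache[addr]
    total_cost += MAIN_COST                # miss: go to main DRAM
    data = main.get(addr, 0)
    if len(cache) >= CACHE_LINES:          # evict the least-recently-used line
        victim = lru.pop(0)
        main[victim] = cache.pop(victim)
    cache[addr] = data
    lru.append(addr)
    return data

for a in [0, 1, 0, 2, 0, 3, 4, 5, 0]:      # mixed reuse and streaming accesses
    access(a)
print("total relative cost:", total_cost)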

Update 21:14 UTC: We have been contacted by Mr. Kerry Creeron, an attorney with the firm of Banner & Witcoff, who offered additional insight into the patent. Mr. Creeron has provided his personal commentary, which you can find quoted below.

Kerry Creeron, an attorney with the firm of Banner & Witcoff: High-level, the patent covers a memory system having a cache DRAM that is coupled to a main DRAM. The cache DRAM is less dense and has lower energy consumption than the main DRAM. The cache DRAM may also have higher performance. A variety of different layouts are illustrated for connecting the main and cache DRAM ICs, e.g. in FIGS. 8-13. One interesting layout involves through-silicon vias (TSVs) that pass through a stack of main DRAM memory chips.

Theoretically, such layouts might be useful for adding additional slower DRAM to Apple's M1 chip architecture.

Finally, I note that the lead inventor, Biswas, was with PA Semi before Apple acquired it.
Source: Apple Patent

41 Comments on Apple Patents Multi-Level Hybrid Memory Subsystem

#26
lexluthermiester
AleksandarKUpdate 21:14 UTC: We have been contacted by Mr. Kerry Creeron, an attorney with the firm of Banner & Witcoff, who offered additional insight into the patent. Mr. Creeron has provided his personal commentary, which you can find quoted below.
I suspect that this patent, if approved, will in short order be contested and invalidated. Memory schemes like this have been in use for decades, and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.
Posted on Reply
#27
InVasMani
Just put a 3D-stacked DRAM chip underneath the CPU socket, in the center, tied to the socket and CPU directly.
lexluthermiesterI suspect that this patent, if approved, will in short order be contested and invalidated. Memory schemes like this have been in use for decades, and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.
I agree, this type of patent is bad for a level playing field of open competition. We've already seen how this works with RAMBUS; it doesn't benefit consumers.
Posted on Reply
#28
Aquinus
Resident Wat-man
InVasManiJust put a 3D-stacked DRAM chip underneath the CPU socket, in the center, tied to the socket and CPU directly.
DRAM can't be put on the interposer for the CPU if it's under the CPU. It would have to be mounted to the PCB under the interposer and would make for a very complicated PCB design. I wouldn't envy that engineer's task. It's also not that much closer to the CPU compared to putting it next to it like with M1's system memory. There are a lot of cons and not a lot of benefits.
Posted on Reply
#29
InVasMani
That's probably true and valid, but things are shrinking, so it'll get easier to place it there in due time. Also, it's not meant to replace system memory, more to provide a quicker buffer in between. I was speaking about wiring it under the motherboard socket, as opposed to underneath the middle of the CPU's PCB. A PCIe-wired microSD card slot would be neat there as well. Consider that 3D-stacked, with 2 TB of storage and PCIe 4.0 x16 wiring to it. If they could pull that off, it would be rather amazing.
Posted on Reply
#30
RJARRRPCGP
lexluthermiesterI suspect that this patent if approved will, in short order, will be contested and get invalidated. Memory schemes like this have been in use for decades and Apple's very minor "spin" on the concept is not enough for a patent to withstand critical scrutiny. This is Apple literally trying to be a patent troll.
So, Apple is being accused of having another baseless patent, like the rounded-corners one?
Posted on Reply
#31
pjl321
I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
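Very rough arithmetic behind a figure like that (the stack count, per-pin rate, and bus width below are assumptions for illustration, not product specs):

# Back-of-the-envelope HBM bandwidth; all inputs are assumed for illustration.
stacks = 8              # hypothetical number of HBM stacks on the package
bus_width_bits = 1024   # interface width per HBM stack
pin_rate_gbps = 3.6     # assumed per-pin data rate (Gbit/s)

per_stack_gbs = bus_width_bits * pin_rate_gbps / 8   # GB/s per stack
total_tbs = stacks * per_stack_gbs / 1000            # TB/s for the whole package
print(f"~{per_stack_gbs:.0f} GB/s per stack, ~{total_tbs:.1f} TB/s total")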
Posted on Reply
#32
Aquinus
Resident Wat-man
pjl321I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
The complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of memory requests that CPUs tend to make compared to GPUs. With that said though, even a fraction of HBM's speed could keep up with traditional DRAM. So maybe it's not as big of a problem as I think it could be.
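To put rough numbers on that (everything below is an illustrative assumption, not a real HBM or DDR timing), here's a crude effective-bandwidth model in Python:

# Crude effective-bandwidth model: fixed per-request overhead plus transfer time.
PEAK_GBPS = 400          # assumed peak interface bandwidth, GB/s (1 GB/s == 1 byte/ns)
OVERHEAD_NS = 100        # assumed per-request overhead (activation, queuing, ...)

def effective_gbps(request_bytes):
    transfer_ns = request_bytes / PEAK_GBPS          # time spent actually moving data
    return request_bytes / (OVERHEAD_NS + transfer_ns)

for size in (64, 1024, 64 * 1024):                   # cache-line vs. bulk transfers
    print(f"{size:>6} B requests -> ~{effective_gbps(size):6.1f} GB/s effective")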
Posted on Reply
#33
Wirko
pjl321I would have thought moving to stacked HBM would be a much better option and would resolve these issues. Apple could easily have 128 GB or more of unified memory on the package at around 4 TB/s of bandwidth (maybe 8 TB/s), then have a PCIe 5.0 memory interface to ultra-fast SSDs, which could offer terabytes of memory accessed faster than your average DDR4!
You have just described the 2024 Mac Pro. Starting at $15,000. If anyone develops HBM4 with much reduced latency/faster random access by then, that is.
AquinusThe complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of memory requests that CPUs tend to make compared to GPUs. With that said though, even a fraction of HBM's speed could keep up with traditional DRAM. So maybe it's not as big of a problem as I think it could be.
Yeah, you explained it nicely.
On the other hand, Apple could also make "traditional" DRAM wider. The M1 apparently has all the DRAM in two packages and a more powerful processor could have four or eight close to the processor die.
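For a sense of scale, assuming the commonly reported M1 memory configuration (LPDDR4X-4266, 64-bit per package), simply adding packages scales bandwidth roughly like this; the figures are back-of-the-envelope, not official specs:

# Bandwidth from simply widening the DRAM interface.
RATE_MTPS = 4266          # mega-transfers per second, per pin (assumed)
BITS_PER_PACKAGE = 64     # assumed width contributed by each DRAM package

def bandwidth_gbs(packages):
    return packages * BITS_PER_PACKAGE * RATE_MTPS / 8 / 1000   # bits * MT/s -> GB/s

for packages in (2, 4, 8):    # M1 today, then the "four or eight" idea
    print(f"{packages} packages: ~{bandwidth_gbs(packages):.0f} GB/s")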
Posted on Reply
#34
Aquinus
Resident Wat-man
WirkoYou have just described the 2024 Mac Pro. Starting at $15,000. If anyone develops HBM4 with much reduced latency/faster random access by then, that is.


Yeah, you explained it nicely.
On the other hand, Apple could also make "traditional" DRAM wider. The M1 apparently has all the DRAM in two packages and a more powerful processor could have four or eight close to the processor die.
HBM is efficient because the transistors aren't switched as fast. You lose that advantage if you try to drive it as fast as traditional DRAM. I don't think people really realize how much more power higher switching frequencies require.
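The usual first-order model is dynamic power ≈ α·C·V²·f, so here's a quick illustration with assumed values (none of these are real DRAM figures):

# First-order dynamic power model: P ≈ alpha * C * V^2 * f.
def dynamic_power(c_farads, volts, freq_hz, alpha=1.0):
    return alpha * c_farads * volts ** 2 * freq_hz

C = 1e-9                                       # assumed switched capacitance (1 nF)
slow_wide = dynamic_power(C, 1.2, 1.0e9)       # HBM-like: lower clock, lower voltage
fast_narrow = dynamic_power(C, 1.35, 2.0e9)    # DDR-like: higher clock, higher voltage
print(f"slow/wide: {slow_wide:.2f} W, fast/narrow: {fast_narrow:.2f} W "
      f"(~{fast_narrow / slow_wide:.1f}x)")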
Posted on Reply
#35
bug
AquinusThe complication with HBM is that it's a slow but wide interface. There is a benefit to fast random access as opposed to relatively slow bulk access, depending on the workload. You would need a big cache and a good caching strategy to compensate for it. You also have to consider that most data you want probably isn't in the same consecutive 1k, 2k, or 4k region that you're reading from or writing to, so while the maximum theoretical bandwidth is really nice, it's really unlikely that you'd saturate it because of the nature of memory requests that CPUs tend to make compared to GPUs. With that said though, even a fraction of HBM's speed could keep up with traditional DRAM. So maybe it's not as big of a problem as I think it could be.
Exactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
Posted on Reply
#36
Wirko
bugExactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
Apple's "Cache DRAM" can only solve that problem if it's a special low-latency type of dynamic RAM. I don't know if anything like that is available, however, Intel's Crystal Well apparently was such a chip, with a latency of ~30 ns in addition to great bandwidth (measured by Anand).
Posted on Reply
#37
Aquinus
Resident Wat-man
bugExactly. Caches solve the latency problem, not the bandwidth problem. HBM is the opposite of that.
Well, cache solves both the bandwidth and latency problem. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problem. It's not great on latency, but that's a problem that can be solved, or at the very least, mitigated.

Let me put it another way. HBM2 is the reason why my MacBook Pro is silent with two 5K displays plugged into it. All the other GDDR models would have the fan whirring away due to the memory being clocked up to drive them. That's heat and power that you can't afford on a mobile device. It's also how they could cram 40 CUs onto the Radeon Pro 5600M and stay within the 50 W power envelope, all while having almost 400 GB/s of max theoretical bandwidth. You can't tell me that's not an advantage.
Posted on Reply
#38
Wirko
AquinusWell, cache solves both the bandwidth and latency problem. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problem.
And then Apple solves the HBM cost problem by putting the retail price somewhere in geosynchronous orbit. Voilà, problems gone.
Posted on Reply
#39
Aquinus
Resident Wat-man
WirkoAnd then Apple solves the HBM cost problem by putting the retail price somewhere in geosynchronous orbit. Voilà, problems gone.
Truth. I sold a kidney to afford my MacBook Pro. :laugh:
Posted on Reply
#40
bug
AquinusWell, cache solves both the bandwidth and latency problem. It doesn't solve the capacity problem. HBM solves the capacity and bandwidth problem.
Ok, that's more accurate than what I said.
AquinusIt's not great on latency, but that's a problem that can be solved, or at the very least, mitigated.
I'm not so sure HBM's latency can be as low as required for use in a cache system.
Latency doesn't seem to move at all. At least that's what happened with DDR.
Posted on Reply
#41
Aquinus
Resident Wat-man
bugI'm not so sure HBM's latency can be as low as required for use in a cache system.
Latency doesn't seem to move at all. At least that's what happened with DDR.
Well, HBM does have a latency penalty, but it makes up for that through its ability to burst a lot of data, and because it's split into several channels, you can actually queue up a lot of memory requests and get stuff back rapid-fire. So while there is overhead involved, it might not actually be that bad depending on how much data you need to pull at once. Think about it: AMD beefed up the size of its last level of cache with the latest Zen chips. Why would they do that? The answer is simple: an off-die I/O chiplet introduces latency, and you need a way to buffer that latency. Depending on the caching strategy, that last level of cache might get a ton of hits, and the more hits you get, the more insulated you are from the latency cost (see the rough numbers sketched below).

You also have to consider what Apple is doing. This level in the memory hierarchy has to be able to support a GPU and AI circuitry as well. HBM is definitely well suited to those sorts of tasks, so all in all, it's probably a wash when it comes to latency. The real advantage comes from the memory bandwidth with relatively low power consumption and a high memory density.
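A rough average-memory-access-time (AMAT) sketch shows how quickly a high hit rate insulates you from the slower tier; the latencies and hit rates below are made-up illustrative values:

# Average memory access time with a last-level cache in front of slower memory.
def amat_ns(hit_rate, cache_ns, miss_ns):
    return hit_rate * cache_ns + (1 - hit_rate) * miss_ns

CACHE_NS = 12     # assumed last-level-cache latency
MISS_NS = 120     # assumed latency of the slower tier behind it

for hit_rate in (0.50, 0.90, 0.98):
    print(f"hit rate {hit_rate:.0%}: average ~{amat_ns(hit_rate, CACHE_NS, MISS_NS):.0f} ns")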
Posted on Reply