Tuesday, January 5th 2021

Linus Torvalds Calls Out Intel for ECC Memory Market Stagnation

Linus Torvalds, the creator of the Linux kernel and the git version-control system, has posted another of his famous rants, this time about the lack of ECC memory in consumer devices. Mr. Torvalds shared his views on the Linux kernel mailing list, where he usually comments on the development of the kernel. ECC (Error Correcting Code) memory is a special kind of DRAM that detects and corrects errors that occur inside the memory itself, where a bit can flip and corrupt the stored data, producing false results. ECC corrects these small errors before they turn into bigger problems. According to Mr. Torvalds, it is a technology that needs to be implemented everywhere, not just in the server space as Intel imagines.
Linus Torvalds: Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting do do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f***** happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners".
He continues, stating the following:
The "modern DRAM is so reliable that it doesn't need ECC" was always a bedtime story for children that had been dropped on their heads a bit too many times. Yes, I'm pissed off about it. You can find me complaining about this literally for decades now. I don't want to say "I was right". I want this fixed, and I want ECC. And AMD did it. Intel didn't.
Source: Phoronix

54 Comments on Linus Torvalds Calls Out Intel for ECC Memory Market Stagnation

#27
trparky
efikkan: There is actually a fairly low chance of a single error to cause applications or the OS to crash, most memory errors will only cause data corruption. This is why ECC is often a "requirement" for file servers, it's more about data integrity than uptime.
There you go, I mentioned reading and writing data from system and/or data storage devices.
Posted on Reply
#28
RandallFlagg
Everyone clamoring and getting behind this kind of thing needs to keep in mind that this would affect both performance and price.

So yes there are some "tests" where same speed rated ECC vs non ECC show only a 2% or so performance difference.

The problem comes in with high speed / overclocking of ECC RAM. I don't see anything over 3200 on newegg for example, and that is CL22. I can't find anyone successfully clocking up over DDR4-3200. And, the cheapest ECC DDR4-3200 CL22 is about twice as expensive as non-ecc DDR4-3200 CL16.

In other words, you will pay for it both coming and going.
Posted on Reply
#29
Frick
Fishfaced Nincompoop
RandallFlagg: Everyone clamoring and getting behind this kind of thing needs to keep in mind that this would affect both performance and price.

So yes there are some "tests" where same speed rated ECC vs non ECC show only a 2% or so performance difference.

The problem comes in with high speed / overclocking of ECC RAM. I don't see anything over 3200 on newegg for example, and that is CL22. I can't find anyone successfully clocking up over DDR4-3200. And, the cheapest ECC DDR4-3200 CL22 is about twice as expensive as non-ecc DDR4-3200 CL16.

In other words, you will pay for it both coming and going.
Is that because it's harder to make faster ECC modules or is it because the market is segmented that way?
Posted on Reply
#31
RandallFlagg
Frick: Is that because it's harder to make faster ECC modules or is it because the market is segmented that way?
ECC uses a very old method of detection using a parity bit. They may call it something different but it's essentially the same. Then it has to run an algorithm - this would be in hardware for ECC - to detect that a bit is wrong.

So for starters, you need more storage to contain the parity bit. See image below, ECC vs non ECC, there's an extra memory chip.

Next, you need that extra circuitry.

From a really high level, you're adding components (extra memory to hold the parity) and an extra process (checking parity, and if it fails - what to do, what can be done, fix it if it's small enough, etc).

So none of that is free, it will exact a toll in additional components, circuitry, and complexity.
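
For reference, a single parity bit can only detect an odd number of flipped bits; locating and correcting a flipped bit takes several check bits. The Python sketch below illustrates that difference with a textbook Hamming(7,4) code. It is an illustration only, not how any particular memory controller is built: ECC DIMMs typically protect 64 data bits with 8 check bits, and the function names here are made up for the example.

```python
def parity(bits):
    """Even-parity bit over a list of 0/1 values (1 if the count of ones is odd)."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as positions 1..7:
    p1 p2 d1 p4 d2 d3 d4 (check bits sit at the power-of-two positions)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_syndrome(word):
    """Return 0 for a clean codeword, else the 1-based position of a single flipped bit."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s4 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 + 2 * s2 + 4 * s4


def hamming74_correct(word):
    """Return (corrected codeword, position that was fixed or 0 if nothing was fixed)."""
    fixed = list(word)
    position = hamming74_syndrome(fixed)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


def hamming74_data(word):
    """Extract the 4 data bits from a codeword."""
    return [word[2], word[4], word[5], word[6]]


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    word = hamming74_encode(data)
    damaged = list(word)
    damaged[4] ^= 1  # simulate a single bit flip at position 5

    # Plain parity notices that something changed but cannot say where.
    print("parity mismatch:", parity(damaged) != parity(word))

    # The Hamming syndrome points at the flipped position, so it can be undone.
    repaired, position = hamming74_correct(damaged)
    print("flipped position:", position)
    print("data recovered:", hamming74_data(repaired) == data)
```

The cost described above shows up directly in the sketch: three extra stored bits per four data bits here, plus the XOR logic that has to run on every read. Wider codes amortise the check bits far better, which is why the usual 64+8 layout only adds one chip in nine.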

Posted on Reply
#32
Frick
Fishfaced Nincompoop
RandallFlagg: ECC uses a very old method of detection using a parity bit. They may call it something different but it's essentially the same. Then it has to run an algorithm - this would be in hardware for ECC - to detect that a bit is wrong.

So for starters, you need more storage to contain the parity bit. See image below, ECC vs non ECC, there's an extra memory chip.

Next, you need that extra circuitry.

From a really high level, you're adding components (extra memory to hold the parity) and an extra process (checking parity, and if it fails - what to do, what can be done, fix it if it's small enough, etc).

So none of that is free, it will exact a toll in additional components, circuitry, and complexity.

I mean, is there a technical reason why there isn't faster ECC RAM, or is it because it wouldn't be an interesting product? I know it has extra bits and bobs so you will lose a bit of performance.
Posted on Reply
#33
Aquinus
Resident Wat-man
He's not wrong. Hardware ECC with a dedicated ECC chip is the way to go. You can do it in software or without the extra parity chip, but it's not as good and it costs you memory whereas ECC memory factors that chip into the device's capacity. ECC should be an industry standard for all devices.
RandallFlagg: ECC uses a very old method of detection using a parity bit. They may call it something different but it's essentially the same. Then it has to run an algorithm - this would be in hardware for ECC - to detect that a bit is wrong.

So for starters, you need more storage to contain the parity bit. See image below, ECC vs non ECC, there's an extra memory chip.

Next, you need that extra circuitry.

From a really high level, you're adding components (extra memory to hold the parity) and an extra process (checking parity, and if it fails - what to do, what can be done, fix it if it's small enough, etc).

So none of that is free, it will exact a toll in additional components, circuitry, and complexity.

It's an extra DRAM chip, all the ECC processing is done in the CPU. It's not really that much more. It's registered/buffered memory that has extra circuitry that's different than a run of the mill DRAM chip. What if that extra bit was addressable if you turned ECC off? That'd be really nice.
Posted on Reply
#34
RandallFlagg
Aquinus: He's not wrong. Hardware ECC with a dedicated ECC chip is the way to go. You can do it in software or without the extra parity chip, but it's not as good and it costs you memory whereas ECC memory factors that chip into the device's capacity. ECC should be an industry standard for all devices.


It's an extra DRAM chip, all the ECC processing is done in the CPU. It's not really that much more. It's registered/buffered memory that has extra circuitry that's different than a run of the mill DRAM chip. What if that extra bit was addressable if you turned ECC off? That'd be really nice.
Dude, it's very clear from your post that you don't know what you're talking about. You contradicted yourself like 3 or more times.

Do it in software, without a parity chip? So you're going to put parity in main system RAM, decreasing the amount of available memory, and taking a CPU hit and an absolutely massive latency hit? Nobody does that, and that's why.

"Factor that chip into the device's capacity" - so you're going to say you have 16GB of RAM, when you really only have 14.5 GB of RAM because 1.5GB of it is for parity? In other words you're going to fudge the numbers to make it look like you didn't lose anything?

Yeah but no.
Posted on Reply
#35
DeathtoGnomes
It's like he saves up these rants until his stress levels top out and then bursts out; it's a blaze of glory alright.. :roll:
Posted on Reply
#36
Solaris17
Super Dainty Moderator


ECC was saving my ass just the other day on one of my data servers. Is it rare I need it? yes. But it's totally worth it when it works as intended.

I always run ECC in production.
Posted on Reply
#37
RoutedScripter
RandallFlagg: ECC uses a very old method of detection using a parity bit. They may call it something different but it's essentially the same. Then it has to run an algorithm - this would be in hardware for ECC - to detect that a bit is wrong.

So for starters, you need more storage to contain the parity bit. See image below, ECC vs non ECC, there's an extra memory chip.

Next, you need that extra circuitry.

From a really high level, you're adding components (extra memory to hold the parity) and an extra process (checking parity, and if it fails - what to do, what can be done, fix it if it's small enough, etc).

So none of that is free, it will exact a toll in additional components, circuitry, and complexity.

Yes, this is the laws of physics, that's how much ECC costs, that's how much it takes, so be it.

Everything in life is a trade-off, so this really isn't an argument. I want ECC too, I don't care what it takes. Sure it's a home PC, sure I play games too, but it's mostly a workstation, and I have archives and data as well. I don't want to lose it either, no matter how "unimportant" a "home PC" is to them. I frankly don't give a rat's ass about some random company's data safety, I CARE ABOUT MY DATA SAFETY. I could be working on an important project and have the PC crash in the middle; even if you recover from an autosave 15 minutes old, it still costs hours or days of lost time trying to remember where you left off and redoing the lost work, etc ... the stupid dumb anti-ECC gamerz-channelz think it's "no big deal" if a PC has a BSOD or an app crashes. Well, if you're a dumb gamer doing absolutely nothing productive/educational/helpful with a PC, then yes, only in that case it's not a big deal, but not everybody wants to be a dumb gamer for the rest of their life.

What about speed runs, tournaments, and competitive gaming? They effing need ECC too. It'll happen one day, tho we're kinda lucky it doesn't, ... or wait, a desync usually gets blamed on the network (ISP, congestion, some server, etc) but it could in reality be caused by corrupt memory, and if you don't know it was the memory that did it, if you don't even have any way to monitor that, that's a problem right there!

In this artificial economic system, things that are costly are usually costly because they're not popular or because someone just doesn't care enough.

Easy fix: make it popular, don't hike the price, done. The world's elites and corporations certainly have that power. If they really care so much about some asteroid in the middle of nowhere they want to send rockets to, they can make the frigging RAM do ECC for crying out loud, couldn't they :p
Posted on Reply
#38
RandallFlagg
RoutedScripter: Yes, this is the laws of physics, that's how much ECC costs, that's how much it takes, so be it.

Everything in life is a trade-off, so this really isn't an argument. I want ECC too, I don't care what it takes. Sure it's a home PC, sure I play games too, but it's mostly a workstation, and I have archives and data as well. I don't want to lose it either, no matter how "unimportant" a "home PC" is to them. I frankly don't give a rat's ass about some random company's data safety, I CARE ABOUT MY DATA SAFETY. I could be working on an important project and have the PC crash in the middle; even if you recover from an autosave 15 minutes old, it still costs hours or days of lost time trying to remember where you left off and redoing the lost work, etc ... the stupid dumb anti-ECC gamerz-channelz think it's "no big deal" if a PC has a BSOD or an app crashes. Well, if you're a dumb gamer doing literally absolutely nothing productive with a PC, then yes, only in that case it's not a big deal, but not everyone wants to be a dumb gamer for the rest of their life.

In this artificial economic system, things that are costly are usually costly because they're not popular or because someone just doesn't care enough.

Easy fix: make it popular, don't hike the price, done. The world's elites and corporations certainly have that power. If they really care so much about some asteroid in the middle of nowhere they want to send rockets to, they can make the frigging RAM do ECC for crying out loud, couldn't they :p
Then go buy a HEDT LGA 2066 or a Threadripper and put ECC in it. They aren't that much more expensive - maybe an extra $200-$300 for the motherboard and the CPU. And double for the RAM. If it's worth it to you, fine.

But why would everyone else need to pay more to suit your desires? Just pay for it yourself and be done with it.
Posted on Reply
#39
lexluthermiester
I don't want to say "I was right". I want this fixed, and I want ECC. And AMD did it. Intel didn't.
I have to agree with Linus in this point. Intel has dropped the ball in this area of technology, a technology sector that needs advancement. ECC should be standard on all memory modules made at this point in time and it isn't.
Posted on Reply
#40
L'Eliminateur
londiste: Doesn't DDR5 have ECC?

Btw, AMD does not officially support ECC on desktop either. What they do better is that they are not preventing its use.
Wirko: Kind of. From what I can gather and understand, there are 8 additional bits of ECC memory for each 32 bits of user memory, and they are used to detect flipped bits. Not sure if this is mandatory or optional. This is how manufacturers counter the increase in bit errors due to process shrinking, and Linus' words "because they finally owned up to the fact that they absolutely have to" refer to just that.

ECC on the memory bus, however, is NOT mandatory even in DDR5, so system without it will not be able to detect errors that occur when the data is moving.
DDR5 has mandatory on-die ECC, but that's for the memory array and it's transparent to the user (AFAIK there's no reporting of said ECC activity whatsoever; I'd have to read the DDR5 command list to know, but it's not public).
As Wirko said, bus ECC (end-to-end) is still optional, thus "ECC modules" will still be a thing. Those need TWO extra memory chips for the parity data (hence why they use a 2x40-bit bus instead of a 2x32-bit one).
Why two? Because DDR5 modules are essentially two-in-one and have 2 independent 32-bit channels, thus you need an extra parity chip for EACH half. BUT it should provide much better ECC support (I guess it will support SDDC across all sizes, not just limited to x4 devices like now).

I expect DDR5 ECC modules to be quite a bit more expensive per capacity than DDR4 for a long time (2 extra chips PLUS VR circuitry); maybe someday they'll be lower.
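
The bus widths mentioned above can be turned into a quick overhead comparison. This is only back-of-the-envelope arithmetic in Python: the 2x32 and 2x40 figures come from the post, while the 64/72-bit DDR4 layout and the names `ecc_overhead` and `LAYOUTS` are assumptions added for the comparison.

```python
def ecc_overhead(data_bits, total_bits):
    """Return (check_bits, overhead as a fraction of the data width)."""
    check_bits = total_bits - data_bits
    return check_bits, check_bits / data_bits


# (data bits, total bus bits) per module; the DDR4 row is an assumed baseline.
LAYOUTS = {
    "DDR4 ECC DIMM, 1 x 72-bit": (64, 72),
    "DDR5 ECC DIMM, 2 x 40-bit": (2 * 32, 2 * 40),
}

if __name__ == "__main__":
    for name, (data_bits, total_bits) in LAYOUTS.items():
        check_bits, overhead = ecc_overhead(data_bits, total_bits)
        print(f"{name}: {check_bits} check bits, {overhead:.1%} extra width")
```

Under these assumptions the extra width doubles from 8 to 16 bits per module, which lines up with the expectation that DDR5 ECC modules carry a higher per-capacity premium.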
Posted on Reply
#41
Xajel
RandallFlagg: I don't see anything over 3200 on newegg for example, and that is CL22. I can't find anyone successfully clocking up over DDR4-3200. And, the cheapest ECC DDR4-3200 CL22 is about twice as expensive as non-ecc DDR4-3200 CL16.

In other words, you will pay for it both coming and going.
This is because all ECC RAM is tailored towards the server market, which requires strict standards compliance. JEDEC only certified DDR4 up to 3200, so that's the maximum you will see for this purpose (server-grade RAM), not to mention that the strict testing and requirements make them use loose timings to guarantee everything.

If consumer ECC RAM becomes a thing, you won't see it go as fast as current non-ECC RAM, but it will still be way faster than any server-grade RAM, and it will also cost less than server RAM due to much less testing and verification. Also expect high-end consumer ECC RAM to cost more than similarly specced non-ECC RAM, but high-end ECC and non-ECC might end up similar. While consumer ECC RAM requires more testing (and a higher cost per module), non-ECC RAM is also highly binned to be able to reach those clocks and timings, and reaching them will most probably make it incapable of meeting consumer ECC standards. That's why you won't see consumer ECC RAM reach the high clocks and tight timings of high-end non-ECC.

This is only if consumer ECC RAM is a thing, I hope AMD pushes more for it.
Posted on Reply
#42
ThrashZone
RandallFlagg: Everyone clamoring and getting behind this kind of thing needs to keep in mind that this would affect both performance and price.

So yes there are some "tests" where same speed rated ECC vs non ECC show only a 2% or so performance difference.

The problem comes in with high speed / overclocking of ECC RAM. I don't see anything over 3200 on newegg for example, and that is CL22. I can't find anyone successfully clocking up over DDR4-3200. And, the cheapest ECC DDR4-3200 CL22 is about twice as expensive as non-ecc DDR4-3200 CL16.

In other words, you will pay for it both coming and going.
Hi,
Yeah 3200c22 or even c16 is no prize, oc ability is not good.
AMD is getting better with memory oc but still same old story only using 2 sticks and 4 sticks still way more handicapped than Intel systems are.
Posted on Reply
#43
trparky
Solaris17: ECC was saving my ass just the other day on one of my data servers. Is it rare I need it? yes. But it's totally worth it when it works as intended.

I always run ECC in production.
Wait. What? Are each of those table entries an indication where data was corrupted? Holy crap! :twitch:

What is your data server doing that it is encountering that many memory errors?
Posted on Reply
#44
CheapMeat
RandallFlagg: Then go buy a HEDT LGA 2066 or a Threadripper and put ECC in it. They aren't that much more expensive - maybe an extra $200-$300 for the motherboard and the CPU. And double for the RAM. If it's worth it to you, fine.

But why would everyone else need to pay more to suit your desires? Just pay for it yourself and be done with it.
That's the damn point! It wouldn't be so much more if it was just standard! What happens when it's standard??? Prices level out because everyone does it now. The chain becomes the same for all. People like you keep this BS going. People like you keep markets segmented.
Posted on Reply
#45
Solaris17
Super Dainty Moderator
trparky: Wait. What? Are each of those table entries an indication where data was corrupted? Holy crap! :twitch:

What is your data server doing that it is encountering that many memory errors?
nand failed.
Posted on Reply
#46
trparky
Solaris17: nand failed.
Oops.
Posted on Reply
#47
efikkan
CheapMeat: That's the damn point! It wouldn't be so much more if it was just standard! What happens when it's standard??? Prices level out because everyone does it now. The chain becomes the same for all. People like you keep this BS going. People like you keep markets segmented.
ECC (non-registered) will always need one extra memory chip per rank and a more advanced controller, so we should expect it to cost about 12.5% more.
Let's take an example;
Kingston Server Premier 16GB 3200 MHz dual rank (KSM32ED8/16ME or KSM32ED8/16HD)
vs.
Kingston ValueRAM 16GB 3200 MHz dual rank (KVR32N22D8/16)
If we're using Newegg as a reference, it's $91 vs $79, so a 15% premium. Pretty fair, don't you think?
(In my country it's actually a 4% premium right now…)

I honestly don't think the pricing of ECC (non-registered) memory is the problem. Many of you pay much more for overclocked memory.

For Threadripper ECC support is easy, it's already supported by the platform, so it's just the minor extra cost of ECC memory to account for.

In the Intel camp it's a little more complex, they have three tiers of workstation platforms;
Xeon W 1200 series: (LGA1200)
CPUs have some premium prices, example:
Xeon W-1290P $539 vs. i9-10900K $488 (+10%)
Xeon W-1270P $428 vs. i7-10700K $374 (+14%)
While motherboards are pretty much in line with premium consumer boards;
Supermicro X12SAE ($349 on Newegg)
ASUS Pro WS W480-ACE ($284 on Newegg)
So these prices are pretty acceptable considering these are workstation-grade parts.
(PS: Not accounting for the fact that consumer counterparts are more often on discounts)

Xeon W 2200 series: (LGA2066)
CPUs have a more substantial premium:
Xeon W-2255 $778 vs. i9-10900X $590 (+32%)
Xeon W-2295 $1333 vs. i9-10980XE $979 (+36%)
While motherboards are decently priced: (actually cheaper than some X299 motherboards)
Supermicro MBD-X11SRA-F-O ($340 on Newegg)

Xeon W 3200 series: (LGA3647)
These CPUs are very pricey:
Xeon W-3235 $1398
Xeon W-3275 $4449
But have some nice motherboards;
Supermicro X11SPA-T ($528 on Newegg)
(Take a moment to admire this motherboard.)
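
The percentages quoted in this post can be recomputed with a few lines of Python. The prices are the ones listed above; the helper name `premium_pct` is made up for the example, and the 8-data-chips-plus-1 layout behind the 12.5% figure is an assumption about a typical x8 rank.

```python
def premium_pct(ecc_price, non_ecc_price):
    """Price premium of the ECC-capable part over its counterpart, in percent."""
    return (ecc_price - non_ecc_price) / non_ecc_price * 100


# (description, ECC-capable price in USD, consumer counterpart price in USD)
PAIRS = [
    ("Kingston Server Premier vs ValueRAM, 16GB 3200", 91, 79),
    ("Xeon W-1290P vs i9-10900K", 539, 488),
    ("Xeon W-1270P vs i7-10700K", 428, 374),
    ("Xeon W-2255 vs i9-10900X", 778, 590),
    ("Xeon W-2295 vs i9-10980XE", 1333, 979),
]

# Assumed layout: one extra chip on top of 8 data chips per rank.
CHIP_OVERHEAD_PCT = 1 / 8 * 100

if __name__ == "__main__":
    print(f"extra chip overhead: {CHIP_OVERHEAD_PCT:.1f}%")
    for name, ecc_price, non_ecc_price in PAIRS:
        print(f"{name}: +{premium_pct(ecc_price, non_ecc_price):.1f}%")
```

Rounded to whole percent, the results match the figures given in the post, and the memory premium sits only slightly above the raw chip-count overhead.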
Posted on Reply
#48
ThrashZone
Hi,
I'm all for better more reliable xmp profiles.
Posted on Reply
#49
First Strike
tygrus: I probably see a Bit flip 5 times per year based on system freeze/crash in my PC with 16GB left on 24/7.
A bit flip is only seen when it happens somewhere visible. Most of the time a bit flip doesn't cause an immediate crash, thanks to software engineering, but that doesn't mean it won't cause problems.

In a non-ECC environment, you may find that a heavy floating-point calculation program produces slightly different results across repeated runs; guess why.
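
To make the point concrete, here is a small Python sketch of what one flipped bit does to a 64-bit float. It only simulates the flip in software; `flip_bit` is a hypothetical helper, and run-to-run differences in real programs can have other causes as well, such as the order in which threads add up partial results.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped.

    Bit 0 is the lowest mantissa bit, bits 52..62 hold the exponent and
    bit 63 is the sign.
    """
    if not 0 <= bit <= 63:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    x = 1.0
    for bit in (0, 30, 52, 63):
        y = flip_bit(x, bit)
        print(f"bit {bit:2d}: {x!r} -> {y!r} (difference {y - x:.3e})")
```

A flip in the low mantissa bits changes the value so little that nothing crashes and nobody notices, while a flip in the exponent or sign changes it drastically. Both are silent without ECC.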
Posted on Reply
#50
L'Eliminateur
trparky: Wait. What? Are each of those table entries an indication where data was corrupted? Holy crap! :twitch:

What is your data server doing that it is encountering that many memory errors?
The server does not really need to be doing "anything" for memory errors to pop up; a single write/read to a bad address will trigger the ECC.
You could be "idling" on the desktop and the errors will trigger all the same. It all depends on which address is being used (maybe it's at the beginning of the array, maybe it's at the end and you don't "see" it until the RAM gets full).
Those table entries are indeed data corruption events SOLVED by the ECC (so they're transparent to the OS).
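
On Linux, corrected and uncorrected error counts are exposed by the kernel's EDAC subsystem under sysfs, so they can be checked without special tooling. The Python sketch below reads those counters. It assumes an EDAC driver is loaded for the memory controller, the function name `read_edac_counts` is made up for the example, and it is not a claim about which tool produced the table discussed above.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}} from EDAC sysfs.

    An empty dict means no memory controller is registered with EDAC,
    for example because ECC is not active or no driver is loaded.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for controller in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text())
            uncorrected = int((controller / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[controller.name] = {
            "corrected": corrected,
            "uncorrected": uncorrected,
        }
    return counts


if __name__ == "__main__":
    results = read_edac_counts()
    if not results:
        print("No EDAC memory controllers found.")
    for name, values in results.items():
        print(f"{name}: {values['corrected']} corrected, "
              f"{values['uncorrected']} uncorrected")
```

A steadily growing corrected count on one controller is the kind of signal that would point at a failing module long before anything crashes.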
Posted on Reply