Tuesday, January 5th 2021

Linus Torvalds Calls Out Intel for ECC Memory Market Stagnation
Linus Torvalds, the creator of the Linux kernel and the Git version-control system, has posted another one of his famous rants, this time about the lack of ECC memory in consumer devices. Mr. Torvalds posted his views on the Linux kernel mailing list, where he usually comments on the development of the kernel. ECC, or Error Correcting Code, memory is a special kind of DRAM that stores extra check bits alongside the data, so that a corrupted bit can be detected and corrected before it changes the stored data and produces false results. By fixing these small errors, ECC avoids bigger problems. According to Mr. Torvalds, it is a technology that needs to be implemented everywhere, not just in the server space as Intel imagines.
Source:
Phoronix
Linus Torvalds: Intel has been instrumental in killing the whole ECC industry with it's horribly bad market segmentation... Intel has been detrimental to the whole industry and to users because of their bad and misguided policies wrt ECC. Seriously...The arguments against ECC were always complete and utter garbage... Now even the memory manufacturers are starting do do ECC internally because they finally owned up to the fact that they absolutely have to. And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards - let me once again point to row-hammer about how those problems have existed for several generations already, but these f***** happily sold broken hardware to consumers and claimed it was an "attack", when it always was "we're cutting corners".
He continues, stating the following:
The "modern DRAM is so reliable that it doesn't need ECC" was always a bedtime story for children that had been dropped on their heads a bit too many times. Yes, I'm pissed off about it. You can find me complaining about this literally for decades now. I don't want to say "I was right". I want this fixed, and I want ECC. And AMD did it. Intel didn't.
54 Comments on Linus Torvalds Calls Out Intel for ECC Memory Market Stagnation
So yes, there are some "tests" where same-speed-rated ECC vs. non-ECC shows only a 2% or so performance difference.
The problem comes in with high-speed / overclocked ECC RAM. I don't see anything over 3200 on Newegg, for example, and that is CL22. I can't find anyone successfully clocking ECC above DDR4-3200. And the cheapest ECC DDR4-3200 CL22 is about twice as expensive as non-ECC DDR4-3200 CL16.
In other words, you will pay for it both coming and going.
So for starters, you need more storage to contain the parity bits. Compare an ECC module with a non-ECC one: there's an extra memory chip.
Next, you need that extra circuitry.
From a really high level, you're adding components (extra memory to hold the parity) and an extra process (checking parity, and if it fails - what to do, what can be done, fix it if it's small enough, etc).
So none of that is free, it will exact a toll in additional components, circuitry, and complexity.
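The "extra memory plus an extra check" idea above can be sketched in a few lines. This is a textbook extended Hamming (SECDED) code, which spends 8 check bits on every 64 data bits; treating it as what any particular memory controller actually does is an assumption, and the names `encode` and `decode` are made up for illustration.

```python
def encode(data_bits):
    """Return a SECDED codeword (list of 0/1) for a list of data bits."""
    n_data = len(data_bits)
    r = 0
    while (1 << r) < n_data + r + 1:   # number of Hamming check bits needed
        r += 1
    total = n_data + r + 1             # +1: overall parity bit at index 0
    code = [0] * total
    it = iter(data_bits)
    for pos in range(1, total):
        if pos & (pos - 1):            # not a power of two: data position
            code[pos] = next(it)
    syndrome = 0
    for pos in range(1, total):
        if code[pos]:
            syndrome ^= pos
    for i in range(r):                 # set check bits so the syndrome becomes 0
        if syndrome & (1 << i):
            code[1 << i] = 1
    code[0] = sum(code[1:]) % 2        # make the parity of the whole word even
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome < len(code):
        code[syndrome] ^= 1            # single flip: the syndrome is its index
        status = "corrected"
    else:
        return "uncorrectable", None   # e.g. two flipped bits: detect only
    data = [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]
    return status, data


word = [i % 2 for i in range(64)]
stored = encode(word)                  # 72 bits stored for 64 bits of data
stored[17] ^= 1                        # simulate one flipped bit
print(len(stored), decode(stored)[0])
```

The 72 stored bits per 64 data bits are where the extra chip comes from, and the decode step is the extra process that runs on every read.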
Do it in software, without a parity chip? So you're going to put parity in main system RAM, decreasing the amount of available memory, and taking a CPU hit and an absolutely massive latency hit? Nobody does that, and that's why.
"Factor that chip into the device's capacity" - so you're going to say you have 16GB of RAM, when you really only have 14.5 GB of RAM because 1.5GB of it is for parity? In other words you're going to fudge the numbers to make it look like you didn't lose anything?
Yeah but no.
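For a rough sense of the numbers in the 16GB example above, here is a small calculation. It assumes the common layout of 8 check bits per 64 data bits, with the check bits carved out of the same 16GB; the function name is hypothetical and real in-band schemes may use a different ratio.

```python
def usable_capacity_gb(raw_gb, data_bits=64, check_bits=8):
    """Capacity left for data when check bits share the same physical memory."""
    return raw_gb * data_bits / (data_bits + check_bits)


raw = 16
usable = usable_capacity_gb(raw)
print(f"usable: {usable:.2f} GB, used for check bits: {raw - usable:.2f} GB")
```

Under that assumption the loss is closer to 1.8GB than 1.5GB, so the 14.5GB figure is if anything a little generous.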
ECC was saving my ass just the other day on one of my data servers. Is it rare I need it? Yes. But it's totally worth it when it works as intended.
I always run ECC in production.
Everything in life is a trade-off, so this really isn't an argument. I want ECC too, I don't care what it takes. Sure it's a home PC, sure I play games too, but it's mostly a workstation, and I have archives and data as well, and I don't want to lose them either, no matter how "unimportant" a "home PC" is to them. I frankly don't give a rat's ass about some random company's data safety, I CARE ABOUT MY DATA SAFETY. I could be working on an important project and have the PC crash in the middle; even if you recover from some 15-minute autosave point, it still costs multiple hours or days of lost time trying to remember where you left off and redo the lost work, etc ... the stupid dumb anti-ECC gamerz-channelz think it's "no big deal" if a PC has a BSOD or an app crashes. Well, if you're a dumb gamer doing absolutely nothing productive/educational/helpful with a PC, then yes, only in that case it's not a big deal, but not everybody wants to be a dumb gamer for the rest of their life.
What about speed runs, tournaments, and competitive gaming? They effing need ECC too. It'll happen one day, tho we're kinda lucky it doesn't ... or wait, a desync usually gets blamed on the network (ISP, congestion, some server, etc), but it could in reality be caused by corrupt memory, and if you don't know it was the memory that did it, if you don't even have any way to monitor that, that's a problem right there!
In this artificial economic system things that are costly are usually because they're not popular or because someone just doesn't care enough.
Easy fix: make it popular, don't hike the price, done. The world's elites and corporations certainly have that power. If they care so much about some asteroid in the middle of nowhere that they want to send rockets to it, they can make the frigging RAM do ECC for crying out loud, couldn't they :p
But why would everyone else need to pay more to suit your desires? Just pay for it yourself and be done with it.
As Wirko said, bus ECC (end-to-end) is still optional, thus "ECC modules" will still be a thing, and those need TWO extra memory chips for the parity data (hence why they use a 2x40-bit bus instead of a 2x32-bit one).
Why two? Because DDR5 modules are essentially two-in-one and have 2 independent 32-bit channels, thus you need an extra parity chip for EACH half. BUT it should provide much better ECC support (I guess it will support SDDC across all sizes, not just limited to x4 devices like now).
I expect DDR5 ECC modules to be quite a bit more expensive per capacity than DDR4 for a long time (2 extra chips PLUS VR circuitry). Maybe someday they'll be lower.
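The bus widths above translate directly into how much extra DRAM a module carries. A quick sketch, taking the 2x40 vs 2x32 figures from this comment and assuming the usual 72-bit vs 64-bit layout for DDR4 ECC modules (the helper name is made up):

```python
def ecc_overhead(data_bits, total_bits):
    """Extra bus width (and DRAM) needed for ECC, as a fraction of the data width."""
    return (total_bits - data_bits) / data_bits


ddr4 = ecc_overhead(64, 72)            # one 64-bit channel plus 8 check bits
ddr5 = ecc_overhead(2 * 32, 2 * 40)    # two 32-bit channels, 8 check bits each
print(f"DDR4 ECC: {ddr4:.1%} extra, DDR5 ECC: {ddr5:.1%} extra")
```

Doubling the relative overhead is one reason to expect the per-capacity premium to be higher on DDR5.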
If consumer ECC RAM becomes a thing, you won't see it go as fast as current non-ECC RAM, but it will still be way faster than any server-grade RAM, and it will also cost less than server RAM due to much less testing and verification. Still, expect high-end consumer ECC RAM to cost more than similarly specced non-ECC RAM, though at the high end ECC and non-ECC might be similar in price. While consumer ECC RAM requires more testing (and a higher cost per module), non-ECC RAM is also highly binned to reach those clocks and timings, and reaching them will most probably make it incapable of meeting consumer ECC standards. That's why you won't see consumer ECC RAM reach the high clocks and tight timings of high-end non-ECC.
This is only if consumer ECC RAM is a thing, I hope AMD pushes more for it.
Yeah 3200c22 or even c16 is no prize, oc ability is not good.
AMD is getting better with memory OC, but it's still the same old story: it only works well with 2 sticks, and 4 sticks are still way more handicapped than on Intel systems.
What is your data server doing that it is encountering that many memory errors?
Let's take an example:
Kingston Server Premier 16GB 3200 MHz dual rank (KSM32ED8/16ME or KSM32ED8/16HD)
vs.
Kingston ValueRAM 16GB 3200 MHz dual rank (KVR32N22D8/16)
If we're using Newegg as a reference, it's $91 vs $79, so a 15% premium. Pretty fair, don't you think?
(In my country it's actually a 4% premium right now…)
I honestly don't think the pricing of ECC (non-registered) memory is the problem. Many of you pay much more for overclocked memory.
For Threadripper ECC support is easy, it's already supported by the platform, so it's just the minor extra cost of ECC memory to account for.
In the Intel camp it's a little more complex; they have three tiers of workstation platforms:
Xeon W 1200 series: (LGA1200)
CPUs have some premium prices, example:
Xeon W-1290P $539 vs. i9-10900K $488 (+10%)
Xeon W-1270P $428 vs. i7-10700K $374 (+14%)
While motherboards are pretty much in line with premium consumer boards:
Supermicro X12SAE ($349 on Newegg)
ASUS Pro WS W480-ACE ($284 on Newegg)
So these prices are pretty acceptable considering these are workstation-grade parts.
(PS: Not accounting for the fact that consumer counterparts are more often on discounts)
Xeon W 2200 series: (LGA2066)
CPUs have a more substantial premium:
Xeon W-2255 $778 vs. i9-10900X $590 (+32%)
Xeon W-2295 $1333 vs. i9-10980XE $979 (+36%)
While motherboards are decently priced: (actually cheaper than some X299 motherboards)
Supermicro MBD-X11SRA-F-O ($340 on Newegg)
Xeon W 3200 series: (LGA3647)
These CPUs are very pricey:
Xeon W-3235 $1398
Xeon W-3275 $4449
But have some nice motherboards:
Supermicro X11SPA-T. ($528 on Newegg)
(Take a moment to admire this motherboard.)
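The premiums quoted in this comment can be rechecked with a couple of lines. The prices are the ones listed above (they were a snapshot at the time of posting); the helper name and the dictionary are just for illustration.

```python
def premium_percent(price, baseline):
    """Premium of `price` over `baseline`, rounded to a whole percent."""
    return round((price - baseline) / baseline * 100)


# (ECC-capable part price, consumer counterpart price) in USD, as quoted above
pairs = {
    "Kingston Server Premier vs ValueRAM": (91, 79),
    "Xeon W-1290P vs i9-10900K": (539, 488),
    "Xeon W-1270P vs i7-10700K": (428, 374),
    "Xeon W-2255 vs i9-10900X": (778, 590),
    "Xeon W-2295 vs i9-10980XE": (1333, 979),
}

for name, (ecc_price, consumer_price) in pairs.items():
    print(f"{name}: +{premium_percent(ecc_price, consumer_price)}%")
```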
I'm all for better, more reliable XMP profiles.
In a non-ECC environment, you will find that a heavy floating-point calculation program produces slightly different results across repeated runs. Guess why.
You could be "idling" on the desktop and the errors will trigger all the same. It all depends on which address is being used (maybe it's at the beginning of the array, maybe it's at the end and you don't "see" it until the RAM gets full).
Those table entries are indeed data corruption events SOLVED by the ECC (so they're transparent to the OS).
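On Linux, corrected errors like these are typically counted by the EDAC subsystem and exposed under sysfs, which gives you a way to monitor them. A minimal sketch, assuming an EDAC driver for your memory controller is loaded and that the counters live in the usual `ce_count` / `ue_count` files; the function name is made up, and on a machine without EDAC it simply returns nothing.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Map each memory controller to (corrected, uncorrected) error counts."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue                   # no readable counters for this controller
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps climbing is the kind of thing you would never see on a non-ECC box, where the same events just show up as crashes or silent corruption.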