
What causes driver corruption?

If you are very, VERY unlucky a cosmic ray can cause a bit flip and corrupt a file, as it passes through your computer o_O
Solar flares/SunSpots
 
You don't have to believe me man
I don't believe you but let me quickly add, I am NOT suggesting you are lying. I am saying there is a lack of understanding (due to a lack of information) of what this feature does, and that has led you (and others) to jump to unfounded and inaccurate conclusions. As Ian in your video clearly points out, there is confusion over what exactly this feature does. I was unsure too - until I watched your video.

Memory that is so unreliable it needs on-die ECC is poor quality
See, now that is total nonsense! Where did Ian or any other expert say that DDR5 (as opposed to DDR4 or DDR3, for example) memory is "so unreliable" that it "needs" this on-die ECC?

They didn't.

Where did any of your experts claim on-die ECC allows makers to make and use "poorer quality" memory?

They didn't. Why? Because that implies memory that does not use this feature is of "superior quality" - which is not true.

What Ian in that video said was that because "Man" has yet to learn how to make perfection 100% of the time, there will always be defects in the manufacturing process - especially true as density increases. So what this on-die ECC feature does is allow the manufacturers to detect and correct more and more of these inevitable defects so the memory can meet the required JEDEC standards and be used.

What that does is increase the manufacturing success rate, improving the reliability of the manufacturing process (see your video starting at 3:35). And so with fewer production failures, that lowers the production costs - it does NOT mean, in any way, the products are of "poorer quality".

You printed the key points. But sadly, misunderstood what this feature does, or what that video is telling us.

To clarify my point - it is YOUR claim that makers have implemented this ECC-like feature so they can make and sell "poorer quality" RAM. That is just not true.

As YOU quoted and what Ian said (my bold underline added),
on-die ecc allows memory manufacturers to go denser on the process to get higher density memory and more of it comes out the factory and lowers the cost...it enables more scaling down to denser process nodes.

on die ecc helps make the memory cheaper and better yielding.

on die ecc you can actually make sure that more of those cells reach the required jedec specification

And to be sure, "cheaper" in this context means "less expensive". It does not mean inferior quality.

What I see as the problem here is (once again :mad:) "marketing weenies!" sticking their grubby fingers into the mix by describing another product using technical terms incorrectly. They should not have called it ECC - at least not without thoroughly explaining the difference between this on-die ECC and traditional ECC.
 
Hi,
Weird power fluctuations
That's really why you should start with the basics and rule this out with a good battery backup system, something a simple surge protector can't do alone.

Then you can get into Windows updates and hardware quirks, like cheap PSUs, memory, SSDs and HDDs, or outright failures.
 
I saw the title and thought it was a really good question. I've always assumed that code was code and how the hell could it go wrong. It can't get dementia, or become senile. How does a recorded line of code become corrupted? Is it that the preceding explanations mean that those thousands of lines of code might 'lose a line', or the code itself becomes poorly overwritten (thinking of a BIOS flash being interrupted, which is literally code being rewritten and lost). Is that it?
 
I saw the title and thought it was a really good question. I've always assumed that code was code and how the hell could it go wrong. It can't get dementia, or become senile. How does a recorded line of code become corrupted? Is it that the preceding explanations mean that those thousands of lines of code might 'lose a line', or the code itself becomes poorly overwritten (thinking of a BIOS flash being interrupted, which is literally code being rewritten and lost). Is that it?
Hi,
Yes
Bad wiring connections causing power fluctuations are also very possible.
So is a defective flash drive, if you are flashing the BIOS that way.
 
I don't believe you but let me quickly add, I am NOT suggesting you are lying. I am saying there is a lack of understanding (due to a lack of information) of what this feature does, and that has led you (and others) to jump to unfounded and inaccurate conclusions. As Ian in your video clearly points out, there is confusion over what exactly this feature does. I was unsure too - until I watched your video.


See, now that is total nonsense! Where did Ian or any other expert say that DDR5 (as opposed to DDR4 or DDR3, for example) memory is "so unreliable" that it "needs" this on-die ECC?

They didn't.

Where did any of your experts claim on-die ECC allows makers to make and use "poorer quality" memory?

They didn't. Why? Because that implies memory that does not use this feature is of "superior quality" - which is not true.

What Ian in that video said was that because "Man" has yet to learn how to make perfection 100% of the time, there will always be defects in the manufacturing process - especially true as density increases. So what this on-die ECC feature does is allow the manufacturers to detect and correct more and more of these inevitable defects so the memory can meet the required JEDEC standards and be used.

What that does is increase the manufacturing success rate, improving the reliability of the manufacturing process (see your video starting at 3:35). And so with fewer production failures, that lowers the production costs - it does NOT mean, in any way, the products are of "poorer quality".

You printed the key points. But sadly, misunderstood what this feature does, or what that video is telling us.

To clarify my point - it is YOUR claim that makers have implemented this ECC-like feature so they can make and sell "poorer quality" RAM. That is just not true.

As YOU quoted and what Ian said (my bold underline added),


And to be sure, "cheaper" in this context means "less expensive". It does not mean inferior quality.

What I see as the problem here is (once again :mad:) "marketing weenies!" sticking their grubby fingers into the mix by describing another product using technical terms incorrectly. They should not have called it ECC - at least not without thoroughly explaining the difference between this on-die ECC and traditional ECC.

It's really funny how after watching the video you have ended up agreeing with me and yet still proceeded to tell me how I'm completely wrong about everything I said. At least now I know what mansplaining feels like.

You're getting hooked up on the use of 'poorer quality', but that was not why I made my initial point, my point was that it is being marketed as a protection feature when that is NOT why it exists (a point you refused to believe until now, by the way, paraphrasing your words: "why else does ECC exist?"). Oh hey, that's exactly what you just said in your conclusion about the 'marketing weenies', but somehow I'm still wrong? Hmm.

On-die ECC exists in DDR5 so they can use memory that has defects, that would otherwise not pass verification, not to protect our data. That's always been my point. What would you call a defect? Product... ambiguity? Product... special-mystery-feature?

Non-ECC DDR4 with these defects would not pass these tests and would not be sold, therefore it is better quality memory. You can have your own definition of quality (which obviously you do); my definition is: the memory does not produce errors in normal operation without on-die ECC. They could produce memory that does not do this, or not to the same degree (and they almost certainly will, for servers), but they don't. Why? Cost, and cost is about money, not data.

In server farms, where there are DIMMs that produce errors, the technicians don't go "yay, ecc is awesome", they replace the flippin' DIMM (assuming there's not a failed fan, or some other reason), with DDR5 they're using ecc to bake it in defective at manufacture. A different feature, for a completely different reason.

TLDR: Do people buying DDR5 in the marketed belief "it has ecc so my data is protected", realise that potentially they're using memory that produces errors during normal operation, due to manufacturing defects that were passed because of the presence of on-die ecc, or do they believe they're buying something that is inherently better quality than DDR4?
 
I saw the title and thought it was a really good question. I've always assumed that code was code and how the hell could it go wrong. It can't get dementia, or become senile. How does a recorded line of code become corrupted? Is it that the preceding explanations mean that those thousands of lines of code might 'lose a line', or the code itself becomes poorly overwritten (thinking of a BIOS flash being interrupted, which is literally code being rewritten and lost). Is that it?
Reflashing a BIOS, or any ROM for that matter, can have this kind of side effect. Drivers on the filesystem, however, are a bit different, because we get whatever guarantees the filesystem in use provides. Most of the time that means the new data is written somewhere else and not over the existing data, so if something bad happens in the process, the partially written data is still just considered free space. That's unlike overwriting a ROM, where if something goes wrong in the middle you definitely have corruption, because you only wrote part of what you intended. So long as a driver is contained within a single file on the filesystem, a crash while writing over it should not corrupt the driver. What could corrupt a driver is an update that writes multiple files without keeping the different versions in separate places on the filesystem: if a crash occurs partway through, you could end up with some files that are newer and some that are older. This is probably what people refer to as a "corrupted driver," which is really just an incomplete installation.

Just my 2¢.
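For anyone who wants to see the "write it somewhere else, then swap" idea in practice, here is a minimal sketch at the application level. It is not how any particular filesystem or driver installer works internally; `atomic_write` and the file name `example_driver.sys` are made-up names for illustration, and the swap is only guaranteed to be atomic on POSIX-style filesystems.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a reader sees the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same directory,
    # so the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to the device
        os.replace(tmp_path, path)  # swap the new file into place in one step
    except BaseException:
        # Something failed before the swap: drop the temp file, the original is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "example_driver.sys")  # hypothetical file name
        atomic_write(target, b"version 1")
        atomic_write(target, b"version 2")
        with open(target, "rb") as handle:
            print(handle.read())
```

A crash before `os.replace` leaves the old file in place plus a stray temp file. The multi-file case described above is exactly what this does not cover: each file is swapped on its own, so a crash between two swaps can still leave a mix of old and new files.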
 
Hi,
Have to add third-party security suites blocking stuff too.
 
It's really funny...

You're getting hooked up on the use of 'poorer quality'

my point was that it is being marketed as...

I am "hooked up on it" because it is wrong!

What's funny, in a really sad, if not ironic way, is how you criticize me for "getting hooked up" on your claim of these makers making "poorer quality" memory - then you are all "hooked up on" how they are "marketing" this new feature. :rolleyes: :kookoo:

So, in other words, it is perfectly fine for you to misuse words, but others cannot.

Fine. You win. But just for the record - I admitted I too was confused with the "marketing" terminology used and clearly said they should not have called it ECC - at least not without proper descriptions.

I'm outta here.
 
So, according to that video, on-die ECC really is about making cheaper memory that would otherwise fail validation and also allows higher density, not data integrity. I'd have to watch other videos from reliable sources to be convinced of the exact reasons for this feature.

What strikes me, is that this tech can mask inherently less reliable memory which would tend to have a shorter lifespan. Time will tell and if the warranty lengths start to shrink, it will be very telling.
 
So, according to that video, on-die ECC really is about making cheaper memory that would otherwise fail validation and also allows higher density, not data integrity. I'd have to watch other videos from reliable sources to be convinced of the exact reasons for this feature.

What strikes me, is that this tech can mask inherently less reliable memory which would tend to have a shorter lifespan. Time will tell and if the warranty lengths start to shrink, it will be very telling.

What scares me is error correction in hard drives

Non-recoverable read errors: 1 per 10^14 bits read

Well, 10^14 bits is just 12.5 terabytes (about 11.4 TiB)
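The arithmetic behind that figure is easy to check. This is only a rough sketch: the function names are mine, and the probability part assumes errors are independent and occur at exactly the rated rate, which a datasheet figure (usually a worst-case bound) does not promise.

```python
import math

BITS_PER_ERROR = 10**14  # quoted rating: 1 non-recoverable error per 10^14 bits read


def bits_to_terabytes(bits: int) -> float:
    """Decimal terabytes (10^12 bytes)."""
    return bits / 8 / 10**12


def bits_to_tebibytes(bits: int) -> float:
    """Binary tebibytes (2^40 bytes)."""
    return bits / 8 / 2**40


def chance_of_error(bytes_read: float, bits_per_error: int = BITS_PER_ERROR) -> float:
    """P(at least one error) under the independence assumption stated above."""
    bits = bytes_read * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to avoid rounding p away
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


if __name__ == "__main__":
    print(bits_to_terabytes(BITS_PER_ERROR))   # decimal TB
    print(bits_to_tebibytes(BITS_PER_ERROR))   # binary TiB
    print(chance_of_error(12.5e12))            # reading 12.5 TB once
```

So 10^14 bits is 12.5 TB in decimal units, or roughly 11.4 TiB, and under these simplified assumptions reading that much data once gives about a 63% chance of hitting at least one unrecoverable error, not a certainty.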
 
So, according to that video, on-die ECC really is about making cheaper memory that would otherwise fail validation
:) Yes!

"BUT" memory that would otherwise fail validation exactly as other generation/type memory would, has done, and will continue to do (until "Man" can create perfection). That is, not because suddenly and now with DDR5, the memory is otherwise inferior (or "poorer" :rolleyes: quality) than those previous generations.
What scares me is error correction in hard drives
That's what having a robust backup plan is for.
 
I am "hooked up on it" because it is wrong!

What's funny, in a really sad, if not ironic way, is how you criticize me for "getting hooked up" on your claim of these makers making "poorer quality" memory - then you are all "hooked up on" how they are "marketing" this new feature. :rolleyes: :kookoo:

So, in other words, it is perfectly fine for you to misuse words, but others cannot.

Fine. You win. But just for the record - I admitted I too was confused with the "marketing" terminology used and clearly said they should not have called it ECC - at least not without proper descriptions.

I'm outta here.

Actually, I stand by what I said, I repeated the claim (with a definition) in my last post. Any elaboration would be unnecessary at this point, so I'll refrain.

About the marketing: yup, there's a little red book in my bed table, Bill, if it wasn't obvious already.
 
Does the DDR5 report the number of errors corrected? If so, one would know if the error correction was being used to shroud bad RAM.
 
Does the DDR5 report the number of errors corrected? If so, one would know if the error correction was being used to shroud bad RAM.

In the documentation the feature appears to be present for reporting, but they also talk about how the on-die ecc is to address internal errors (i.e. not visible to the system), so it is unclear (at least, it is to me) if consumer systems can actually do this. NAS users would probably have the answer already. In the comments of his article about the specification, Ryan Smith said it is done 'transparently':

On-die ECC is to improve the reliability of individual chips. Between the number of bits per chip getting quite high, and newer nodes getting successively harder to develop, the odds of a single-bit error is getting uncomfortably high. So on-die ECC is meant to counter that, by transparently dealing with single-bit errors.

It's similar in concept to error correction on SSDs (NAND): the error rate is high enough that a modern TLC SSD without error correction would be unusable without it. Otherwise if your chips had to be perfect, these ultra-fine processes would never yield well enough to be usable.

Which appears to suggest the opposite of what I'd assumed it meant.

A funny, sort of related video (@ 35:40):

 
:) Yes!

"BUT" memory that would otherwise fail validation exactly as other generation/type memory would, has done, and will continue to do (until "Man" can create perfection). That is, not because suddenly and now with DDR5, the memory is otherwise inferior (or "poorer" :rolleyes: quality) than those previous generations.

Right, I can see where the controversy is coming from: it's all about intention from the memory mfrs.

First, let me preface this reply by saying that the first I heard about on-die ECC was in this thread (I've read all the posts) and from watching the video that @Tetras linked to, so my reply is based on those only.

Here's a thought experiment to illustrate the point, using made-up values, not real ones.

Scenario 1
Imagine "regular" quality memory having a reliability of 2. If we add on-die ECC to it, the reliability doubles to 4. Wicked, everyone's happy, thumbs up.

Scenario 2
By the same token, memory with half the reliability of regular quality memory can be used which will have a reliability of 1. Doubling that with on-die ECC makes it 2, the same as regular memory. Is this such a good thing? Grey area, just the sort of thing I don't like.

In both scenarios, on-die ECC is doing exactly the same thing, but in scenario 2 it has more work to do, as more errors have to be corrected during normal operation. Imagine the case where a DDR5 cell is faulty and is stuck at zero. The on-die ECC will correct it every time that it should have stored a one, lowering resiliency to random errors from the likes of cosmic rays etc.
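To make the stuck-cell point concrete, here is a toy sketch using a Hamming(7,4) single-error-correcting code. This is a stand-in I picked because it is small; it is not the code real DDR5 chips use, and `store` with its `stuck_at_zero` and `flipped` parameters is a made-up model of a memory word.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return the 4 data bits, correcting at most one wrong bit."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def store(codeword, stuck_at_zero=None, flipped=None):
    """Hypothetical memory word: one cell may be stuck at 0, one may be flipped."""
    cells = list(codeword)
    if stuck_at_zero is not None:
        cells[stuck_at_zero] = 0
    if flipped is not None:
        cells[flipped] ^= 1
    return cells


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    word = encode(data)
    print(decode(store(word, stuck_at_zero=2)) == data)             # stuck cell alone
    print(decode(store(word, flipped=6)) == data)                   # random flip alone
    print(decode(store(word, stuck_at_zero=2, flipped=6)) == data)  # both at once
```

With this toy code, the stuck cell alone and the random flip alone are both corrected, but the two together exceed what a single-error-correcting code can fix and the decoder returns wrong data. If the stuck cell happens to be storing a zero anyway, it costs nothing, so the loss of headroom depends on the data.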

It seems to me that since the error rate inherently goes up as the cells get smaller (higher density), are clocked faster and run at a lower voltage, some mitigating strategy like on-die ECC must be used to keep these errors under control. We're mitigating against the effects of physics here as well as using lower quality memory.

According to the video, however, the second scenario is the correct one. Without expensive, esoteric test equipment and the requisite knowledge to operate it and understand the results, it's impossible for anyone other than the memory mfrs to know which scenario is correct. It could even be a blend of scenarios 1 and 2 for all we know.

Personally, I'd have more confidence in DDR5 if it didn't need on-die ECC to work properly and hence may choose a DDR4 system for my next upgrade due to this. I'd rather lose some performance (which is usually not very much) to get better reliability and peace of mind.
 
Actually, I stand by what I said, I repeated the claim (with a definition) in my last post.
So you changed the accepted definition of "poorer" to something different and inaccurate, then declare you are right.

Got it. :rolleyes:

Does the DDR5 report the number of errors corrected? If so, one would know if the error correction was being used to shroud bad RAM.

That video, IIRC, suggested this feature is used in the production process only. That is, on the individual chips by the chip makers. That is, BEFORE they are mounted on sticks by the stick makers. Therefore, I have to assume this is information not available to us consumers.

I am going to grab my jet-ski now and hit the bicycle trails. I know this is normally called a "bicycle", but I have decided to change the definition of bicycle. That means I am going to hit the jet-ski trail. ;)
 
Hi,
A lot is being assumed here about memory corruption, going all the way to possible bugs in current DDR5.
The OP didn't get into which setups, OEM or custom builds, are having these driver issues.

But assuming that people having issues with drivers are using XMP profiles or just OC'ing memory is a large jump from the OP's question, IMHO.
 
So you changed the accepted definition of "poorer" to something different and inaccurate, then declare you are right.

Got it. :rolleyes:

I never gave a definition in my initial post, I just said "poorer quality" and after that I said it is "so unreliable it needs on-die ECC". My subsequent definition just clarifies what memory quality means to me. I can't see any inconsistency there. I've stated my belief relative to DDR4 non-ecc and on-die ecc DDR5 and why I think it.

You don't agree and you've stated why my sources don't agree with my claim, I don't have a problem with that.
 
and after that I said it is "so unreliable it needs on-die ECC".
Which is not correct! And it implies DDR4 is reliable because it does not need on-die ECC - also incorrect.
My subsequent definition just clarifies what memory quality means to me.
:( You cannot change the definitions of words so they now make sense to you. That is not how language works. If "poorer" doesn't define what you mean, then pick a different word! Don't keep pretending it means whatever you want it to mean.
 
Which is not correct! And it implies DDR4 is reliable because it does not need on-die ECC - also incorrect.

:( You cannot change the definitions of words so they now make sense to you. That is not how language works. If "poorer" doesn't define what you mean, then pick a different word! Don't keep pretending it means whatever you want it to mean.

About DDR4/DDR5: I have nothing to say to that, that I haven't already said. Not being difficult, I'm just happy to let what I've said stand.

About the definition: I honestly have no idea what you mean, might be better to take to PM at this point, if you wish.
 
Corrupted data is usually caused by bad hardware or hardware run out of spec.

I have seen people on Steam who believe it's commonplace, and every time someone has a problem they are told to redownload the game, but this is an example of advice that balloons because it actually was the cause for a few people. Kind of like the old "reboot to fix things".

When it's not bad hardware it can be down to operator error, e.g. downloading drivers with asynchronous writes, then hard powering off the machine within a few seconds of the "download complete" prompt.

There are next-gen filesystems that reduce the risk considerably, such as ZFS and ReFS.
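One way such filesystems catch silent corruption is by keeping a checksum alongside the data and checking it on read. The same idea can be sketched at the file level, which is roughly what a game client's "verify files" step relies on. The helper names here are made up, and SHA-256 is just my choice for the sketch, not necessarily what any filesystem or launcher uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest recorded when the data is known to be good."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single bit flip, e.g. from bad RAM or a failing drive."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def is_intact(data: bytes, expected: str) -> bool:
    """Compare the current contents against the recorded checksum."""
    return checksum(data) == expected


if __name__ == "__main__":
    original = b"pretend this is a driver binary"
    recorded = checksum(original)
    damaged = flip_bit(original, 5)
    print(is_intact(original, recorded))  # untouched copy
    print(is_intact(damaged, recorded))   # copy with one flipped bit
```

A checksum like this only detects the damage. Repairing it needs a second good copy or some form of redundancy, which is where backups come in.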
 
About the definition: I honestly have no idea what you mean, might be better to take to PM at this point, if you wish.
Yup you should.
 
Hi,
Indeed, drive issues are more likely than memory.
 
If you are very, VERY unlucky a cosmic ray can cause a bit flip and corrupt a file, as it passes through your computer o_O

You made me laugh. Of course you are right, but man that would be unfortunate, but also very funny.

EDIT: Try to RMA your computer & tell the helpdesk it was hit by a "Gamma Ray Burst" from outer space. I can imagine the helpdesk laughing & not believing you.
 