ANOTHER ARTICLE ON THIS SUBJECT (rated well by others @ SLASHDOT)
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
================================
KEY POINTS SUMMARY:
================================
Infant mortality? Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component-specific factors.
-----------------------------------------
Vendor MTBF reliability? . . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.
-----------------------------------------
Vendor MTBF reliability? While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs, by data set and type, are up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.
------------------------------------------
Actual MTBFs? The weighted average ARR was 3.4 times larger than 0.88%, the AFR corresponding to a datasheet MTTF of 1,000,000 hours.
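For anyone who wants to sanity-check the arithmetic, here's a rough sketch in plain Python (the helper name & layout are mine; the 0.88%, 3.4x & 13.5% figures are the ones quoted above, & 8760 is just hours per year) of how a datasheet MTTF turns into an AFR, & how far the observed ARRs sit from it:

HOURS_PER_YEAR = 24 * 365   # 8760

def mttf_to_afr(mttf_hours):
    # Datasheet convention: AFR is roughly (powered-on hours per year) / MTTF
    return HOURS_PER_YEAR / mttf_hours

datasheet_afr = mttf_to_afr(1_000_000)    # ~0.0088, i.e. the ~0.88% datasheet AFR
weighted_avg_arr = 3.4 * datasheet_afr    # paper: weighted average ARR was 3.4x that, i.e. ~3%
worst_observed_arr = 0.135                # paper: observed ARRs range as high as 13.5%

print(f"Datasheet AFR for a 1,000,000 h MTTF: {datasheet_afr:.2%}")
print(f"Weighted-average observed ARR:        {weighted_avg_arr:.1%}")
print(f"Worst observed ARR:                   {worst_observed_arr:.1%} "
      f"(~{worst_observed_arr / datasheet_afr:.0f}x the datasheet figure)")

Nothing fancy, it just shows why a 1,000,000-hour MTTF & a 3%+ annual replacement rate don't square up.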
------------------------------------------
Drive reliability after burn-in? Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead, replacement rates seem to steadily increase over time.
------------------------------------------
Data safety under RAID 5? . . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . .
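To make that concrete, here's a back-of-the-envelope Python sketch of the usual exponential/independence estimate of a second failure during a RAID-5 rebuild... the array size, MTTF & rebuild window below are made-up illustrative inputs (NOT numbers from the paper), & the paper's whole point is that real disks fail in bursts, so the true risk is several times higher than an estimate like this suggests:

import math

# Standard exponential/independence estimate of a 2nd disk failure while a
# RAID-5 group rebuilds. All inputs here are illustrative assumptions.
mttf_hours = 1_000_000    # datasheet MTTF of each surviving drive
surviving_drives = 7      # e.g. an 8-disk RAID-5 group after one drive has failed
rebuild_hours = 6         # reconstruction "typically lasts on the order of a few hours"

# Each surviving drive fails within the rebuild window with probability
# 1 - exp(-t/MTTF); assuming independence, the chance that at least one does is:
p_second_failure = 1 - math.exp(-surviving_drives * rebuild_hours / mttf_hours)

print(f"Exponential-model estimate of a 2nd failure during rebuild: {p_second_failure:.2e}")
# Per the paper, correlated failures make the real-world chance of a 2nd
# failure shortly after the 1st roughly 4x larger than a figure like this.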
-------------------------------------------
Independence of drive failures in an array? The distribution of time between disk replacements exhibits decreasing hazard rates; that is, the expected remaining time until the next disk replacement grows with the time that has passed since the last one.
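"Decreasing hazard rate" is easier to see with a quick simulation than a definition, so here's a small Python sketch... it uses a Weibull distribution with shape < 1 purely as an illustration (the shape & scale values are arbitrary, not anything fitted in the paper) & shows the expected remaining wait for the next replacement growing the longer you've already waited, which is the opposite of the memoryless exponential model:

import random

random.seed(0)
scale, shape = 1.0, 0.7   # shape < 1 gives a decreasing hazard rate (illustrative values only)
gaps = [random.weibullvariate(scale, shape) for _ in range(200_000)]

def mean_remaining_wait(gaps, elapsed):
    # Average time still left until the next replacement, given that `elapsed`
    # time units have already passed since the last one.
    remaining = [g - elapsed for g in gaps if g > elapsed]
    return sum(remaining) / len(remaining)

for elapsed in (0.0, 0.5, 1.0, 2.0):
    print(f"already waited {elapsed:.1f} -> expected remaining wait ~ "
          f"{mean_remaining_wait(gaps, elapsed):.2f}")

Under an exponential model every one of those lines would print the same number; here they climb.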
-------------------------------------------
* You guys may wish to check that out... I found it @ SLASHDOT today, & it expands on this topic, albeit from another set of researchers' findings.
APK
P.S.=> This is what I meant above in 1 of my posts here about GOOGLE's findings being "the definitive work" on this topic... although well done & drawn from a heck of a sample set (their disks are doubtless CONSTANTLY pounded on by their search engine's servers), this one seems to have been rated HIGHER @ SLASHDOT as a good analysis, per this quote:
http://hardware.slashdot.org/hardware/07/02/21/004233.shtml
---------------------------------------
"Google's wasn't the best storage paper at FAST '07. Another, more provocative paper looking at real-world results from 100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab"
---------------------------------------
Are SLASHDOT readers "the final word"? No, nobody really is, & no one knows it all in this field or is the "God of Computing" etc., but they are another point of reference for you all on this topic!
(As far as slashdotters go? Well, imo, many are TOO "Pro-Linux/Anti-Microsoft", but there are guys there who REALLY know their stuff as well... take a read, & enjoy if this topic's "YOU"...)
That said, if the actual paper's "TOO MUCH" (& it does get that way @ times), you can always skim thru what the readers @ slashdot stated... they cover much of what it says in more "human language"/layman's terms, & often 'from the trenches'... apk