Re: proactive disk replacement

On 21/03/17 14:15, David Brown wrote:
>> for most arrays the disks have a similar age and usage pattern, so when
>> the first one fails it becomes likely that it don't take too long for
>> another one and so load and recovery time matters

> False.  There is no reason to suspect that - certainly not to within the
> hours or day it takes to rebuild your array.  Disk failure pattern shows
> a peak within the first month or so (failures due to manufacturing or
> handling), then a very low error rate for a few years, then a gradually
> increasing rate after that.  There is not a very significant correlation
> between drive failures within the same system, nor is there a very
> significant correlation between usage and failures.

Except that your argument and the claim don't quite match. You're right -
disk failures do follow the pattern you describe. BUT.

If the array was created from completely new disks, then the usage
patterns will be very similar, so there will be a statistical
correlation between their failures compared to the population as a
whole. (A bit like how a false DNA match is much more likely in an
inbred town than in a cosmopolitan city of immigrants.)

EVEN WORSE. The probability of all the drives coming off the same batch,
and sharing the same systematic defects, is much, much higher. One only
has to look at the Seagate 3TB Barracuda mess to see a perfect example.

In other words, IFF your array is built of a bunch of identical drives
all bought at the same time, the risk of multiple failure is
significantly higher. How significant that is I don't know, but it is a
very valid reason for replacing your drives at semi-random intervals.
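To put a rough number on why the shared failure rate matters, here's a
minimal back-of-the-envelope sketch (all the AFR figures and the 5x
batch-defect multiplier are made-up illustrative numbers, and it assumes
a simple constant-failure-rate model, which real drives only loosely
follow):

```python
import math

def p_any_failure(n_drives, afr, window_hours):
    """Probability that at least one of n_drives fails within the
    window, assuming a constant failure rate (exponential model)."""
    rate_per_hour = afr / (365 * 24)          # annualised AFR -> hourly rate
    p_one = 1 - math.exp(-rate_per_hour * window_hours)
    return 1 - (1 - p_one) ** n_drives

# 6-disk array, one disk already dead, 24-hour rebuild window:
survivors = 5
baseline   = p_any_failure(survivors, afr=0.02, window_hours=24)  # 2% AFR
same_batch = p_any_failure(survivors, afr=0.10, window_hours=24)  # 5x worse

print(f"independent drives: {baseline:.5f}")
print(f"same-batch drives:  {same_batch:.5f}")
```

Both numbers are small for a single rebuild, but the same-batch case is
roughly five times worse, and that multiplier is pure guesswork - with a
genuine systematic defect it could be far larger.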

(Completely off topic :-) but a real-world demonstrable example is
couples' initials. "Like chooses like" and if you compare a couple's
first initials against what you would expect from a random sample, there
is a VERY significant spike in couples that share the same initial.)

To put it bluntly, if your array consists of disks with near-identical
characteristics (including manufacturing batch), then your chances of a
random multiple failure are noticeably increased. Is it worth worrying
about? If you can do something about it, of course!

Cheers,
Wol
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


