Re: proactive disk replacement

Phil Turmel <philip@xxxxxxxxxx> · Wed, 22 Mar 2017 10:32:23 -0400

On 03/22/2017 09:53 AM, Gandalf Corvotempesta wrote:

> Last year I lost a server due to 4 (of 6) disk failures in less
> than an hour during a rebuild.
> 
> The first failure was detected in the middle of the night. It was a
> disconnection/reconnection of a single disk.
> The reconnection triggered a resync. During the resync another disk
> failed. RAID6 recovered even from this double failure,
> but at about 60% of the rebuild a third disk failed, bringing the
> whole RAID down.
> 
> I was woken up by our monitoring system and, when I looked at the
> server, there was also a fourth disk down :)
> 
> 4 disks down in less than an hour. All disks were enterprise: SAS 15K,
> not desktop drives.

You should win a prize, Gandalf.  In the several years I've participated
on this mailing list, you are the first to describe such a catastrophe
where the drives really were at fault, instead of timeout mismatch,
power supplies, cables, or controllers.

All four disks had permanent "FAILED" smartctl status after this, yes?

Phil