Re: stopping md from kicking out "bad" drives

Mikael Abrahamsson <swmike@xxxxxxxxx> · Mon, 11 Nov 2013 08:56:54 +0100 (CET)

On Mon, 11 Nov 2013, Michael Tokarev wrote:

> No, really, that's not the solution I was asking for.

Well, it is.

> Yes, raid6 is better in this context.  But it has exactly the same properties
> when drives start "semi-failing" - it is enough to have one bad sector in
> different places on 3 drives for a catastrophic failure, while the array
> could in fact continue to work normally because the bad sectors are in
> different places.

If you have timeouts set properly then md will be able to re-calculate the 
bad sector from parity and re-write it, even with one drive failed.
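
As a toy illustration of why scattered bad sectors are survivable while
whole-drive kicks are not, here is a minimal Python sketch. It is not md's
code: the function name array_survives and the model itself (a stripe is
recoverable as long as the number of unavailable chunks does not exceed the
number of parity chunks) are assumptions made for illustration only.

```python
def array_survives(num_stripes, parity_chunks, kicked_drives, bad_sectors):
    """Return True if every stripe can still be reconstructed.

    parity_chunks: 1 for RAID5, 2 for RAID6.
    kicked_drives: set of drive indexes that are missing entirely.
    bad_sectors:   dict mapping drive index -> set of stripe indexes
                   where that drive returns a read error.
    """
    for stripe in range(num_stripes):
        # A kicked drive is unavailable in every stripe.
        missing = set(kicked_drives)
        # A drive with a bad sector is unavailable only in that stripe.
        for drive, stripes in bad_sectors.items():
            if stripe in stripes:
                missing.add(drive)
        if len(missing) > parity_chunks:
            return False
    return True


# Three drives, each with one bad sector in a different stripe.
bad = {0: {10}, 1: {20}, 2: {30}}

print(array_survives(100, 2, set(), bad))      # RAID6, nothing kicked
print(array_survives(100, 2, {3}, bad))        # RAID6, one other drive failed
print(array_survives(100, 2, {0, 1, 2}, bad))  # RAID6, all three kicked
print(array_survives(100, 1, {3}, bad))        # RAID5, one other drive failed
```

In this model the first two cases survive and the last two do not: a drive
that reports a read error costs one chunk in one stripe, while a drive that
has been kicked costs one chunk in every stripe.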

> It is the drive kick-off - the decision made by the md driver - which makes
> the failure catastrophic.

That's what the timeout problem is. If you're running consumer drives with
the default Linux kernel timeouts, the drive will be kicked before it can
return a read error.

> We may reduce the probability of such an event by using different
> configuration tweaks, but the underlying problem remains.

The underlying problem is that your drives take longer to return an error
than the kernel is configured to wait for a result from the drive.
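
A rough way to check for that mismatch, sketched in Python. For SCSI/SATA
disks the kernel's per-command timeout in seconds can be read from
/sys/block/<dev>/device/timeout. The drive's worst-case error recovery time
cannot be read this way, so drive_recovery_s is a value you have to supply
yourself (an assumption here), and both helper names are hypothetical. What
actually happens after a timeout also depends on the reset and retry
sequence, which this sketch ignores.

```python
from pathlib import Path


def kernel_timeout_seconds(device, sysfs_root="/sys"):
    """Read the kernel command timeout (seconds) for e.g. device='sda'."""
    path = Path(sysfs_root, "block", device, "device", "timeout")
    return int(path.read_text().strip())


def kicked_before_error(kernel_timeout_s, drive_recovery_s):
    """True if the drive may still be retrying when the kernel gives up.

    drive_recovery_s is the caller's estimate of how long the drive keeps
    retrying internally before it reports a read error.
    """
    return drive_recovery_s > kernel_timeout_s


# Example use (the recovery time is an assumed figure, not a measurement):
#   timeout = kernel_timeout_seconds("sda")
#   if kicked_before_error(timeout, drive_recovery_s=120):
#       print("sda: kernel timeout is shorter than drive error recovery")
```

On drives that support SCT ERC, "smartctl -l scterc,70,70 /dev/sdX" asks the
drive to give up after 7 seconds (the values are in units of 100 ms). For
drives without it, the other option is to write a larger value to the sysfs
timeout file so the kernel waits longer than the drive does.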

> Nope, because the array was (re)syncing a hot spare, not the first failed
> drive.

I don't understand why you would be running a RAID5+spare instead of 
RAID6 without spare.

--
Mikael Abrahamsson    email: swmike@xxxxxxxxx