>>>>> "Phil" == Phil Turmel <philip@xxxxxxxxxx> writes: Phil> On 05/03/2013 09:52 AM, John Stoffel wrote: >> >> After watching endless threads about RAID5 arrays losing a disk, and >> then losing a second during the rebuild, I wonder if it would make >> sense to: >> >> - have MD automatically increase all disk timeouts when doing a >> rebuild. The idea being that we are more tolerant of a bad sector >> when rebuilding? The idea would be to NOT just evict disks when in >> potentially bad situations without trying really hard. Phil> This would be conterproductive for those users who actually Phil> follow manufacturer guidelines when selecting drives for their Phil> arrays. Well for them, which is drives supporting STEC, etc, you'd skip that step. But for those using consumer drives, it might make sense. And I didn't say to make this change for all arrays, just for those in a rebuilding state where losing another disk would be potentially fatal. Phil> Anyways, it's a policy issue that belongs in userspace. Distros Phil> can do this today if they want. There's no lack of scripts in Phil> this list's archives. Sure, but I'm saying that MD should push the policy to default to doing this. You can turn if off if you like and if you know enough. >> - Automatically setup an automatic scrub of the array that happens >> weekly unless you explicitly turn it off. This would possibly >> require changes from the distros, but if it could be made a core >> part of MD so that all the blocks in the array get read each week, >> that would help with silent failures. Phil> I understand some distros already do this. >> We've got all these compute cycles kicking around that could be used >> to make things even more reliable, we should be using them in some >> smart way. Phil> But the "smart way" varies with the hardware at hand. There's Phil> no "one size fits all" solution here. What's the common thread? A RAID5 loses a disk. While rebuilding, another disk goes south. Poof! The entire array is toast until you go through alot of manual steps to re-create. All I'm suggesting is that when in a degraded state, MD automatically becomes more tolerant of timeouts, errors and tries harder to keep going. John -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html