Re: recovering RAID5 from multiple disk failures

On 02/02/2013 06:08 PM, Chris Murphy wrote:
> 
> On Feb 2, 2013, at 2:56 PM, Michael Ritzert <ksciplot@xxxxxxx> 
> wrote:
>> 
>> Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>>> 
>>> Nevertheless, over an hour and a half is a long time if the file 
>>> system were being updated at all. There'd definitely be 
>>> data/parity mismatches for disk1.
>> 
>> After disk1 failed, the only write access should have been
>> metadata update when the filesystem was mounted.

This is significant.

> Was it mounted ro?
> 
>> I only read data from the filesystem thereafter. So only atime 
>> changes are to be expected, there, and only for a small number of 
>> files that I could capture before disk3 failed. I know which files 
>> are affected, and could leave them alone.
> 
> Even for a small number of files there could be dozens or hundreds
> of chunks altered. I think conservatively you have to consider disk
> 1 out and mount in degraded mode.
> 
>> 
>>> If disk 1 is assumed to be useless, meaning force assemble the 
>>> array in degraded mode; a URE or linux SCSI layer time out is to 
>>> be avoided or the array as a whole fails. Every sector is
>>> needed. So what do you think about raising the linux scsi layer
>>> time out to maybe 2 minutes, and leaving the remaining drives'
>>> SCT ERC alone so that they don't time out sooner, but rather go
>>> into whatever deep recovery they have to in the hopes those bad 
>>> sectors can be read?
>>> 
>>> echo 120 >/sys/block/sdX/device/timeout
>> 
>> I just tried that, but I couldn't see any effect. The error rate 
>> coming in is much higher than 1 every two minutes.
> 
> This timeout is not about error rate. And what the value should be 
> depends on context. In normal operation you want the disk error
> recovery to be short, so that the disk produces a bona fide URE, not a
> SCSI layer timeout error. That way md will correct the bad sector.
> That's what probably wasn't happening in your case, which allowed
> bad sectors to accumulate until it was a real problem.

If you try to recover from the degraded array, this is the correct approach.
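
A sketch of that setup, run per member drive (sdX is a placeholder, the
scterc values are in tenths of a second, and not every drive accepts it):

smartctl -l scterc,70,70 /dev/sdX    # drive gives up after 7.0 seconds
cat /sys/block/sdX/device/timeout    # kernel default of 30s stays above that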

> But now, for the cloning process, you want the disk error timeout to 
> be long (or disabled) so that the disk has as long as possible to do 
> ECC to recover each of these problematic sectors. But this also
> means getting the SCSI layer timeout set to at least 1 second longer
> than the longest recovery time for the drives, so that the SCSI layer
> time out doesn't stop sector recovery during cloning. Now maybe the
> disk still won't be able to recover all data from these bad sectors,
> but it's your best shot IMO.

For the array assembled degraded (disk1 left out).
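
For that cloning pass, roughly the opposite settings, assuming GNU
ddrescue and placeholder names (sdX = failing source, sdY = new target):

smartctl -l scterc,0,0 /dev/sdX           # drop the drive's own ERC limit
echo 120 >/sys/block/sdX/device/timeout   # the 2-minute SCSI timeout above
ddrescue -d /dev/sdX /dev/sdY sdX.map     # mapfile lets you stop and resume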

>> When I assemble the array, I will have all new disks (with good 
>> smart selftests...), so I wouldn't expect timeouts. Instead, junk 
>> data will be returned from the sectors in question¹. How will md 
>> react to that?
> 
> Well yeah, with the new drives, they won't report UREs. So there's
> an ambiguity with any mismatch between data and parity chunks as to 
> which is correct. Without a URE, md can't tell whether the data chunk
> is right or wrong with RAID 5.

Bingo.  Working from the copies guarantees you won't have correct data
where the UREs are.  (The copies are very good to have, of course.)

> Phil may disagree, and I have to defer to his experience in this,
> but I think the most conservative and best shot you have at getting
> the 20GB you want off the array is this:

I do disagree.

The above, combined with:

> I do know where the bad sectors are from the ddrescue report. We are
> talking about less than 50 kB of bad data on disk1. Unfortunately, disk3
> is worse, but there is no sector that is bad on both disks.

Leads me to recommend "mdadm --create --assume-clean" using the original
drives, taking care to specify the devices in the proper order (per
their "Raid Device" number in the --examine reports).  I still haven't
seen any data that definitively links specific serial numbers to
specific raid device numbers.  Please do that.
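
Something like this, with placeholder device names; the level, chunk
size, metadata version, and device order (4 members assumed here) must
all match what --examine shows for the originals, so double-check
everything before running it:

mdadm --examine /dev/sd[bcde]1   # map each serial to its "Raid Device"
mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 \
      --metadata=1.2 --chunk=512 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1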

After re-creating the array, and setting all the drives' error recovery
timeouts (SCT ERC) to 7.0 seconds, issue a "check" scrub:

echo "check" >/sys/block/md0/md/sync_action

This should clean up the few pending sectors on disk #1 by
reconstruction from the others, and may very well do the same for disk #3.
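
You can watch the scrub and confirm the pending sectors actually get
rewritten (md0 and sdX are placeholders):

cat /proc/mdstat                        # progress of the check
cat /sys/block/md0/md/mismatch_cnt      # mismatches found so far
smartctl -A /dev/sdX | grep -i pending  # Current_Pending_Sector should drop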

If disk #3 gets kicked out at this point, assemble in degraded mode with
disk #2, #4, and a fresh copy of disk #1 (picking up the new superblock
and any fixes during the partial scrub).  Then "--add" a spare (wiped)
disk and let the array rebuild.
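
Roughly, with placeholder names again (sdf1 being the wiped spare, and
the other three standing in for the disk #1 copy, disk #2, and disk #4):

mdadm --assemble --force --run /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sde1
mdadm /dev/md0 --add /dev/sdf1    # rebuild starts onto the spare
cat /proc/mdstat                  # watch the recovery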

And grab your data.

Phil.