Re: Find mismatch in data blocks during raid6 repair

Hey Piergiorgio,

On Saturday, June 30, 2012 01:48:31 PM Piergiorgio Sartor wrote:
> > the tool currently can detect failure of a single slot, and it
> > could automatically repair that, I chose to make repair an
> > explicit action. In fact, even the slice number and the two slots
> > to repair are given via the command line.
> > 
> > So for example, given this output of raid6check (check mode):
> > Error detected at 1: possible failed disk slot: 5 --> /dev/sda1
> > Error detected at 2: possible failed disk slot: 3 --> /dev/sdb1
> > Error detected at 3: disk slot unknown
> > 
> > To regenerate 1 and 2, run:
> > raid6check /dev/md0 repair 1 5 3
> > raid6check /dev/md0 repair 2 5 3
> > (the repair arguments require you to always rebuild two blocks,
> > one of which should result in a noop in these cases)
> 
> Why always two blocks?

The reason is simply to have fewer cases to handle in the code. There are 
already three ways to regenerate two blocks (D&D, D/P&Q and D&P), and 
there would be two more cases if only one block were to be repaired. With 
the original patch, being able to repair two blocks also lets you repair 
just one (plus another block that effectively comes out unchanged).
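
Just to illustrate the point (a rough sketch, not the actual raid6check 
code): with exactly two slots named on the command line, the dispatch 
stays at those three cases; slot_p and slot_q stand for the parity 
positions of the stripe at hand:

enum repair_case { REPAIR_DD, REPAIR_D_PQ, REPAIR_DP };

/* Illustrative classification of the two requested slots. */
static enum repair_case classify(int a, int b, int slot_p, int slot_q)
{
        if (a == slot_q || b == slot_q)
                return REPAIR_D_PQ;     /* rebuild data (or P) via P/data, then redo Q */
        if (a == slot_p || b == slot_p)
                return REPAIR_DP;       /* rebuild data via Q, then redo P */
        return REPAIR_DD;               /* two data blocks: needs both P and Q */
}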

> > Since for stripe 3, two slots must be wrong, the admin has to
> > provide a
> Well, "unknown" means it is not possible to detect
> which one(s).
> It could be that there are more than 2 corrupted blocks.
> The "unknown" case means that the only reasonable thing
> would be to rebuild the parities, but nothing more can
> be said about the status of the array.
> 
> Nevertheless, there is a possibility which I was thinking
> about, but I never had time to implement (even if the
> software has some already built-in infrastructure for it).
> Specifically, a "vertical" statistic.
> That is, if there are mismatches, and, for example, 90% of
> them belong to /dev/sdX, and the rest 10% are "unknown",
> then it could be possible to extrapolate that, for the
> "unknown", /dev/sdX must be fixed anyway and then re-check
> if the status is still "unknown" or some other disk shows
> up. If one disk is reported, then it could be fixed.
> In other cases, the parity must be adjusted, whatever this
> means in terms of data recovery.
> 
> Of course, this is just a statistical assumption, which
> means a second, "aggressive", option will have to be
> available, with all the warnings of the case.

As you point out, it is impossible to determine which slots are in error 
when two (or more) have failed. I would leave that decision to the admin, 
but giving one or more hints may be a nice idea.

Personally, I am recovering from a simultaneous three-disk failure on a 
backup storage array. My best hope was to ddrescue as much as possible 
from all three disks onto fresh ones, and in the end I lost only a few 
KB on each disk. Using the 
ddrescue log, I can even say which sectors of each disk were damaged. 
Interestingly, two disks of the same model failed on the very same 
sector (even though they were produced at different times), so I now 
have "unknown" slot errors in some stripes. But with context 
information, I am certain I know which slots need to be repaired.
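
(In case it is useful to anyone in a similar spot: mapping a bad range 
from the ddrescue log to a stripe is a one-line calculation. The values 
below are examples; the real ones come from mdadm --examine (Data Offset, 
Chunk Size) and from the ddrescue log, and raid6check's own stripe 
numbering may differ by a constant, so cross-check against its output.)

#include <stdio.h>

int main(void)
{
        /* Example values only -- take the real ones from mdadm
         * --examine and from the ddrescue log (positions in bytes). */
        unsigned long long data_offset = 2048ULL * 512;   /* bytes */
        unsigned long long chunk       = 512ULL * 1024;   /* 512 KiB */
        unsigned long long bad_byte    = 123456789ULL;

        if (bad_byte >= data_offset)
                printf("stripe %llu\n", (bad_byte - data_offset) / chunk);
        else
                printf("damage is before the data area (superblock/bitmap)\n");
        return 0;
}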

> > guess (and could iterate guesses, provided proper stripe backups):
> > raid6check /dev/md0 repair 3 5 3
> 
> Actually, this could also be an improvement, I mean
> the possibility to back up stripes, so that other,
> more advanced recovery could be tried and reverted, if
> necessary.

That is true. I was thinking about this too. Unfortunately, as I 
remember, the functions to save and restore stripes in restripe.c do not 
save P and Q, which we would need in order to redo the data block 
calculations. But with stripe backups, one could even imagine doing 
verifications on upper layers -- such as verifying file(system) 
checksums. I may send another patch implementing this, but I wanted to 
get general feedback on inclusion of such changes first (Neil?).
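
If it helps the discussion, the backup helper could be as simple as the 
sketch below (hypothetical code, not restripe.c): read the chunk of every 
member at the stripe's offset, parity included, and dump it to a file so 
that a guessed repair can be undone later.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Dump one stripe -- all data slots plus P and Q -- to backup_path.
 * stripe_start is the byte offset of the stripe within each member
 * (data offset + stripe number * chunk size). */
static int backup_stripe(char **devices, int ndevs, off_t stripe_start,
                         size_t chunk, const char *backup_path)
{
        FILE *out = fopen(backup_path, "w");
        char *buf = malloc(chunk);
        int i, ret = 0;

        if (!out || !buf) {
                if (out)
                        fclose(out);
                free(buf);
                return -1;
        }
        for (i = 0; i < ndevs; i++) {
                int fd = open(devices[i], O_RDONLY);

                if (fd < 0 ||
                    pread(fd, buf, chunk, stripe_start) != (ssize_t)chunk ||
                    fwrite(buf, 1, chunk, out) != chunk)
                        ret = -1;
                if (fd >= 0)
                        close(fd);
        }
        free(buf);
        fclose(out);
        return ret;
}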


> Finally, someone should consider using the optimized
> raid6 code from the kernel module (can we link that
> code directly?), in order to speed up the check/repair.

I am a big supporter of getting it to work first and making it fast 
afterwards. Since a full raid check takes on the order of hours anyway, I 
do not mind if repairing blocks from user space takes five minutes when 
it could be done in three. That said, I think the faster code in the 
kernel is warranted there (the kernel needs this calculation very often 
while a disk is failed), and if it can easily be reused, we certainly 
should do so.
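
For reference, the math itself is tiny -- P is a plain XOR and Q is the 
sum of g^d * D_d over GF(2^8) with g = 0x02, as in H. Peter Anvin's 
RAID-6 paper -- it is only a naive byte-at-a-time loop like the sketch 
below that is slow, which is exactly what the optimized kernel code 
avoids:

#include <stddef.h>
#include <stdint.h>

/* Multiply by x (0x02) in GF(2^8) with the RAID6 polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t b)
{
        return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1d : 0));
}

/* Reference syndrome generation: P is the XOR of the data blocks,
 * Q is evaluated with Horner's rule from the highest slot down, so
 * that data slot d contributes g^d * D_d. */
static void gen_syndrome_ref(int ndata, size_t len, uint8_t **data,
                             uint8_t *p, uint8_t *q)
{
        size_t i;
        int d;

        for (i = 0; i < len; i++) {
                uint8_t pv = 0, qv = 0;

                for (d = ndata - 1; d >= 0; d--) {
                        qv = gf_mul2(qv) ^ data[d][i];
                        pv ^= data[d][i];
                }
                p[i] = pv;
                q[i] = qv;
        }
}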


Cheers,

Robert
