Re: Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18

On Jan 28, 2013, at 3:59 PM, Wolfgang Denk <wd@xxxxxxx> wrote:

> Dear Chris,
> 
> In message <6D287BCE-96EB-4F91-AC5A-34CD7AD2C68D@xxxxxxxxxxxxxxxxx> you wrote:
>> 
>> Yes, it sounds reproducible on more than one array and more than one HBA. Is it also reproducible on more than one computer, Wolfgang?
> 
> Correct, these are 3 different machines.

Too bad. Better to test first than to commit so many computers and arrays to such a major change.
> 
>> I think regression is going to be needed to find it. Hopefully the
>> problem is restricted to parity computation and data chunks aren't
>> affected; however if a URE occurs in a data chunk, it could be
>> reconstructed incorrectly from bad parity so it's obviously still a
>> big problem.
> 
> My gut feeling is that the data are still OK, but I have to admit that
> I inspected only a small fraction of the files, and I would like to
> avoid restoring the data from backup tapes to another system as long
> as possible.  So it indeed appears to me as if we had a software issue,
> computing incorrect parity data.

Unclear. If both parity chunks (P and Q) are wrong, then you effectively have a partial RAID 0, depending on which stripes have good parity and which don't. I'm not recommending this, but if you set one disk to faulty and ran your file system and file tests again, reads of the chunks that lived on the missing disk would have to be reconstructed from parity: if those files come back bad, then indeed it's parity that's affected. If you don't get errors, it only indicates the test method is insufficient to locate the errors, and it could still be data that's affected.
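
A rough sketch of that experiment, assuming the array is /dev/md0 and /dev/sdX1 is the member you pick (hypothetical names, substitute your own; re-adding the member afterwards forces a rebuild, which carries exactly the risk mentioned below):

  mdadm /dev/md0 --fail /dev/sdX1     # mark one member faulty
  mdadm /dev/md0 --remove /dev/sdX1   # drop it from the array
  # ... run the file system and file comparison tests on the degraded array ...
  mdadm /dev/md0 --add /dev/sdX1      # re-add; md resyncs the member
  cat /proc/mdstat                    # watch the recovery progress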

It's a tenuous situation. It might be wise to pick a low-priority computer for the regression testing, and hopefully the problem gets better rather than worse. If the assumption is that the parity is bad, it needs to be recalculated with a repair scrub. If that goes well, with file tests and another check scrub coming back clean, then it's better to get on with the additional regression testing sooner rather than later. Again, in the meantime, if you lost a drive it could be a real mess if the raid starts rebuilding bad data from parity, or even starts writing user data incorrectly too.
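
For reference, the scrub actions above are driven through the standard md sysfs interface; a sketch assuming the array is md0:

  echo repair > /sys/block/md0/md/sync_action   # recompute parity and rewrite mismatched blocks
  cat /sys/block/md0/md/sync_action             # poll until it reads "idle" again
  echo check > /sys/block/md0/md/sync_action    # follow-up check scrub
  cat /sys/block/md0/md/mismatch_cnt            # should read 0 if the repair held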

Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

