Re: data scrubbing

Beolach <beolach@xxxxxxxxx> · Fri, 29 Jul 2011 16:37:38 -0600

On Fri, Jul 29, 2011 at 15:51, Mathias Burén <mathias.buren@xxxxxxxxx> wrote:
> On 29 July 2011 21:48, Beolach <beolach@xxxxxxxxx> wrote:
>> On Fri, Jul 29, 2011 at 07:25, Nikolay Kichukov <hijacker@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> This is good to know!
>>>
>>> Just performed a check on a raid1 and got:
>>>
>>> Jul 29 15:37:36 hanna64 mdadm[2277]: RebuildFinished event detected on md device /dev/md1, component device  mismatches found: 128
>>>
>>> So I presume those mismatches have now been rewritten to both disks successfully. Am I wrong there?
>>>
>>> cat /sys/block/md1/md/mismatch_cnt
>>> 128
>>>
>>>
>>
>> That depends on if you did a "check" or a "repair" - see the SCRUBBING
>> AND MISMATCHES section of the md(4) man page:
>> "If check was used, then no action is taken to handle the mismatch,
>> it is simply recorded.  If repair was used, then a mismatch will be
>> repaired in the same way that resync repairs arrays."
>>
>>
>> Good luck,
>> Beolach
>
> Sorry to chime in like this. After reading the above, is there a
> reason why anyone shouldn't _always_ use repair instead of check on a
> weekly RAID6 check? You have to run repair anyway after a check if any
> issues are found, right?
>
> Or does the system become vulnerable during a repair? (less redundant)
>
> Thanks,
> Mathias
>

The primary purpose of scrubbing a RAID is to detect and correct read
errors on any of the member devices; both check and repair do this.
Finding mismatches (and, with repair, correcting them) is only a
secondary purpose: a mismatch is reported only when there are no read
errors but the data copies or parity blocks are found to be
inconsistent.  To repair a mismatch, MD has to restore consistency by
overwriting the inconsistent copy or parity blocks with the correct
data.  But because the underlying member devices did not return any
errors, MD has no way of knowing which blocks are correct and which
are not.  So when it is told to do a repair, it assumes that the first
copy in a RAID1 or RAID10, or the data (non-parity) blocks in a
RAID4/5/6, are correct, and fixes the mismatch on that basis.
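
To make that concrete, here is a toy Python model of the RAID1 case.
It is my own sketch, not MD's code: scrub_raid1 and its block lists are
made up for illustration, it assumes every member reads without error,
and it counts mismatched blocks, whereas the kernel's mismatch_cnt is
counted in sectors.

```python
def scrub_raid1(mirrors, action):
    """Toy model of an MD scrub over a RAID1 whose members read without error.

    mirrors: list of equal-length lists of blocks, one list per member device
             (hypothetical stand-ins for real disks).
    action:  "check" only counts mismatches; "repair" also copies the first
             mirror's block over the other mirrors.
    Returns the number of block positions where the mirrors disagreed.
    """
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    mismatches = 0
    for i in range(len(mirrors[0])):
        first = mirrors[0][i]
        if any(m[i] != first for m in mirrors[1:]):
            mismatches += 1
            if action == "repair":
                # No read error tells us who is right, so trust the first copy.
                for m in mirrors[1:]:
                    m[i] = first
    return mismatches


if __name__ == "__main__":
    disk0 = ["A", "corrupt", "C"]   # silently corrupted first copy
    disk1 = ["A", "B", "C"]         # second copy still holds the good data
    print(scrub_raid1([disk0, disk1], "check"))   # reports the mismatch only
    print(scrub_raid1([disk0, disk1], "repair"))  # overwrites disk1 from disk0
    print(disk1)
```

In this made-up case the good block on the second disk is the one that
gets overwritten, which is exactly the risk of the "first copy is
correct" assumption.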

That assumption may or may not hold, and MD cannot reliably tell
which.  The user might be able to, though, using additional knowledge
or tools, so MD lets the user scrub either with (repair) or without
(check) MD correcting the mismatches on that assumption.
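
For what it's worth, a "check first, decide later" routine could be
sketched in Python along these lines.  The sysfs files sync_action and
mismatch_cnt are the ones described in md(4); the helper names and the
sysfs_root parameter are my own, added so the functions can be pointed
at a scratch directory for testing.  Writing to the real files needs
root.

```python
from pathlib import Path


def md_dir(md_name, sysfs_root="/sys/block"):
    """Return the md sysfs directory for an array, e.g. /sys/block/md1/md."""
    return Path(sysfs_root) / md_name / "md"


def start_scrub(md_name, action="check", sysfs_root="/sys/block"):
    """Ask MD to start a scrub by writing 'check' or 'repair' to sync_action."""
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    (md_dir(md_name, sysfs_root) / "sync_action").write_text(action + "\n")


def is_idle(md_name, sysfs_root="/sys/block"):
    """True once sync_action reads 'idle', i.e. no scrub or resync is running."""
    text = (md_dir(md_name, sysfs_root) / "sync_action").read_text()
    return text.strip() == "idle"


def mismatch_count(md_name, sysfs_root="/sys/block"):
    """Read mismatch_cnt as an integer."""
    text = (md_dir(md_name, sysfs_root) / "mismatch_cnt").read_text()
    return int(text.strip())
```

The idea would be to call start_scrub("md1"), poll is_idle("md1") until
the pass finishes, and then look at mismatch_count("md1") before
deciding whether a repair pass is what you actually want.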

I hope that answers your question,
Beolach