Re: A sector-of-mismatch warning patch (was Re: Fault tolerance with badblocks)

NeilBrown <neilb@xxxxxxxx> · Mon, 22 May 2017 08:00:42 +1000

On Fri, May 19 2017, Shaohua Li wrote:

> On Fri, May 19, 2017 at 11:32:43AM +0100, Nix wrote:
>> On 19 May 2017, NeilBrown said:
>> 
>> > On Tue, May 16 2017, Nix wrote:
>> >
>> >> On 16 May 2017, NeilBrown spake thusly:
>> >>
>> >>> Actually, I have another caveat.  I don't think we want these messages
>> >>> during initial resync, or any resync.  Only during a 'check' or
>> >>> 'repair'.
>> >>> So add a check for MD_RECOVERY_REQUESTED or maybe for
>> >>>   sh->sectors >= conf->mddev->recovery_cp
>> >>
>> >> I completely agree, but it's already inside MD_RECOVERY_CHECK:
>> >>
>> >> if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
>> >>         /* don't try to repair!! */
>> >>         set_bit(STRIPE_INSYNC, &sh->state);
>> >>         pr_warn_ratelimited("%s: mismatch sector in range "
>> >>                             "%llu-%llu\n", mdname(conf->mddev),
>> >>                             (unsigned long long) sh->sector,
>> >>                             (unsigned long long) sh->sector +
>> >>                             STRIPE_SECTORS);
>> >> } else {
>> >>
>> >> Doesn't that already mean that someone has explicitly triggered a check
>> >> action?
>> >
>> > Uhmm... yeah.  I lose track of which flags me what exactly.
>> > You log messages aren't generated when 'repair' is used, only when
>> > 'check' is.
>> > I can see why you might have chosen that, but I wonder if it is best.
>> 
>> I'm not sure what the point is of being told when repair is used: hey,
>> there was an inconsistency here but there isn't any more! I suppose you
>> could still use it to see if the repair did the right thing. My problem
>> on that front was that I'm not sure what flag should be used to catch
>> repair but not resync etc: everywhere else in the code, repair is in an
>> unadorned else branch... is it the *lack* of MD_RECOVERY_CHECK and the
>> presence of, uh, something else?
> MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED && MD_RECOVERY_CHECK == check
> MD_RECOVERY_SYNC && MD_RECOVERY_REQUESTED == repair
> MD_RECOVERY_SYNC && !MD_RECOVERY_REQUESTED == resync
>
> Don't see the poin to print the info for 'repair'. 'repair' already changes the
> data, how could we use the info?

Surprising data is can be potentially valuable.
I don't think you should *ever* get an inconsistency in a RAID6 unless
you have faulty hardware.
If you do, then any information about the nature of the inconsistency
might be valuable in understanding the hardware fault.
I don't know in advance how I would interpret the data, but I do
know that if I didn't have the data, then I wouldn't be able to
interpret it.

However .... running "repair" when you don't know exactly what has
happened and why, is probably a bad idea.  So logging probably won't
provide value.
I wouldn't go out of my way to add extra logging for the 'repair' case,
but I certainly wouldn't go out of my way to avoid logging in that case.

It seems inconsistent to log for 'check' but not 'repair', but it isn't
a big deal for me.

NeilBrown
Attachment:
signature.asc

Description: PGP signature