On Wed, 27 Jun 2012 19:40:52 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:

> Hi Neil,
>
> >>
> >> I would still think that there is value in recording in a superblock
> >> that a drive is recovering.
> >
> > Probably.  It is a bit unfortunate that if you stop an array that is
> > recovering after a --re-add, you cannot simply 'assemble' it again and
> > get it back to the same state.
> > I'll think more on that.
>
> As I mentioned, I see the additional re-add as a minor thing, but
> agree it's better to fix it.  What bothers me more is that we don't
> record that the drive is being recovered.  A user might look at the
> superblock and assume the data on the drive is consistent to some
> point in time (the time of the drive failure), while the actual data,
> during a bitmap-based recovery, is unusable until the recovery
> successfully completes.  So the user might think it's okay to try to
> run his app on this drive.
>

Yes, please think about this.

> >
> > Meanwhile, this patch might address your other problem.  It allows --re-add
> > to work if a non-bitmap rebuild fails and is then re-added.
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index c601c4b..d31852e 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
> >                         super_types[mddev->major_version].
> >                                 validate_super(mddev, rdev);
> >                         if ((info->state & (1<<MD_DISK_SYNC)) &&
> > -                            (!test_bit(In_sync, &rdev->flags) ||
> > +                            (test_bit(Faulty, &rdev->flags) ||
> >                              rdev->raid_disk != info->raid_disk)) {
> >                                 /* This was a hot-add request, but events doesn't
> >                                  * match, so reject it.
>
> I have tested a slightly different patch that you suggested earlier -
> just removing the !test_bit(In_sync, &rdev->flags) check.  I confirm
> that it solves the problem.
>
> The Faulty bit check seems redundant to me, because:
> - it can be set only by validate_super(), and only if that drive's
>   role is 0xfffe in the sb->roles[] array
> - a long time ago I asked you how it can happen that a drive thinks
>   about *itself* that it is Faulty (has 0xfffe for its role in its own
>   superblock), and you said this should never happen.

Yes, you are right.  I've removed the test on 'Faulty' - the test on
the raid_disk number is sufficient.

>
> Anyway, I also tested the patch you suggested, and it works too.  Thanks!
>
> Is there any chance to see this fix in ubuntu-precise?

Not really up to me.  It doesn't fix a crash or corruption so I'm not
sure it is -stable material .... though maybe it is if it fixes a
regression.  I suggest you ask the Ubuntu kernel people after it
appears in 3.5-rc (hopefully later this week).

Thanks,
NeilBrown

> Thanks again for your support,
> Alex.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html