On Wed, 27 Jun 2012 19:40:52 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:

> Hi Neil,
>
> >>
> >> I would still think that there is value in recording in a superblock
> >> that a drive is recovering.
> >
> > Probably.  It is a bit unfortunate that if you stop an array that is
> > recovering after a --re-add, you cannot simply 'assemble' it again and
> > get it back to the same state.
> > I'll think more on that.
>
> As I mentioned, I see the additional re-add as a minor thing, but
> agree it's better to fix it.  What bothers me more is that we don't
> record that the drive is being recovered.  A user might look at the
> superblock and assume the data on the drive is consistent to some
> point in time (the time of the drive failure), while the actual data,
> during a bitmap-based recovery, is unusable until the recovery
> successfully completes.  So the user might think it's okay to try to
> run his app on this drive.
>

Yes, please think about this.

> >
> > Meanwhile, this patch might address your other problem.  It allows --re-add
> > to work if a non-bitmap rebuild fails and is then re-added.
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index c601c4b..d31852e 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
> >                         super_types[mddev->major_version].
> >                                 validate_super(mddev, rdev);
> >                         if ((info->state & (1<<MD_DISK_SYNC)) &&
> > -                            (!test_bit(In_sync, &rdev->flags) ||
> > +                            (test_bit(Faulty, &rdev->flags) ||
> >                              rdev->raid_disk != info->raid_disk)) {
> >                                 /* This was a hot-add request, but events doesn't
> >                                  * match, so reject it.
>
> I have tested a slightly different patch that you suggested earlier -
> just removing the !test_bit(In_sync, &rdev->flags) check.  I confirm
> that it solves the problem.
>
> The Faulty bit check seems redundant to me, because:
> - it can be set only by validate_super(), and only if that drive's
>   role is 0xfffe in the sb->roles[] array
> - a long time ago I asked you how it can happen that a drive thinks
>   about *itself* that it is Faulty (has 0xfffe for its role in its own
>   superblock), and you said this should never happen.

Yes, you are right.  I've removed the test on 'Faulty' - the test on
the raid_disk number is sufficient.

>
> Anyway, I also tested the patch you suggested, and it works too.  Thanks!
>
> Is there any chance to see this fix in ubuntu-precise?

Not really up to me.  It doesn't fix a crash or corruption so I'm not
sure it is -stable material .... though maybe it is if it fixes a
regression.  I suggest you ask the Ubuntu kernel people after it
appears in 3.5-rc (hopefully later this week).

Thanks,
NeilBrown

> Thanks again for your support,
> Alex.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html