Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

Hi Neil,

>>
>> I would still think that there is value in recoding in a superblock
>> that a drive is recovering.
>
> Probably.  It is a bit unfortunate that if you stop an array that is
> recovering after a --re-add, you cannot simply 'assemble' it again and
> get it back to the same state.
> I'll think more on that.

As I mentioned, I see the additional --re-add as a minor thing, but I
agree it's better to fix it. What bothers me more is that we don't
record that the drive is being recovered: a user might look at the
superblock and assume the data on the drive is consistent to some
point in time (the time of the drive failure), while during a
bitmap-based recovery the actual data is unusable until the recovery
completes successfully. So the user might think it's okay to run his
application on that drive.
Yes, please think about this.

>
> Meanwhile, this patch might address your other problem.  It allows --re-add
> to work if a non-bitmap rebuild fails and is then re-added.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c601c4b..d31852e 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
>                        super_types[mddev->major_version].
>                                validate_super(mddev, rdev);
>                if ((info->state & (1<<MD_DISK_SYNC)) &&
> -                   (!test_bit(In_sync, &rdev->flags) ||
> +                   (test_bit(Faulty, &rdev->flags) ||
>                     rdev->raid_disk != info->raid_disk)) {
>                        /* This was a hot-add request, but events doesn't
>                         * match, so reject it.
>

I have tested a slightly different patch that you suggested earlier -
just removing the !test_bit(In_sync, &rdev->flags) check - and I
confirm that it solves the problem.

The Faulty bit check seems redundant to me, because:
- it can be set only by validate_super(), and only if that drive's
role is 0xfffe in the sb->roles[] array;
- a long time ago I asked you how a drive could consider *itself*
Faulty (i.e. have 0xfffe as its own role in its own superblock), and
you said this should never happen.

Anyway, I also tested the patch you suggested above, and it works as well.

Is there any chance of seeing this fix in ubuntu-precise?

Thanks again for your support,
Alex.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

