Hi Neil, can you please comment on what you think of this change? Thanks, Alex.

On Wed, Feb 19, 2014 at 11:51 AM, Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
> Hello Neil, Andrei,
> I don't know how much you recall of this old discussion.
>
> Basically, the change that you made means that md doesn't update the
> superblock on the recovering device until recovery completes. As a
> result, when assembling such an array, the recovering device has an old
> event count in its superblock and is not picked during assembly. So
> later, the user has to re-add it manually.
> This is true for a device that failed and was re-added. For a fresh
> device, saved_raid_disk == -1, so its superblock will still be updated
> (and such a device will be picked on assembly).
>
> On the other hand, MD updates the superblock of the In_sync devices
> and marks the recovering device as having a valid array slot (not
> 0xFFFF or 0xFFFE, which are treated similarly today).
>
> What do you think of the following change:
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 561a65f..4bbc7e3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1854,8 +1854,6 @@ retry:
>  			sb->dev_roles[i] = cpu_to_le16(0xfffe);
>  		else if (test_bit(In_sync, &rdev2->flags))
>  			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> -		else if (rdev2->raid_disk >= 0)
> -			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
>  		else
>  			sb->dev_roles[i] = cpu_to_le16(0xffff);
>  	}
>
> Basically, we do not mark the slot of the recovering device until its
> recovery completes, so during assembly we will not pick it, as before.
> For a fresh drive there will be a regression: we will not pick it on
> assembly either.
>
> The reason I am considering this change is another old discussion that
> we had about split-brain resolution, and the proposal I made in:
> https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE
>
> Basically, when MD marks recovering devices as having a valid raid
> slot in the superblock of the In_sync devices, then on array assembly
> we don't know whether such a device is recovering or In_sync (if it is
> inaccessible during assembly). So we have to assume it is In_sync, and
> thus it can potentially cause split-brain.
>
> What do you think of this change?
>
> Thanks,
> Alex.
>
>
> On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>> wrote:
>>
>>> Thanks for commenting, Neil.
>>>
>>> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>>> > wrote:
>>> >
>>> >> Hi again Neil, and Andrey,
>>> >>
>>> >> Looking at this email thread:
>>> >> http://www.spinics.net/lists/raid/msg36236.html
>>> >> between you and Andrey, the conclusion was:
>>> >> "So the correct thing to do is to *not* update the metadata on the
>>> >> recovering device until recovery completes. Then if it fails and is
>>> >> re-added, it will look just the same as when it was re-added the first
>>> >> time, and will do a bitmap-based recovery."
>>> >>
>>> >> I have two doubts about this decision:
>>> >> 1) Since the event count on the recovering drive is not updated, this
>>> >> means that after a reboot, when the array is re-assembled, this drive
>>> >> will not be added to the array, and the user will have to manually
>>> >> re-add it. I agree this is a minor thing.
>>> >
>>> > Still, if it can be fixed, it should be.
>>> >
>>> >> 2) There are places in mdadm which check for a recovery_offset on the
>>> >> drive and take decisions based upon it. Specifically, if there is
>>> >> *no* recovery offset, the data on the drive is considered to be
>>> >> consistent WRT the failure time of the drive. So, for example, the
>>> >> drive can be a candidate for bumping up events during "force
>>> >> assembly". Now, when the superblock on such a drive is not updated
>>> >> during recovery (so there is *no* recovery offset), mdadm will think
>>> >> that the drive is consistent, while in fact its data is totally
>>> >> unusable until recovery completes. That's because we have updated
>>> >> parts of the drive, but did not finish bringing the whole drive
>>> >> in sync.
>>> >
>>> > If mdadm would consider updating the event count if no recovery had
>>> > started, then surely it is just as valid to do so once some recovery
>>> > has started, even if it hasn't completed.
>>>
>>> The patch you accepted from me ("Don't consider disks with a valid
>>> recovery offset as candidates for bumping up event count") actually
>>> attempts to protect from that :)
>>>
>>> I don't understand why "it is just as valid to do so once some
>>> recovery has started". My understanding is that once recovery of a
>>> drive has started, its data is not consistent between different parts
>>> of the drive until the recovery completes. This is because md does
>>> bitmap-based recovery, not any kind of journal/transaction-log-based
>>> recovery.
>>>
>>> However, one could argue that for the force-assembly case, when data
>>> can come up partially corrupted anyway, this is less important.
>>
>> Exactly. And mdadm only updates event counts in the force-assembly case,
>> so while it might not be ideal, it is the best we can do.
>>
>>> I would still think that there is value in recording in the superblock
>>> that a drive is recovering.
>>
>> Probably. It is a bit unfortunate that if you stop an array that is
>> recovering after a --re-add, you cannot simply 'assemble' it again and
>> get it back to the same state.
>> I'll think more on that.
>>
>> Meanwhile, this patch might address your other problem. It allows
>> --re-add to work if a non-bitmap rebuild fails and the device is then
>> re-added.
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index c601c4b..d31852e 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
>>  		super_types[mddev->major_version].
>>  			validate_super(mddev, rdev);
>>  		if ((info->state & (1<<MD_DISK_SYNC)) &&
>> -		    (!test_bit(In_sync, &rdev->flags) ||
>> +		    (test_bit(Faulty, &rdev->flags) ||
>>  		     rdev->raid_disk != info->raid_disk)) {
>>  			/* This was a hot-add request, but events doesn't
>>  			 * match, so reject it.
>>
>>> >
>>> >> Can you please comment on my doubts?
>>> >
>>> > I think there is probably room for improvement here, but I don't think
>>> > there are any serious problems.
>>> >
>>> > However, I'm about to go on leave for a couple of weeks, so I'm unlikely
>>> > to think about it for a while. I've made a note to look at it properly
>>> > when I get back.
>>> >
>>>
>>> Indeed, don't you think about this while you are resting!
>>
>> :-)
>>
>> Thanks. I did have a very enjoyable break.
>>
>> NeilBrown
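
For reference, here is a minimal sketch of the role-recording loop in
super_1_sync() with Alex's proposed change applied. It assumes the names
visible in the quoted diff (rdev_for_each(), desc_nr, dev_roles[]) and omits
the surrounding retry logic, so it is an illustration rather than the
verbatim kernel source:

    /* Record each member device's role in the v1.x superblock.
     * With the proposed change, only fully In_sync devices get a
     * valid slot; a device whose recovery is still in progress
     * falls through to 0xffff (spare), so assembly will not mistake
     * it for a device holding consistent data. */
    rdev_for_each(rdev2, mddev) {
            int i = rdev2->desc_nr;

            if (test_bit(Faulty, &rdev2->flags))
                    sb->dev_roles[i] = cpu_to_le16(0xfffe); /* failed */
            else if (test_bit(In_sync, &rdev2->flags))
                    sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
            else
                    sb->dev_roles[i] = cpu_to_le16(0xffff); /* spare or recovering */
    }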
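
Similarly, a sketch of the hot-add check in add_new_disk() with Neil's patch
applied, simplified from the quoted diff (the export_rdev()/-EINVAL error
path is an assumption about how the surrounding function rejects the
request). The effect of the one-line change is that a device which is merely
not yet In_sync, e.g. one whose rebuild was interrupted, is no longer
rejected on --re-add; only a genuinely Faulty device, or one that disagrees
about its raid slot, is:

    /* After validate_super() has filled in rdev from the on-disk
     * superblock, reject the re-add only if the device is actually
     * Faulty or claims the wrong slot; a device that is simply
     * mid-recovery (not In_sync) may now be re-added. */
    if ((info->state & (1 << MD_DISK_SYNC)) &&
        (test_bit(Faulty, &rdev->flags) ||
         rdev->raid_disk != info->raid_disk)) {
            /* hot-add request, but events don't match: reject it */
            export_rdev(rdev);
            return -EINVAL;
    }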