Hello Neil, Andrei,

I don't know how much you recall of this old discussion. Basically, the change
that you made means that md doesn't update the superblock on the recovering
device until recovery completes. As a result, when assembling such an array,
the recovering device has an old event count in its superblock and is not
picked up during assembly, so later the user has to re-add it manually. This
is true for a device that failed and was re-added. For a fresh device,
saved_raid_disk==-1, so its superblock will still be updated (and such a
device will be picked up on assembly). On the other hand, md updates the
superblocks of the In_sync devices and marks the recovering device there as
having a valid array slot (not 0xffff or 0xfffe, which are treated similarly
today).

What do you think of the following change:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 561a65f..4bbc7e3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1854,8 +1854,6 @@ retry:
 			sb->dev_roles[i] = cpu_to_le16(0xfffe);
 		else if (test_bit(In_sync, &rdev2->flags))
 			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
-		else if (rdev2->raid_disk >= 0)
-			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
 		else
 			sb->dev_roles[i] = cpu_to_le16(0xffff);
 	}

Basically, we do not mark the slot of the recovering device until its recovery
completes, so during assembly we will not pick it up, as before. For a fresh
drive there will be a regression: we will not pick it up on assembly either.
(There is a small stand-alone sketch of the before/after role assignment at
the very end of this mail, below the quoted thread.)

The reason I am considering this change is another old discussion that we had
about split-brain resolution, and the proposal I made in:
https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE

Basically, when md marks recovering devices as having a valid raid slot in the
superblocks of the In_sync devices, then on array assembly we don't know
whether such a device is recovering or In_sync (if it is inaccessible during
assembly). So we have to assume it is In_sync, and thus it can potentially
cause a split-brain.

What do you think of this change?

Thanks,
Alex.

On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@xxxxxxx> wrote:
> On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Thanks for commenting, Neil,
>>
>> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@xxxxxxx> wrote:
>> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>> > wrote:
>> >
>> >> Hi again Neil, and Andrey,
>> >>
>> >> looking at this email thread:
>> >> http://www.spinics.net/lists/raid/msg36236.html
>> >> between you and Andrey, the conclusion was:
>> >> "So the correct thing to do is to *not* update the metadata on the
>> >> recovering device until recovery completes. Then if it fails and is
>> >> re-added, it will look just the same as when it was re-added the first
>> >> time, and will do a bitmap-based recovery."
>> >>
>> >> I have two doubts about this decision:
>> >> 1) Since the event count on the recovering drive is not updated, this
>> >> means that after reboot, when the array is re-assembled, this drive will
>> >> not be added to the array, and the user will have to manually re-add
>> >> it. I agree this is a minor thing.
>> >
>> > Still, if it can be fixed it should be.
>> >
>> >> 2) There are places in mdadm which check for recovery_offset on the
>> >> drive and take decisions based upon that. Specifically, if there is
>> >> *no* recovery offset, the data on this drive is considered to be
>> >> consistent WRT the failure time of the drive.
>> >> So, for example, the drive can be a candidate for bumping up events
>> >> during "force assembly". Now, when the superblock on such a drive is not
>> >> updated during recovery (so there is *no* recovery offset), mdadm will
>> >> think that the drive is consistent, while in fact its data is totally
>> >> unusable until after recovery completes. That's because we have updated
>> >> parts of the drive, but did not complete bringing the whole drive
>> >> in-sync.
>> >
>> > If mdadm would consider updating the event count if no recovery had
>> > started, then surely it is just as valid to do so once some recovery has
>> > started, even if it hasn't completed.
>>
>> The patch you accepted from me ("Don't consider disks with a valid
>> recovery offset as candidates for bumping up event count") actually
>> attempts to protect against that :)
>>
>> I don't understand why "it is just as valid to do so once some
>> recovery has started". My understanding is that once recovery of a
>> drive has started, its data is not consistent between different parts
>> of the drive until the recovery completes. This is because md does
>> bitmap-based recovery, and not some kind of journal/transaction-log
>> based recovery.
>>
>> However, one could argue that for the force-assembly case, when data
>> can anyway come up partially corrupted, this is less important.
>
> Exactly. And mdadm only updates event counts in the force-assembly case, so
> while it might not be ideal, it is the best we can do.
>
>> I would still think that there is value in recording in a superblock
>> that a drive is recovering.
>
> Probably. It is a bit unfortunate that if you stop an array that is
> recovering after a --re-add, you cannot simply 'assemble' it again and
> get it back to the same state.
> I'll think more on that.
>
> Meanwhile, this patch might address your other problem. It allows --re-add
> to work if a device fails during a non-bitmap rebuild and is then re-added.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c601c4b..d31852e 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
>                 super_types[mddev->major_version].
>                         validate_super(mddev, rdev);
>                 if ((info->state & (1<<MD_DISK_SYNC)) &&
> -                   (!test_bit(In_sync, &rdev->flags) ||
> +                   (test_bit(Faulty, &rdev->flags) ||
>                      rdev->raid_disk != info->raid_disk)) {
>                         /* This was a hot-add request, but events doesn't
>                          * match, so reject it.
>
>> >> Can you please comment on my doubts?
>> >
>> > I think there is probably room for improvement here but I don't think
>> > there are any serious problems.
>> >
>> > However, I'm about to go on leave for a couple of weeks, so I'm unlikely
>> > to think about it for a while. I've made a note to look at it properly
>> > when I get back.
>> >
>> >> Indeed, don't you think about this while you are resting!
>
> :-)
>
> Thanks. I did have a very enjoyable break.
>
> NeilBrown
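
P.S. Here is the stand-alone sketch mentioned above. It is not kernel code:
the struct and helper names are made up for illustration, and only the branch
order mirrors the hunk in the superblock-update loop patched above. The role
values follow the v1.x superblock convention (0xfffe = faulty, 0xffff =
spare / no valid slot). It just shows what gets recorded for a recovering
peer before and after the proposed change.

/* Illustrative user-space sketch, not drivers/md/md.c. */
#include <stdint.h>
#include <stdio.h>

struct peer {
	const char *name;
	int faulty;      /* Faulty flag set              */
	int in_sync;     /* In_sync flag set             */
	int raid_disk;   /* array slot, or -1 if none    */
};

/* Current behaviour: a recovering device keeps its slot in dev_roles[]. */
static uint16_t role_before(const struct peer *p)
{
	if (p->faulty)
		return 0xfffe;
	if (p->in_sync)
		return (uint16_t)p->raid_disk;
	if (p->raid_disk >= 0)          /* the branch the patch removes */
		return (uint16_t)p->raid_disk;
	return 0xffff;
}

/* Proposed behaviour: a recovering device is recorded like a spare. */
static uint16_t role_after(const struct peer *p)
{
	if (p->faulty)
		return 0xfffe;
	if (p->in_sync)
		return (uint16_t)p->raid_disk;
	return 0xffff;
}

int main(void)
{
	/* Device re-added into slot 1, recovery still running. */
	struct peer recovering = { "recovering device", 0, 0, 1 };

	printf("%s: before=0x%04x after=0x%04x\n", recovering.name,
	       (unsigned)role_before(&recovering),
	       (unsigned)role_after(&recovering));
	/* prints: before=0x0001 after=0xffff */
	return 0;
}

So with the change, the In_sync members' superblocks no longer claim that the
recovering device holds a valid slot, which is exactly why assembly can no
longer mistake it for In_sync.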
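And a similar sketch for the add_new_disk() check that Neil's patch above
relaxes. Again this is not kernel code and the names are illustrative; it
only restates the condition from the diff: before the patch a re-added device
is rejected unless it is already In_sync, after the patch it is rejected only
if it is Faulty (or the slot doesn't match), so a device whose rebuild was
interrupted can be re-added again.

/* Illustrative user-space sketch, not drivers/md/md.c. */
#include <stdbool.h>
#include <stdio.h>

struct dev_state {
	bool in_sync;    /* In_sync flag                       */
	bool faulty;     /* Faulty flag                        */
	int  raid_disk;  /* slot assigned by validate_super()  */
};

/* Before the patch: reject unless the device is already In_sync. */
static bool reject_before(const struct dev_state *d, int wanted_slot)
{
	return !d->in_sync || d->raid_disk != wanted_slot;
}

/* After the patch: reject only a Faulty device or a slot mismatch. */
static bool reject_after(const struct dev_state *d, int wanted_slot)
{
	return d->faulty || d->raid_disk != wanted_slot;
}

int main(void)
{
	/* A device whose non-bitmap rebuild was interrupted: neither
	 * In_sync nor Faulty, but it still knows its slot. */
	struct dev_state interrupted = { .in_sync = false, .faulty = false,
	                                 .raid_disk = 1 };

	printf("before: %s, after: %s\n",
	       reject_before(&interrupted, 1) ? "rejected" : "accepted",
	       reject_after(&interrupted, 1)  ? "rejected" : "accepted");
	/* prints: before: rejected, after: accepted */
	return 0;
}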