Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
wrote:

> Hi again Neil, and Andrey,
> 
> looking at this email thread:
> http://www.spinics.net/lists/raid/msg36236.html
> between you and Andrey, the conclusion was:
> "So the correct thing to do is to *not* update the metadata on the
> recovering device until recovery completes.  Then if it fails and is
> re-added, it will look just the same as when it was re-added the first
> time, and will do a bitmap-based recovery."
> 
> I have two doubts about this decision:
> 1) Since the event count on the recovering drive is not updated, this
> means that after a reboot, when the array is re-assembled, this drive
> will not be added to the array, and the user will have to manually
> re-add it. I agree this is a minor thing.

Still, if it can be fixed it should be.

> 2) There are places in mdadm which check for recovery_offset on the
> drive and make decisions based on that. Specifically, if there is
> *no* recovery offset, the data on this drive is considered to be
> consistent WRT the failure time of the drive. So, for example, the
> drive can be a candidate for bumping up events during "force
> assembly". Now, when the superblock on such a drive is not updated during
> recovery (so there is *no* recovery offset), mdadm will think that the
> drive is consistent, while in fact its data is totally unusable until
> recovery completes. That's because we have updated parts of the
> drive, but have not finished bringing the whole drive in sync.
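> 
> Roughly, the check I have in mind has this shape (a simplified C sketch
> only; the struct and field names here are made up for illustration and
> are not the actual mdadm code):
> 
> struct dev_info {
>         int has_recovery_offset;  /* MD_FEATURE_RECOVERY_OFFSET recorded? */
> };
> 
> static int data_consistent_at_failure(const struct dev_info *d)
> {
>         /* No recovery offset recorded => mdadm assumes the data is
>          * consistent as of the drive's failure time, so the device is a
>          * candidate for an event-count bump during --assemble --force.
>          * A half-recovered fresh drive whose superblock was never updated
>          * would pass this test even though its data is unusable. */
>         return !d->has_recovery_offset;
> }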

If mdadm would consider updating the event count if no recovery had started,
then surely it is just as valid to do so once some recovery has started, even
if it hasn't completed.

> 
> Can you pls comment on my doubts?

I think there is probably room for improvement here but I don't think there
are any serious problems.

However, I'm about to go on leave for a couple of weeks, so I'm unlikely to
think about it for a while. I've made a note to look at it properly when I
get back.

NeilBrown


> 
> Thanks,
> Alex.
> 
> 
> On Wed, Jun 6, 2012 at 7:09 PM, Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
> > Hi Neil,
> > it looks like we are hitting an issue with this patch.
> > Kernel is stock ubuntu-precise (3.2.0-23 36), mdadm is v3.2.3. (We are
> > aware that this version of mdadm has an issue with unaligned writes, so
> > we backported the fix into this version.)
> >
> > We are testing the following simple scenario:
> > # 2-way raid1 with a missing drive
> > # a fresh drive (no superblock) is added (not re-added) to the raid,
> > and the raid happily starts rebuilding it
> > # at some point in time, this drive fails, so the rebuild aborts and
> > the drive is kicked out
> > # later the drive recovers and "mdadm remove & re-add" is attempted
> > # re-add fails
> >
> > Some debugging suggests the following:
> >
> > - The superblock of the drive to be re-added looks like this:
> >
> >  Avail Dev Size : 10483712 (5.00 GiB 5.37 GB)
> >     Array Size : 10483688 (5.00 GiB 5.37 GB)
> >  Used Dev Size : 10483688 (5.00 GiB 5.37 GB)
> >    Data Offset : 2048 sectors
> >   Super Offset : 8 sectors
> > Recovery Offset : 100096 sectors
> >          State : clean
> >  Resync Offset : N/A sectors
> >    Device UUID : d9114d02:76153894:87aaf09d:c55da9be
> > Internal Bitmap : 8 sectors from superblock
> >    Update Time : Wed Jun  6 18:36:55 2012
> >       Checksum : 94468c9d - correct
> >         Events : 192
> >   Device Role : Active device 1
> >   Array State : AA ('A' == active, '.' == missing)
> >
> >
> > So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET
> > set) and a valid role.
> >
> > - The "re-add" path of mdadm performs calls getinfo_super1(), which performs:
> > role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
> > ...
> > info->disk.state = 6; /* active and in sync */
> > info->disk.raid_disk = role;
> >
> > So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set.
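> >
> > (For reference, 6 is exactly those two bits; the defines below are my
> > reading of include/linux/raid/md_p.h and are shown here only as an
> > illustration:)
> >
> > #define MD_DISK_ACTIVE  1       /* device is active */
> > #define MD_DISK_SYNC    2       /* device is in sync with the array */
> >
> > /* info->disk.state == 6 == (1 << MD_DISK_ACTIVE) | (1 << MD_DISK_SYNC) */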
> >
> > - mdadm issues ADD_NEW_DISK, which arrives at add_new_disk() in the kernel
> > - the kernel performs:
> > if (!mddev->persistent) {
> > ... // not relevant to our case
> > } else
> >        super_types[mddev->major_version].
> >                validate_super(mddev, rdev);
> >
> > - validate_super() performs:
> >        if ((le32_to_cpu(sb->feature_map) &
> >             MD_FEATURE_RECOVERY_OFFSET))
> >                rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
> >        else
> >                set_bit(In_sync, &rdev->flags);
> >        rdev->raid_disk = role;
> >
> > So the In_sync flag is not set, because we have a recovery offset.
> >
> > - Then the code introduced by the patch does:
> >        if ((info->state & (1<<MD_DISK_SYNC)) &&
> >            (!test_bit(In_sync, &rdev->flags) ||
> >             rdev->raid_disk != info->raid_disk)) {
> >                /* This was a hot-add request, but events doesn't
> >                 * match, so reject it.
> >                 */
> >                export_rdev(rdev);
> >                return -EINVAL;
> >        }
> >
> > So the first part of the OR is true, and we abort with -EINVAL.
> >
> > An observation: this problem apparently only happens with a fresh
> > drive. If a drive that was already In_sync fails and is then
> > re-added, the kernel does not update its superblock until recovery
> > completes. So its superblock has no valid recovery_offset, and this
> > problem doesn't happen. The code that skips updating the superblock
> > was added as part of this email thread:
> > http://www.spinics.net/lists/raid/msg36275.html. I did not dig
> > deeper into why the superblock does get updated in my case, but the
> > scenario above repros easily.
> >
> > Q1: Can you confirm that my analysis is correct?
> > Q2: Is this expected behavior? I would assume not, because the same
> > case with a failed drive (not a fresh drive) works fine.
> > Q3: How would you recommend handling this?
> >
> > Thanks!
> > Alex.
