Hi Neil,

----- Original Message -----
> From: "NeilBrown" <neilb@xxxxxxx>
> To: "Andrei E. Warkentin" <andrey.warkentin@xxxxxxxxx>
> Cc: "Andrei Warkentin" <andreiw@xxxxxxxxxx>, linux-raid@xxxxxxxxxxxxxxx
> Sent: Sunday, October 16, 2011 11:20:39 PM
> Subject: Re: [RFC] MD: Allow restarting an interrupted incremental recovery.
>
> On Thu, 13 Oct 2011 21:18:43 -0400 "Andrei E. Warkentin"
> <andrey.warkentin@xxxxxxxxx> wrote:
>
> > 2011/10/12 Andrei Warkentin <andreiw@xxxxxxxxxx>:
> > > If an incremental recovery was interrupted, a subsequent
> > > re-add will result in a full recovery, even though an
> > > incremental should be possible (seen with raid1).
> > >
> > > Solve this problem by not updating the superblock on the
> > > recovering device until the array is no longer degraded.
> > >
> > > Cc: Neil Brown <neilb@xxxxxxx>
> > > Signed-off-by: Andrei Warkentin <andreiw@xxxxxxxxxx>
> >
> > FWIW this idea seems to work well, for the following reasons:
> >
> > 1) The recovering sb is not touched until the array is no longer
> > degraded (only for incremental sync).
> > 2) The events_cleared count isn't updated in the active bitmap sb
> > until the array is no longer degraded. This implies that if the
> > incremental was interrupted, recovering_sb->events is NOT less than
> > active_bitmap->events_cleared.
> > 3) The bitmaps (and sb) are updated on all drives at all times, as
> > before.
> >
> > How I tested it:
> > 1) Create a RAID1 array with a bitmap.
> > 2) Degrade the array by removing a drive.
> > 3) Write a bunch of data (gigs...).
> > 4) Re-add the removed drive - an incremental recovery is started.
> > 5) Interrupt the incremental.
> > 6) Write some more data.
> > 7) MD5sum the data.
> > 8) Re-add the removed drive - the incremental recovery is restarted
> > (I verified it starts at sector 0, just like you mentioned it should,
> > to avoid consistency issues). Verified that, indeed, only changed
> > blocks (as noted by the write-intent bitmap) are synced.
> > 10) Remove the other half.
> > 11) MD5sum the data - hashes match.
> >
> > Without this fix, you would of course have to deal with a full
> > resync after the interrupted incremental.
> >
> > Is there anything you think I'm missing here?
> >
> > A
>
> Not much, it looks good, and your testing is of course a good sign.
>
> My only thought is whether we really need the new InIncremental flag.
> You set it exactly when saved_raid_disk is set, and clear it exactly
> when saved_raid_disk is cleared (set to -1).
> So maybe we can just use saved_raid_disk.

You're right - I looked at the code paths again (for non-bitmapped disks),
and there is no reason not to leverage saved_raid_disk. It should be safe
(i.e. we don't miss a necessary sb flush) for non-bitmapped disks: a disk
resuming recovery (from rec_offset) will never have In_sync set, and
saved_raid_disk will not be set. For multipath this is irrelevant as well.

> If you look at it that way, you might notice that saved_raid_disk is
> also set in slot_store, so probably InIncremental should be set there.
> So that might be the one thing you missed.

Actually, maybe you can clarify something a bit about that code? The part
where slot != -1 and mddev->pers != NULL looks a lot like the add_new_disk
path, except that after pers->hot_add_disk (rough sketch below):

1) rdev In_sync isn't cleared.
2) The array isn't checked for a degraded state, and MD_RECOVERY_RECOVER
   isn't conditionally set.
3) MD_RECOVERY_NEEDED isn't set.
4) mddev->thread isn't woken up.
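For reference, this is roughly the shape of what the add_new_disk path does
after a successful hot-add, and which slot_store skips. It's only a paraphrase
from my reading of md.c - the helper name is made up and the type names are
approximate, so please don't take it as verbatim kernel code:

/* Illustrative only: the post-hot-add steps on the add_new_disk path,
 * which slot_store does not perform.  Not a literal copy of md.c. */
static void post_hot_add_steps(struct mddev *mddev, struct md_rdev *rdev)
{
	/* 1) mark the newly added device as not yet in sync */
	clear_bit(In_sync, &rdev->flags);

	/* 2) if the array is degraded, note that a recovery is wanted */
	if (mddev->degraded)
		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);

	/* 3) ask md_check_recovery() to actually start the recovery */
	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);

	/* 4) wake the per-array thread so it notices the request */
	md_wakeup_thread(mddev->thread);
}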
Is this because if an array was degraded AND there were extra spares, they
would already be assigned to the array?

> Could you respin the patch without adding InIncremental, and testing
> rdev->saved_raid_disk >= 0 instead, check if you agree that should work,
> and perform a similar test?
> (Is that asking too much?).

Absolutely! Thanks again!

A
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html