Re: [md PATCH 00/16] hot-replace support for RAID4/5/6

NeilBrown <neilb@xxxxxxx> · Thu, 15 Dec 2011 17:18:50 +1100

On Wed, 14 Dec 2011 14:18:51 -0800 Dan Williams <dan.j.williams@xxxxxxxxx>
wrote:

> On Tue, Oct 25, 2011 at 6:43 PM, NeilBrown <neilb@xxxxxxx> wrote:
> > The following series - on top of my for-linus branch which should appear in
> > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6.  This is almost
> > certainly the most requested feature over the last few years.
> > The whole series can be pulled from my md-devel branch:
> >   git://neil.brown.name/md md-devel
> > (please don't do a full clone, it is not a very fast link).
> 
> Some belated comments based on the commit ids at the time:
> 
> 88eeb3d md: refine interpretation of "hold_active == UNTIL_IOCTL".
> 9c22832 md: take a reference to mddev during sysfs access.
> a7d6ae4 md: remove test for duplicate device when setting slot number.
> 6deecf2 md: change hot_remove_disk to take an rdev rather than a number.
> 
> last 4 reviewed-by.

Thanks.  I've annotated the two that haven't gone upstream yet.

> 
> f248f8c md: create externally visible flags for supporting hot-replace.
> 
> 'replaceable' just strikes me as a confusing name as all devices are
> nominally "replaceable", but whether you want it to be actively
> replaced is a different consideration.  What about "incumbent" to mark
> the disk as currently holding a position we want it to vacate and
> remove any potential confusion with 'replacement'.

Fair point.  I had wondered if I should not have the flag and just use the
"write_error" flag.  However the meaning is slightly different.

I don't really like "incumbent" as it gives no indication that there is a
desire to replace the device.  Maybe "want_replacement" ??

> 
> ce8fd05 md/raid5: allow each slot to have an extra replacement device
> fd7557d md/raid5: raid5.h cleanup
> 15e9a58 md/raid5: remove redundant bio initialisations.
> 
> last 3 reviewed-by.

Thanks.

> 
> 37aebb5 md/raid5: preferentially read from replacement device if possible.
> 
> +                       /* This flag does not apply to '.replacement'
> +                        * only to .rdev, so make sure to check that*/
> +                       struct md_rdev *rdev2 = rcu_dereference(
> +                               conf->disks[i].rdev);
> +                       if (rdev2 == rdev)
> +                               clear_bit(R5_Insync, &dev->flags);
> +                       if (!test_bit(Faulty, &rdev2->flags)) {
> 
> can't rdev2 be NULL here?

Uhm... probably.  I've added a test for rdev2 like I have in the "MadeGood"
case below.

Thanks.

> 
> @@ -4201,7 +4241,6 @@ static int  retry_aligned_read(struct r5conf
> *conf, struct bio *raid_bio)
>                         return handled;
>                 }
> 
> -               set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
>                 if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
>                         release_stripe(sh);
>                         raid5_set_bi_hw_segments(raid_bio, scnt);
> 
> 
> Should this one liner be broken out for -stable?

Uhmm... maybe.  If the array is degraded we'll hit problems soon anyway, and
if it isn't, the read-errors will all soon be fixed up.

Do you see a particular problem that this fixes that is already possible
without hot-replace?

> 
> 8e2c0f9 md/raid5: allow removal for failed replacement devices.
> 17df00a md/raid5: writes should get directed to replacement as well as original.
> 
> last 2 reviewed-by

Thanks.

> 
> dba5a681 md/raid5:  detect and handle replacements during recovery.
> 
> This one got me looking back to recall the rules about when
> rcu_deference must be used for an rdev (the ones outlined in commit
> 9910f16a "md: fix up some rdev rcu locking in raid5/6").  But the
> casual future reader may have a hard time finding that commit.  Maybe
> we could introduce our own rdev_deref() macro so that sparse and
> lockdep can automatically validate rdev derefences like below.
> 
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 8d8e139..6023583 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -357,9 +357,14 @@ enum {
> 
> 
>  struct disk_info {
> -       struct md_rdev  *rdev, *replacement;
> +       struct md_rdev __rcu *rdev,
> +       struct md_rdev __rcu *replacement;
>  };
> 
> +#define rdev_deref(p, md, sh) \
> +       rcu_dereference_check((p), (md) ? mddev_is_locked(md) : 1 || \
> +                                  (sh) ? test_bit(STRIPE_SYNCING,
> &(sh)->state) : 1)
> +
>  struct r5conf {
>         struct hlist_head       *stripe_hashtbl;
>         struct mddev            *mddev;
> 
> ...but not sure if it's worth the code uglification.

No, I'm not sure either... If it comes up again I might...

> 
> 
> Nit, not sure if it's worth fixing but this one introduces some
> inconsistent line wrapping around logical operators... "at the end" vs
> "beginning of next line"
> 
> +               if (rdev
> +                   && !test_bit(Faulty, &rdev->flags)
> +                   && !test_bit(In_sync, &rdev->flags)
> +                   && !rdev_set_badblocks(rdev, sh->sector,
> +                                          STRIPE_SECTORS, 0))
> +                       abort = 1;
> +               rdev = conf->disks[i].replacement;
> +               if (rdev
> +                   && !test_bit(Faulty, &rdev->flags)
> +                   && !test_bit(In_sync, &rdev->flags)
> +                   && !rdev_set_badblocks(rdev, sh->sector,
> +                                          STRIPE_SECTORS, 0))
>                         abort = 1;
>         }
>         if (abort) {
> @@ -2456,6 +2475,22 @@ handle_failed_sync(struct r5conf *conf, struct
> stripe_head *sh,
>         }
>  }
> 
> +static int want_replace(struct stripe_head *sh, int disk_idx)
> +{
> +       struct md_rdev *rdev;
> +       int rv = 0;
> +       /* Doing recovery so rcu locking not required */
> +       rdev = sh->raid_conf->disks[disk_idx].replacement;
> +       if (rdev &&
> +           !test_bit(Faulty, &rdev->flags) &&
> +           !test_bit(In_sync, &rdev->flags) &&
> +           (rdev->recovery_offset <= sh->sector ||
> +            rdev->mddev->recovery_cp <= sh->sector))
> +               rv = 1;
> +
> +       return rv;

Thanks.
I almost always prefer 'at the start' as import things should be obvious.
So I have updated 'want_replace'.

> 
> 2693b9e md/raid5: handle activation of replacement device when
> recovery completes.
> 
> I questioned not needing a barrier in raid5_end_write_request after
> finding conf->disks[i].replacement == NULL until I found the note in
> raid5_end_read_request about the rdev being pinned until all i/o
> returns.  Maybe a similar note to raid5_end_write_request?

I like adding explanatory notes ... but I'm not quite sure what you are
suggesting here.  Could you be a little more explicit?  Thanks.

> 
> d6db3d0 md/raid5: recognise replacements when assembling array.
> 6cdb4fb md/raid5: If there is a spare and a replaceable device, start
> replacement.
> 0124565 md/raid5: Mark device replaceable when we see a write error.
> 
> last 3 reviewed-by.

Thanks.

> 
> 058c478..678a66d
> raid10 and raid1 patches not reviewed.

That's what a Christmas break is for, isn't it??

Thanks for all the review - I really appreciate it.

NeilBrown
Attachment:
signature.asc

Description: PGP signature