On Wed, 14 Dec 2011 14:18:51 -0800 Dan Williams <dan.j.williams@xxxxxxxxx> wrote: > On Tue, Oct 25, 2011 at 6:43 PM, NeilBrown <neilb@xxxxxxx> wrote: > > The following series - on top of my for-linus branch which should appear in > > 3.2-rc1 eventually - implements hot-replace for RAID4/5/6. This is almost > > certainly the most requested feature over the last few years. > > The whole series can be pulled from my md-devel branch: > > git://neil.brown.name/md md-devel > > (please don't do a full clone, it is not a very fast link). > > Some belated comments based on the commit ids at the time: > > 88eeb3d md: refine interpretation of "hold_active == UNTIL_IOCTL". > 9c22832 md: take a reference to mddev during sysfs access. > a7d6ae4 md: remove test for duplicate device when setting slot number. > 6deecf2 md: change hot_remove_disk to take an rdev rather than a number. > > last 4 reviewed-by. Thanks. I've annotated the two that haven't gone upstream yet. > > f248f8c md: create externally visible flags for supporting hot-replace. > > 'replaceable' just strikes me as a confusing name as all devices are > nominally "replaceable", but whether you want it to be actively > replaced is a different consideration. What about "incumbent" to mark > the disk as currently holding a position we want it to vacate and > remove any potential confusion with 'replacement'. Fair point. I had wondered if I should not have the flag and just use the "write_error" flag. However the meaning is slightly different. I don't really like "incumbent" as it gives no indication that there is a desire to replace the device. Maybe "want_replacement" ?? > > ce8fd05 md/raid5: allow each slot to have an extra replacement device > fd7557d md/raid5: raid5.h cleanup > 15e9a58 md/raid5: remove redundant bio initialisations. > > last 3 reviewed-by. Thanks. > > 37aebb5 md/raid5: preferentially read from replacement device if possible. > > + /* This flag does not apply to '.replacement' > + * only to .rdev, so make sure to check that*/ > + struct md_rdev *rdev2 = rcu_dereference( > + conf->disks[i].rdev); > + if (rdev2 == rdev) > + clear_bit(R5_Insync, &dev->flags); > + if (!test_bit(Faulty, &rdev2->flags)) { > > can't rdev2 be NULL here? Uhm... probably. I've added a test for rdev2 like I have in the "MadeGood" case below. Thanks. > > @@ -4201,7 +4241,6 @@ static int retry_aligned_read(struct r5conf > *conf, struct bio *raid_bio) > return handled; > } > > - set_bit(R5_ReadError, &sh->dev[dd_idx].flags); > if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) { > release_stripe(sh); > raid5_set_bi_hw_segments(raid_bio, scnt); > > > Should this one liner be broken out for -stable? Uhmm... maybe. If the array is degraded we'll hit problems soon anyway, and if it isn't, the read-errors will all soon be fixed up. Do you see a particular problem that this fixes that is already possible without hot-replace? > > 8e2c0f9 md/raid5: allow removal for failed replacement devices. > 17df00a md/raid5: writes should get directed to replacement as well as original. > > last 2 reviewed-by Thanks. > > dba5a681 md/raid5: detect and handle replacements during recovery. > > This one got me looking back to recall the rules about when > rcu_deference must be used for an rdev (the ones outlined in commit > 9910f16a "md: fix up some rdev rcu locking in raid5/6"). But the > casual future reader may have a hard time finding that commit. Maybe > we could introduce our own rdev_deref() macro so that sparse and > lockdep can automatically validate rdev derefences like below. > > diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h > index 8d8e139..6023583 100644 > --- a/drivers/md/raid5.h > +++ b/drivers/md/raid5.h > @@ -357,9 +357,14 @@ enum { > > > struct disk_info { > - struct md_rdev *rdev, *replacement; > + struct md_rdev __rcu *rdev, > + struct md_rdev __rcu *replacement; > }; > > +#define rdev_deref(p, md, sh) \ > + rcu_dereference_check((p), (md) ? mddev_is_locked(md) : 1 || \ > + (sh) ? test_bit(STRIPE_SYNCING, > &(sh)->state) : 1) > + > struct r5conf { > struct hlist_head *stripe_hashtbl; > struct mddev *mddev; > > ...but not sure if it's worth the code uglification. No, I'm not sure either... If it comes up again I might... > > > Nit, not sure if it's worth fixing but this one introduces some > inconsistent line wrapping around logical operators... "at the end" vs > "beginning of next line" > > + if (rdev > + && !test_bit(Faulty, &rdev->flags) > + && !test_bit(In_sync, &rdev->flags) > + && !rdev_set_badblocks(rdev, sh->sector, > + STRIPE_SECTORS, 0)) > + abort = 1; > + rdev = conf->disks[i].replacement; > + if (rdev > + && !test_bit(Faulty, &rdev->flags) > + && !test_bit(In_sync, &rdev->flags) > + && !rdev_set_badblocks(rdev, sh->sector, > + STRIPE_SECTORS, 0)) > abort = 1; > } > if (abort) { > @@ -2456,6 +2475,22 @@ handle_failed_sync(struct r5conf *conf, struct > stripe_head *sh, > } > } > > +static int want_replace(struct stripe_head *sh, int disk_idx) > +{ > + struct md_rdev *rdev; > + int rv = 0; > + /* Doing recovery so rcu locking not required */ > + rdev = sh->raid_conf->disks[disk_idx].replacement; > + if (rdev && > + !test_bit(Faulty, &rdev->flags) && > + !test_bit(In_sync, &rdev->flags) && > + (rdev->recovery_offset <= sh->sector || > + rdev->mddev->recovery_cp <= sh->sector)) > + rv = 1; > + > + return rv; Thanks. I almost always prefer 'at the start' as import things should be obvious. So I have updated 'want_replace'. > > 2693b9e md/raid5: handle activation of replacement device when > recovery completes. > > I questioned not needing a barrier in raid5_end_write_request after > finding conf->disks[i].replacement == NULL until I found the note in > raid5_end_read_request about the rdev being pinned until all i/o > returns. Maybe a similar note to raid5_end_write_request? I like adding explanatory notes ... but I'm not quite sure what you are suggesting here. Could you be a little more explicit? Thanks. > > d6db3d0 md/raid5: recognise replacements when assembling array. > 6cdb4fb md/raid5: If there is a spare and a replaceable device, start > replacement. > 0124565 md/raid5: Mark device replaceable when we see a write error. > > last 3 reviewed-by. Thanks. > > 058c478..678a66d > raid10 and raid1 patches not reviewed. That's what a Christmas break is for, isn't it?? Thanks for all the review - I really appreciate it. NeilBrown
Attachment:
signature.asc
Description: PGP signature