On Wed, 2011-10-26 at 12:43 +1100, NeilBrown wrote:
> The following series - on top of my for-linus branch which should appear in
> 3.2-rc1 eventually - implements hot-replace for RAID4/5/6.  This is almost
> certainly the most requested feature over the last few years.
> The whole series can be pulled from my md-devel branch:
>    git://neil.brown.name/md md-devel
> (please don't do a full clone, it is not a very fast link).
>
> There is currently no mdadm support, but you can test it out and
> experiment without mdadm.
>
> In order to activate hot-replace you need to mark the device as
> 'replaceable'.
> This happens automatically when a write error is recorded in a
> bad-block log (if you happen to have one).
> It can be achieved manually by
>    echo replaceable > /sys/block/mdXX/md/dev-YYY/state
>
> This makes YYY, in XX, replaceable.
>
> If md notices that there is a replaceable drive and a spare it will
> attach the spare to the replaceable drive and mark it as a
> 'replacement'.
> This word appears in the 'state' file and as (R) in /proc/mdstat.
>
> md will then copy data from the replaceable drive to the replacement.
> If there is a bad block on the replaceable drive, it will get the data
> from elsewhere.  This looks like a "recovery" operation.
>
> When the replacement completes the replaceable device will be marked
> as Failed and will be disconnected from the array (i.e. the 'slot'
> will be set to 'none') and the replacement drive will take up full
> possession of that slot.

Neil,

Seems to work quite well.  Note I have not yet performed a data
consistency check, just exercised the mechanics of 'replacing' an
existing drive.

I see in the code that a recovery is kicked off immediately after
changing the state of a drive.  One question is whether it would be
possible to mark multiple drives for replacement and then invoke the
recovery once, replacing all of the marked disks in a single pass.
Right now, changing the state on multiple drives kicks off sequential
recoveries.  For larger disks (3TB/etc), each recovery takes a long
time and there is a non-zero performance hit on the live array.

There are two common use cases to think about.  The first is replacing
an array's disks with (say) larger ones.  The second is an array that
has been in service for a while, where the disks are approaching
end-of-life and several are showing signs of possible failure.  In both
cases we want to replace a number of drives at one time and incur the
performance hit only once.

I see where the code limits recovery to one sync at a time; would it be
possible to extend this to multiple concurrent replacements?  What
would it take to enable this?

Thanks again for this effort, this is terrific.

Best,
-PWM

> It is not possible to assemble an array with a replacement using mdadm.
> To do this by hand:
>
>    mknod /dev/md27 b 9 27
>    < /dev/md27
>    cd /sys/block/md27/md
>    echo 1.2 > metadata_version
>    echo 8:1 > new_dev
>    echo 8:17 > new_dev
>    ...
>    echo active > array_state
>
> Replace '27' by the md number you want.  Replace 1.2 by the metadata
> version number (must be 1.x for some x).  Replace 8:1, 8:17 etc
> by the major:minor numbers of each device in the array.
>
> Yes: this is clumsy.  But then you aren't doing this on live data -
> only on test devices to experiment.
>
> You can still assemble the array without the replacement using mdadm.
> Just list all the drives except the replacement in the --assemble
> command.
> Also once the replacement operation completes you can of course stop
> and assemble the new array with old mdadm.
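Putting the instructions above together end to end, a throw-away test on
scratch devices looks roughly like the sketch below.  The loop-device
setup, the array geometry and the plain 'mdadm --add' for the spare are
my own assumptions layered on top of Neil's description, so treat this
as illustrative rather than a recipe:

   # Scratch array only -- never live data.  Device names and sizes
   # are arbitrary examples.
   for i in 0 1 2 3 4; do
       dd if=/dev/zero of=/tmp/d$i bs=1M count=200
       losetup /dev/loop$i /tmp/d$i
   done
   mdadm --create /dev/md27 --level=5 --raid-devices=4 \
         /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3

   # Make a spare available; an ordinary --add should be enough, since
   # spare handling itself is not changed by this series (my assumption).
   mdadm /dev/md27 --add /dev/loop4

   # Mark one member replaceable; md should attach the spare as its
   # replacement, shown as (R) in /proc/mdstat, and start copying.
   echo replaceable > /sys/block/md27/md/dev-loop2/state
   cat /proc/mdstat
   cat /sys/block/md27/md/dev-loop2/state

   # Once the copy finishes the old member drops out of its slot, and
   # the array can be stopped and re-assembled with stock mdadm:
   mdadm --stop /dev/md27
   mdadm --assemble /dev/md27 /dev/loop0 /dev/loop1 /dev/loop4 /dev/loop3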
>
> I hope to submit this together with support for RAID10 (and maybe some
> minimal support for RAID1) for Linux-3.3.  By the time it comes out
> mdadm-3.3 should exist with full support for hot-replace.
>
> Review and testing is very welcome, but please do not try it on live
> data.
>
> NeilBrown
>
>
> ---
>
> NeilBrown (16):
>       md/raid5: Mark device replaceable when we see a write error.
>       md/raid5: If there is a spare and a replaceable device, start replacement.
>       md/raid5: recognise replacements when assembling array.
>       md/raid5: handle activation of replacement device when recovery completes.
>       md/raid5: detect and handle replacements during recovery.
>       md/raid5: writes should get directed to replacement as well as original.
>       md/raid5: allow removal for failed replacement devices.
>       md/raid5: preferentially read from replacement device if possible.
>       md/raid5: remove redundant bio initialisations.
>       md/raid5: raid5.h cleanup
>       md/raid5: allow each slot to have an extra replacement device
>       md: create externally visible flags for supporting hot-replace.
>       md: change hot_remove_disk to take an rdev rather than a number.
>       md: remove test for duplicate device when setting slot number.
>       md: take after reference to mddev during sysfs access.
>       md: refine interpretation of "hold_active == UNTIL_IOCTL".
>
>
>  Documentation/md.txt      |   22 ++
>  drivers/md/md.c           |  132 ++++++++++---
>  drivers/md/md.h           |   82 +++++---
>  drivers/md/multipath.c    |    7 -
>  drivers/md/raid1.c        |    7 -
>  drivers/md/raid10.c       |    7 -
>  drivers/md/raid5.c        |  462 +++++++++++++++++++++++++++++++++++----------
>  drivers/md/raid5.h        |   98 +++++-----
>  include/linux/raid/md_p.h |    7 -
>  9 files changed, 599 insertions(+), 225 deletions(-)
>
> --
> Signature
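P.S. on the data-consistency check I mentioned above: one straightforward
way would be to scrub the array and compare a whole-device checksum taken
before triggering the replacement.  sync_action and mismatch_cnt are the
existing md sysfs files; the checksum step and its file name are just a
hypothetical illustration:

   # Scrub the array and look for parity/data mismatches.
   echo check > /sys/block/md27/md/sync_action
   # ...wait for the check to finish, then:
   cat /sys/block/md27/md/mismatch_cnt

   # Compare against a checksum recorded before the replacement:
   #   md5sum /dev/md27 > /tmp/md27.before.md5     (taken beforehand)
   md5sum -c /tmp/md27.before.md5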