On Wed, 2011-10-26 at 12:43 +1100, NeilBrown wrote:
> The following series - on top of my for-linus branch which should appear in
> 3.2-rc1 eventually - implements hot-replace for RAID4/5/6.  This is almost
> certainly the most requested feature over the last few years.
> The whole series can be pulled from my md-devel branch:
>    git://neil.brown.name/md md-devel
> (please don't do a full clone, it is not a very fast link).
>
> There is currently no mdadm support, but you can test it out and
> experiment without mdadm.
>
> In order to activate hot-replace you need to mark the device as
> 'replaceable'.
> This happens automatically when a write error is recorded in a
> bad-block log (if you happen to have one).
> It can be achieved manually by
>    echo replaceable > /sys/block/mdXX/md/dev-YYY/state
>
> This makes YYY, in XX, replaceable.
>
> If md notices that there is a replaceable drive and a spare it will
> attach the spare to the replaceable drive and mark it as a
> 'replacement'.
> This word appears in the 'state' file and as (R) in /proc/mdstat.
>
> md will then copy data from the replaceable drive to the replacement.
> If there is a bad block on the replaceable drive, it will get the data
> from elsewhere.  This looks like a "recovery" operation.
>
> When the replacement completes the replaceable device will be marked
> as Failed and will be disconnected from the array (i.e. the 'slot'
> will be set to 'none') and the replacement drive will take up full
> possession of that slot.

Neil,

Seems to work quite well.  Note I have not yet performed a data
consistency check, just exercised the mechanics of 'replacing' an
existing drive.

I see in the code that a recovery is kicked off immediately after
changing the state of a drive.  One question is whether it would be
possible to mark multiple drives for replacement and then invoke the
recovery once, replacing all of the marked disks in a single pass.
Right now, changing the state on multiple drives kicks off sequential
recoveries.  For larger disks (3TB/etc), each recovery takes a long
time and there is a non-zero performance hit on the live array.

There are two common use cases to think about.  The first is replacing
an array's disks with (say) larger ones.  The second is an array that
has been in service for a while, where the disks are approaching
end-of-life and several are showing signs of possible failure.  In both
cases we want to replace a number of drives at one time and incur the
performance hit only once.

I see where the code limits recovery to one sync at a time; would it be
possible to extend this to multiple concurrent replacements?  What
would it take to enable this?

Thanks again for this effort, this is terrific.

Best,
-PWM

> It is not possible to assemble an array with a replacement using mdadm.
> To do this by hand:
>
>    mknod /dev/md27 b 9 27
>    < /dev/md27
>    cd /sys/block/md27/md
>    echo 1.2 > metadata_version
>    echo 8:1 > new_dev
>    echo 8:17 > new_dev
>    ...
>    echo active > array_state
>
> Replace '27' by the md number you want.  Replace 1.2 by the metadata
> version number (must be 1.x for some x).  Replace 8:1, 8:17 etc
> by the major:minor numbers of each device in the array.
>
> Yes: this is clumsy.  But then you aren't doing this on live data -
> only on test devices to experiment.
>
> You can still assemble the array without the replacement using mdadm.
> Just list all the drives except the replacement in the --assemble
> command.
> Also once the replacement operation completes you can of course stop
> and assemble the new array with old mdadm.
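Putting the instructions above together end to end, a throw-away test on
scratch devices looks roughly like the sketch below.  The loop-device
setup, the array geometry and the plain 'mdadm --add' for the spare are
my own assumptions layered on top of Neil's description, so treat this
as illustrative rather than a recipe:

   # Scratch array only -- never live data.  Device names and sizes
   # are arbitrary examples.
   for i in 0 1 2 3 4; do
       dd if=/dev/zero of=/tmp/d$i bs=1M count=200
       losetup /dev/loop$i /tmp/d$i
   done
   mdadm --create /dev/md27 --level=5 --raid-devices=4 \
         /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3

   # Make a spare available; an ordinary --add should be enough, since
   # spare handling itself is not changed by this series (my assumption).
   mdadm /dev/md27 --add /dev/loop4

   # Mark one member replaceable; md should attach the spare as its
   # replacement, shown as (R) in /proc/mdstat, and start copying.
   echo replaceable > /sys/block/md27/md/dev-loop2/state
   cat /proc/mdstat
   cat /sys/block/md27/md/dev-loop2/state

   # Once the copy finishes the old member drops out of its slot, and
   # the array can be stopped and re-assembled with stock mdadm:
   mdadm --stop /dev/md27
   mdadm --assemble /dev/md27 /dev/loop0 /dev/loop1 /dev/loop4 /dev/loop3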
>
> I hope to submit this together with support for RAID10 (and maybe some
> minimal support for RAID1) for Linux-3.3.  By the time it comes out
> mdadm-3.3 should exist with full support for hot-replace.
>
> Review and testing is very welcome, but please do not try it on live
> data.
>
> NeilBrown
>
>
> ---
>
> NeilBrown (16):
>       md/raid5: Mark device replaceable when we see a write error.
>       md/raid5: If there is a spare and a replaceable device, start replacement.
>       md/raid5: recognise replacements when assembling array.
>       md/raid5: handle activation of replacement device when recovery completes.
>       md/raid5: detect and handle replacements during recovery.
>       md/raid5: writes should get directed to replacement as well as original.
>       md/raid5: allow removal for failed replacement devices.
>       md/raid5: preferentially read from replacement device if possible.
>       md/raid5: remove redundant bio initialisations.
>       md/raid5: raid5.h cleanup
>       md/raid5: allow each slot to have an extra replacement device
>       md: create externally visible flags for supporting hot-replace.
>       md: change hot_remove_disk to take an rdev rather than a number.
>       md: remove test for duplicate device when setting slot number.
>       md: take after reference to mddev during sysfs access.
>       md: refine interpretation of "hold_active == UNTIL_IOCTL".
>
>
>  Documentation/md.txt      |   22 ++
>  drivers/md/md.c           |  132 ++++++++++---
>  drivers/md/md.h           |   82 +++++---
>  drivers/md/multipath.c    |    7 -
>  drivers/md/raid1.c        |    7 -
>  drivers/md/raid10.c       |    7 -
>  drivers/md/raid5.c        |  462 +++++++++++++++++++++++++++++++++++----------
>  drivers/md/raid5.h        |   98 +++++-----
>  include/linux/raid/md_p.h |    7 -
>  9 files changed, 599 insertions(+), 225 deletions(-)
>
> --
> Signature
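P.S. on the data-consistency check I mentioned above: one straightforward
way would be to scrub the array and compare a whole-device checksum taken
before triggering the replacement.  sync_action and mismatch_cnt are the
existing md sysfs files; the checksum step and its file name are just a
hypothetical illustration:

   # Scrub the array and look for parity/data mismatches.
   echo check > /sys/block/md27/md/sync_action
   # ...wait for the check to finish, then:
   cat /sys/block/md27/md/mismatch_cnt

   # Compare against a checksum recorded before the replacement:
   #   md5sum /dev/md27 > /tmp/md27.before.md5     (taken beforehand)
   md5sum -c /tmp/md27.before.md5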