I just wanted to reply to this email - I spent some time looking at the code but then I got a job :( So mainly to *bump* this text in case anyone else is interested in this feature and has more time/skill than I. (unless Neil's roadmap was realised...)

David

Neil Brown wrote:
> On Thursday May 14, david@xxxxxxxxxxxx wrote:
>> Neil Brown wrote:
>>> The one problem with this approach is that if there is a read error on
>>> /dev/suspect while data is being copied to /dev/new, you lose.
>>>
>>> Hence the requested functionality, which I do hope to implement for
>>> raid456 and raid10 (it adds no value to raid1).
>>> Maybe by the end of this year... it is on the roadmap.
>> Neil,
>> If you have ideas about how this should be accomplished, then outlining
>> them may provide a reasonable starting point for those new to the code;
>> especially if there are any steps that you can clearly see would help
>> others to make a start.
>
> As I said in some other email recently, I think an important precursor
> to this hot-replace functionality is to support a per-device bad-block
> list. This allows a device to remain in an array even if a few blocks
> have failed - only individual stripes will be degraded.
> Then the hot-replace function can be used not only on drives that are
> threatening bad blocks, but also on drives that have actually
> delivered bad blocks.
>
> The procedure for effecting a hot-replace would then be:
> - swap the suspect device for a no-metadata raid1 containing just
>   the suspect device (it's not clear to me yet exactly how this
>   will be managed, but I have some ideas)
> - add the new device to the raid1
> - enable an in-memory bad-block list for the raid1
> - allow a recovery that just recovers the data part of the
>   suspect device, not the metadata. Any read errors will simply add
>   to the bad-block list
> - for each entry in this suspect drive's bad-block list, trigger
>   a resync of just that block in the top-level array.
>   This involves setting up 'low' and 'high' values via sysfs and
>   writing 'repair' to sync_action.
>   This should clear the entry from the bad-block list.
> - once the bad-block list is clear ... sort out the metadata somehow,
>   and swap the new device in place of the raid1.
>
> Getting the metadata right is the awkward bit. When the main array
> writes metadata to the raid1, I don't want it to go to the new drive
> until the new drive actually has fully up-to-date data.
> The only way I can think of at the moment to make it work is to build
> a raid1 from just the data parts of the two devices, and use a linear
> array to combine that with the metadata parts of the suspect device,
> and give the linear array to the main device. That would work, but it
> seems rather ugly, so I'm not convinced.
>
> Anyway, the first step is getting a bad-block list working.
>
> Below are some notes I wrote a while ago when someone else was showing
> interest in a bad-block list. Nothing has come of that yet.
> It envisages the BBL being associated with an 'externally managed
> metadata' array. For this purpose, I would want it also to work for a
> "no metadata" array, and possibly for 1.x arrays with the kernel
> writing the BBL to the device (maybe).
>
> -------------------
> I envisage these changes to the kernel:
> 1/ Store a BBL with each rdev, and make it available for read/write
>    through a sysfs file (or two).
>    It would probably be stored as an RB-tree or similar. The
>    assumption is that the list would normally be very small and
>    sparse.
>
> 2/ Any READ request against a block that is listed in the BBL returns
>    a failure (or is detected by read-balancing and causes a different
>    device to be chosen).
>
> 3/ Any WRITE request against a block in the BBL is attempted, and if
>    it succeeds, the block is removed from the BBL.
>
> 4/ When recovery gets a read failure, it adds the block to the BBL
>    rather than trying to write it.
>    Adding a block to the BBL causes the sysfs file to report as
>    'urgent-readable' to 'poll' (POLLPRI), thus allowing userspace to
>    find the new bad blocks and add them to the list on stable storage.
>
> 5/ When a write error causes a drive to be marked as
>    'failed/blocked', userspace can either unblock and remove it (as
>    currently) or update the BBL with the offending blocks and
>    re-enable the drive.
>
> One difficulty is how to present the BBL through sysfs.
> A sysfs file is limited to 4096 characters, and we may want the BBL
> to be large enough to exceed that.
> I have an idea that entries in the BBL can be either 'acknowledged'
> or 'unacknowledged'. Then the sysfs file lists the unacknowledged
> blocks first. Userspace can write to the sysfs file to acknowledge
> blocks, which then allows other blocks to appear in the file.
>
> To read all the entries in the BBL, we could write a message that
> means "mark all entries as unacknowledged", then read and acknowledge
> until everything has been read.
>
> Alternatively, we could have a second file into which we can write
> the address of the smallest block that we want to read from the main
> file.
>
> I'm assuming that the BBL would allow a granularity of 512-byte
> sectors.
> -----------------------------------------------
>
> The 'bbl' would be a library of code that each raid personality can
> choose to make use of, much like the bitmap.c code.
>
> I think that implementing bbl.c should be a reasonably manageable
> project for someone with reasonable coding skills but minimal
> knowledge of md. It would involve:
> - creating and maintaining the in-memory bbl
> - providing access to it via sysfs
> - providing appropriate interface routines for md/raidX to call.
>
> We would then need to define a way to enable a bbl on a given device.
> I imagine one sysfs file would serve.
> The file '/sys/block/mdX/md/dev-foo/bbl'
> initially reads as 'none'.
> If you write 'clear' to it, an empty bbl is created.
> If you write "+sector-address", that address is added to it.
> If it was already present, it gets 'acknowledged'.
> If you write "-sector-address", that address is removed.
> If you write "flush" (??), all entries get un-acknowledged.
> If you read, you get all the un-acknowledged addresses, in order,
> then all the acknowledged addresses.
>
> It would be important that this does not slow IO down, so lookups
> should be fast.
> In most cases the list will be empty. In that case, the lookup must
> be extremely fast (definitely no locking).
>
> Is that enough to get you started :-)
>
> NeilBrown
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."