On Wed, Mar 24, 2010 at 5:35 PM, Neil Brown <neilb@xxxxxxx> wrote:
>
> Greetings.
> I find myself in the middle of two separate off-list conversations on the
> same topic and it has reached the point where I think the conversations
> really need to be united and brought on-list.
>
> So here is my current understanding and thoughts.
>
> The topic is about making rebuild after a failure easier. It strikes me as
> particularly relevant after the link Bill Davidsen recently forwarded to
> the list:
>
> http://blogs.techrepublic.com.com/opensource/?p=1368
>
> The most significant thing I got from this was a complaint in the comments
> that managing md raid was too complex and hence error-prone.
>
> I see the issue as breaking down into two parts.
> 1/ When a device is hot plugged into the system, is md allowed to use it
>    as a spare for recovery?
> 2/ If md has a spare device, what set of arrays can it be used in if
>    needed?
>
> A typical hot plug event will need to address both of these questions in
> turn before recovery actually starts.
>
> Part 1.
>
> A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM,
> DDF, other vendor metadata) or LVM or a filesystem. It might have a
> partition table which could be subordinate to or super-ordinate to other
> metadata (i.e. RAID in partitions, or partitions in RAID). The metadata
> may or may not be stale. It may or may not match - either strongly or
> weakly - metadata on devices in currently active arrays.
>
> A newly hotplugged device also has a "path" which we can see
> in /dev/disk/by-path. This is somehow indicative of a physical location.
> This path may be the same as the path of a device which was recently
> removed. It might be one of a set of paths which make up a "RAID chassis".
> It might be one of a set of paths on which we happen to find other RAID
> arrays.
>
> Somehow, from all of that information, we need to decide if md can use the
> device without asking, or possibly with a simple yes/no question, and we
> need to decide what to actually do with the device.
>
> Options for what to do with the device include:
>  - write an MBR and partition table, then do something as below with
>    each partition
>  - include the device (or partition) in an array that it was previously
>    part of, but from which it was removed
>  - include the device or partition as a spare in a native-metadata array
>  - add the device as a spare to a vendor-metadata array
>
> Part 2.
>
> If we have a spare device and a degraded array we need to know if it is OK
> to add the device as a hot-spare to that array.
> Currently this is handled (for native metadata) by 'mdadm --monitor' and
> the spare-group tag in mdadm.conf.
> For vendor metadata, if the spare is already in the container then mdmon
> should handle the spare assignment, but if the spare is in a different
> container, 'mdadm --monitor' should move it to the right container, but
> doesn't yet.
>
> The "spare-group" functionality works but isn't necessarily the easiest
> way to express the configuration desires. People are likely to want to
> specify how far a global spare can migrate using a physical address: a
> path.
>
> So for example you might specify a group of paths with wildcards, with the
> implication that all arrays which contain disks from this group of paths
> are automatically in the same spare-group.
>
>
> Configuration and State
>
> I think it is clear that configuration for this should go in mdadm.conf.
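
As a point of reference, the spare-group mechanism mentioned above is
expressed per-ARRAY in mdadm.conf and acted on by a running
"mdadm --monitor"; a rough sketch, with the device list, array names and
UUID placeholders all invented:

    # /etc/mdadm.conf
    DEVICE partitions
    ARRAY /dev/md0 UUID=<uuid-of-md0> spare-group=bank0
    ARRAY /dev/md1 UUID=<uuid-of-md1> spare-group=bank0

    # the monitor daemon migrates spares between arrays that share a
    # spare-group name
    mdadm --monitor --scan --daemonise

With both arrays in spare-group "bank0", a spare sitting in /dev/md1 can
be moved across to /dev/md0 by the monitor if /dev/md0 degrades and has no
spare of its own.
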
> This would at least cover identifying groups of devices by path and
> what is allowed to be done to those devices.
> It is possible that some configuration could be determined by inspecting
> the hardware directly. e.g. the IMSM code currently looks for an Option
> ROM which confirms that the right Intel controller is present and so the
> system can boot from the IMSM device. It is possible that other
> information could be gained this way so that the mdadm.conf configuration
> would not need to identify paths but alternatively identify some
> platform-specific concept.
>
> The configuration would have to say what is permitted for hot-plugged
> devices: nothing, re-add, claim-bare-only, claim-any-unrecognised.
> The configuration would also describe mobility of spares across
> different device sets.
>
> This would add a new line type to mdadm.conf, e.g.
> DOMAIN or CHASSIS or DEDICATED or something else.
> The line would identify
>    some devices by path or platform
>    a metadata type that is expected here
>    what hotplug is allowed to do
>    a spare-group that applies to all arrays which use devices from this
>       group/domain/chassis/thing
>    a source for the MBR? a template for partitioning? or would this
>       always be copied from some other device in the set if the hotplug=
>       setting allows partitioning?
>
> State required would include
>  - where devices have been recently removed from, and what they were in
>    use for
>  - which arrays are currently using which device sets, though that can
>    be determined dynamically from inspecting active arrays
>  - ?? partition tables off any devices that are in use, so that if they
>    are removed and a new device is added the partition table can be
>    replaced
>
> Usability
>
> The idea of being able to pull out a device and plug in a replacement and
> have it all "just work" is a good one. However I don't want to be too
> dependent on state that might have been saved from the old device.
> I would like to also be able to point to a new device which didn't exist
> before and say "use this". mdadm would use the path information to decide
> which container or set of drives was most appropriate, extract the
> MBR/partitioning from one of those, impose it on the new device and
> include the device or partitions in the appropriate array.
>
> For RAID over partitions, this assumes a fairly regular configuration:
> all devices partitioned the same way, and each array built out of a set
> of aligned partitions (e.g. /dev/sd[bcde]2).
> One of the strengths of md is that you don't have to use such a
> restricted configuration, but I think it would be very hard to reliably
> "do the right thing" with an irregular set up (e.g. a raid1 over a 1TB
> device and two 500GB devices in a raid0).
>
> So I think we should firmly limit the range of configurations for which
> auto-magic stuff is done. Vendor metadata is already fairly strongly
> defined: we just add a device to the vendor container and let it worry
> about the details. For native metadata we need to draw a firm line.
> I think that line should be "all devices partitioned the same" but I
> am open to discussion.
>
> If we have "mdadm --use-this-device-however" without needing to know
> anything about pre-existing state, then a hot-remove would just need to
> record that the device was used by arrays X and Y. Then on hot plug we
> could
>  - do nothing
>  - do something if the metadata on the device allows
>  - do use-this-device-however if there was a recent hot-remove of the
>    device
>  - always do use-this-device-however
> depending on configuration.
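
In today's terms, for native metadata on identically-partitioned disks,
the "use-this-device-however" flow would roughly automate what an
administrator does by hand. A sketch, assuming /dev/sdb is a healthy
surviving member, /dev/sdf is the blank replacement, and the degraded
array lives on the second partition of each disk:

    # clone the MBR partition layout from a surviving member to the new disk
    sfdisk -d /dev/sdb | sfdisk /dev/sdf

    # optionally copy the boot code as well (first 446 bytes of the MBR)
    dd if=/dev/sdb of=/dev/sdf bs=446 count=1

    # add the matching partition to the degraded array; md begins recovery
    # as soon as the spare is accepted
    mdadm /dev/md0 --add /dev/sdf2

The DOMAIN/hotplug configuration discussed below is essentially about
declaring when it is safe for steps like these to happen unattended.
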
>
> Implementation
>
> I think we all agree that migrating spares between containers is best
> done by "mdadm --monitor". It needs to be enhanced to intuit spare-group
> names from "DOMAIN" declarations, and to move spares between vendor
> containers.
>
> For hot-plug and hot-unplug I prefer to use udev triggers. Plug runs
>    mdadm --incremental /dev/whatever
> which would be extended to do other clever things if allowed.
> Unplug would run
>    mdadm --force-remove /dev/whatever
> which finds any arrays containing the device (or its partitions?),
> fail/removes them, and records the fact with a timestamp.
>
> However if someone has a convincing reason to build this functionality
> into "mdadm --monitor" instead, using libudev, I am willing to listen.
>
> Probably the most important first step is to determine a configuration
> syntax and be sure it is broad enough to cover all needs.
>
> I'm thinking:
>    DOMAIN path=glob-pattern metadata=type hotplug=mode spare-group=name
>
> I explicitly have "path=" in case we find there is a need to identify
> devices some other way - maybe by controller vendor:device or some other
> content-based approach.
> The spare-group name is inherited by any array with devices in this
> domain, as long as that doesn't result in it having two different
> spare-group names.
> I'm not sure if "metadata=" is really needed. If all the arrays that use
> these devices have the same metadata, it would be redundant to list it
> here. If they use different metadata ... then what?
> I guess two different DOMAIN lines could identify the same devices,
> list different metadata types and give them different spare-group
> names. However you cannot support hotplug of bare devices into both ...
>
> It is possible for multiple DOMAIN lines to identify the same device,
> e.g. by having more or less specific patterns. In this case the
> spare-group names are ignored if they conflict, and the hotplug mode used
> is the most permissive.
>
> hotplug modes are:
>    none    - ignore any hotplugged device
>    incr    - normal incremental assembly (the default). If the device has
>              metadata that matches an array, try to add it to the array
>    replace - if the above fails and a device was recently removed from
>              this same path, add this device to the same array(s) that
>              the old device was part of
>    include - if the above fails and the device has no recognisable
>              metadata, add it to any array/container that uses devices in
>              this domain, partitioning first if necessary
>    force   - as above, but ignore any pre-existing metadata
>
> I'm not sure that all those are needed, or that they are the best names.
> Names like
>    ignore, reattach, rebuild, rebuild_spare
> have also been suggested.
>
> It might be useful to have a 'partition=type' flag to specify MBR or
> GPT ??
>
> There, I think that just about covers everything relevant from the
> various conversations.
> Please feel free to disagree or suggest new use cases or explain why this
> would not work or would not be ideal.
> There was a suggestion that more state needed to be stored to support
> auto-rebuild (details of each device so they can be recovered exactly
> after a device is pulled and a replacement added). I'm not convinced of
> this but am happy to hear more explanations.
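
For concreteness, a rough sketch of how these pieces might fit together.
The path glob, metadata value and spare-group name are invented, and
--force-remove is the option proposed above, not something mdadm has
today:

    # mdadm.conf, using the proposed DOMAIN syntax
    DOMAIN path=pci-0000:00:1f.2-scsi-* metadata=1.x hotplug=replace spare-group=chassis0

    # udev rules wiring up the plug/unplug triggers; --incremental exists
    # today, --force-remove would be new
    ACTION=="add",    SUBSYSTEM=="block", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
    ACTION=="remove", SUBSYSTEM=="block", RUN+="/sbin/mdadm --force-remove $env{DEVNAME}"

With hotplug=replace, a disk inserted into a slot whose path matches the
glob, shortly after one was removed from that same path, would be put back
into whatever arrays the old disk was a member of.
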
>
> Thanks,
> NeilBrown

My feeling on the entire subject matter is that this is /not/ an easy
decision. Computers are rarely correct when they guess at what an
administrator wants, and attempting to implement the functionality within
mdadm is prone to many limitations or to re-inventing the wheel. If
mdadm / mdmon is part of the process at all, I think it should be used to
fork an executable (script or otherwise) which invokes the administrative
actions that have been pre-determined. I believe that the default action
should be to do /nothing/. That is the only safe thing to do.

If an administrative framework is desired, that seems to fall under a
larger project goal, one likely better covered by programs more aware of
the overall system state. This route also allows for a range of
scalability. It may be sufficient in an initramfs context to either spawn
a shell or even just wait in a recovery console after the mdadm invocation
returns failure. It might also be desirable to use a very simple reaction
which assumes that any spare of sufficient size which is added should be
allocated to the largest or closest comparable area, based on
pre-determined preferences.

At the same time, I could see the value in mapping actual physical
locations to an array, remembering any missing or failed device layouts,
and re-creating the same layouts on the new device. However, those actions
are a little above the level at which mdadm should be operating.

With both of those viewpoints in mind, I see the following solution. The
most specific action-match is followed. Action-matches should be
restrictable by path wildcard, by simple size comparisons, AND by metadata
state. As a final deciding factor, action-matches should also have an
optional priority value, so that when all else matches, one rule out of a
set is known to run first. The result of matching an action, once again,
should be to run an external program or shell, to allow for maximum
flexibility. I am not at all opposed to adding good default choices for
those actions in either binary or shell script form.
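
To make the shape of that concrete, a purely hypothetical sketch - the
ACTION syntax, the match keys and the handler paths are all invented, only
to illustrate the match-then-exec split:

    # hypothetical "action match" configuration; the most specific match
    # wins, priority breaks ties, and the only built-in behaviour is to
    # exec the named program - so with no rules the default is to do nothing
    ACTION path=pci-0000:00:1f.2-* size>=500G metadata=none priority=10 exec=/etc/mdadm/hotplug-rebuild
    ACTION path=* priority=0 exec=/etc/mdadm/hotplug-log-only

    #!/bin/sh
    # /etc/mdadm/hotplug-log-only - trivial example handler: record the
    # event (device path passed as $1) and take no action, which is the
    # proposed safe default
    logger -t md-hotplug "new device $1 matched no rebuild rule; doing nothing"

Anything smarter - partitioning, re-adding to particular arrays - would
live in site-provided handlers like the hypothetical hotplug-rebuild
script, rather than inside mdadm itself.
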