On Wed, Mar 24, 2010 at 5:35 PM, Neil Brown <neilb@xxxxxxx> wrote:
>
> Greetings.
> I find myself in the middle of two separate off-list conversations on the
> same topic and it has reached the point where I think the conversations
> really need to be united and brought on-list.
>
> So here is my current understanding and thoughts.
>
> The topic is about making rebuild after a failure easier. It strikes me as
> particularly relevant after the link Bill Davidsen recently forwarded to
> the list:
>
> http://blogs.techrepublic.com.com/opensource/?p=1368
>
> The most significant thing I got from this was a complaint in the comments
> that managing md raid was too complex and hence error-prone.
>
> I see the issue as breaking down into two parts.
> 1/ When a device is hot plugged into the system, is md allowed to use it
>    as a spare for recovery?
> 2/ If md has a spare device, what set of arrays can it be used in if
>    needed?
>
> A typical hot plug event will need to address both of these questions in
> turn before recovery actually starts.
>
> Part 1.
>
> A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM,
> DDF, other vendor metadata) or LVM or a filesystem. It might have a
> partition table which could be subordinate to or super-ordinate to other
> metadata (i.e. RAID in partitions, or partitions in RAID). The metadata
> may or may not be stale. It may or may not match - either strongly or
> weakly - metadata on devices in currently active arrays.
>
> A newly hotplugged device also has a "path" which we can see
> in /dev/disk/by-path. This is somehow indicative of a physical location.
> This path may be the same as the path of a device which was recently
> removed. It might be one of a set of paths which make up a "RAID chassis".
> It might be one of a set of paths on which we happen to find other RAID
> arrays.
>
> Somehow, from all of that information, we need to decide if md can use the
> device without asking, or possibly with a simple yes/no question, and we
> need to decide what to actually do with the device.
>
> Options for what to do with the device include:
>  - write an MBR and partition table, then do something as below with
>    each partition
>  - include the device (or partition) in an array that it was previously
>    part of, but from which it was removed
>  - include the device or partition as a spare in a native-metadata array
>  - add the device as a spare to a vendor-metadata array
>
> Part 2.
>
> If we have a spare device and a degraded array we need to know if it is OK
> to add the device as a hot-spare to that array.
> Currently this is handled (for native metadata) by 'mdadm --monitor' and
> the spare-group tag in mdadm.conf.
> For vendor metadata, if the spare is already in the container then mdmon
> should handle the spare assignment, but if the spare is in a different
> container, 'mdadm --monitor' should move it to the right container, but
> doesn't yet.
>
> The "spare-group" functionality works but isn't necessarily the easiest
> way to express the configuration desires. People are likely to want to
> specify how far a global spare can migrate using a physical address: a
> path.
>
> So for example you might specify a group of paths with wildcards, with the
> implication that all arrays which contain disks from this group of paths
> are automatically in the same spare-group.
>
>
> Configuration and State
>
> I think it is clear that configuration for this should go in mdadm.conf.
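
As a point of reference, the spare-group mechanism mentioned above is
expressed per-ARRAY in mdadm.conf and acted on by a running
"mdadm --monitor"; a rough sketch, with the device list, array names and
UUID placeholders all invented:

    # /etc/mdadm.conf
    DEVICE partitions
    ARRAY /dev/md0 UUID=<uuid-of-md0> spare-group=bank0
    ARRAY /dev/md1 UUID=<uuid-of-md1> spare-group=bank0

    # the monitor daemon migrates spares between arrays that share a
    # spare-group name
    mdadm --monitor --scan --daemonise

With both arrays in spare-group "bank0", a spare sitting in /dev/md1 can
be moved across to /dev/md0 by the monitor if /dev/md0 degrades and has no
spare of its own.
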
> This would at least cover identifying groups of devices by path and
> what is allowed to be done to those devices.
> It is possible that some configuration could be determined by inspecting
> the hardware directly. e.g. the IMSM code currently looks for an Option
> ROM which confirms that the right Intel controller is present and so the
> system can boot from the IMSM device. It is possible that other
> information could be gained this way so that the mdadm.conf configuration
> would not need to identify paths but alternatively identify some
> platform-specific concept.
>
> The configuration would have to say what is permitted for hot-plugged
> devices: nothing, re-add, claim-bare-only, claim-any-unrecognised.
> The configuration would also describe mobility of spares across
> different device sets.
>
> This would add a new line type to mdadm.conf, e.g.
> DOMAIN or CHASSIS or DEDICATED or something else.
> The line would identify
>    some devices by path or platform
>    a metadata type that is expected here
>    what hotplug is allowed to do
>    a spare-group that applies to all arrays which use devices from this
>       group/domain/chassis/thing
>    a source for the MBR? a template for partitioning? or would this
>       always be copied from some other device in the set if the hotplug=
>       setting allows partitioning?
>
> State required would include
>  - where devices have been recently removed from, and what they were in
>    use for
>  - which arrays are currently using which device sets, though that can
>    be determined dynamically from inspecting active arrays
>  - ?? partition tables off any devices that are in use, so that if they
>    are removed and a new device is added the partition table can be
>    replaced
>
> Usability
>
> The idea of being able to pull out a device and plug in a replacement and
> have it all "just work" is a good one. However I don't want to be too
> dependent on state that might have been saved from the old device.
> I would like to also be able to point to a new device which didn't exist
> before and say "use this". mdadm would use the path information to decide
> which container or set of drives was most appropriate, extract the
> MBR/partitioning from one of those, impose it on the new device and
> include the device or partitions in the appropriate array.
>
> For RAID over partitions, this assumes a fairly regular configuration:
> all devices partitioned the same way, and each array built out of a set
> of aligned partitions (e.g. /dev/sd[bcde]2).
> One of the strengths of md is that you don't have to use such a
> restricted configuration, but I think it would be very hard to reliably
> "do the right thing" with an irregular set up (e.g. a raid1 over a 1TB
> device and two 500GB devices in a raid0).
>
> So I think we should firmly limit the range of configurations for which
> auto-magic stuff is done. Vendor metadata is already fairly strongly
> defined: we just add a device to the vendor container and let it worry
> about the details. For native metadata we need to draw a firm line.
> I think that line should be "all devices partitioned the same" but I
> am open to discussion.
>
> If we have "mdadm --use-this-device-however" without needing to know
> anything about pre-existing state, then a hot-remove would just need to
> record that the device was used by arrays X and Y. Then on hot plug we
> could
>  - do nothing
>  - do something if the metadata on the device allows
>  - do use-this-device-however if there was a recent hot-remove of the
>    device
>  - always do use-this-device-however
> depending on configuration.
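
In today's terms, for native metadata on identically-partitioned disks,
the "use-this-device-however" flow would roughly automate what an
administrator does by hand. A sketch, assuming /dev/sdb is a healthy
surviving member, /dev/sdf is the blank replacement, and the degraded
array lives on the second partition of each disk:

    # clone the MBR partition layout from a surviving member to the new disk
    sfdisk -d /dev/sdb | sfdisk /dev/sdf

    # optionally copy the boot code as well (first 446 bytes of the MBR)
    dd if=/dev/sdb of=/dev/sdf bs=446 count=1

    # add the matching partition to the degraded array; md begins recovery
    # as soon as the spare is accepted
    mdadm /dev/md0 --add /dev/sdf2

The DOMAIN/hotplug configuration discussed below is essentially about
declaring when it is safe for steps like these to happen unattended.
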
>
> Implementation
>
> I think we all agree that migrating spares between containers is best
> done by "mdadm --monitor". It needs to be enhanced to intuit spare-group
> names from "DOMAIN" declarations, and to move spares between vendor
> containers.
>
> For hot-plug and hot-unplug I prefer to use udev triggers. Plug runs
>    mdadm --incremental /dev/whatever
> which would be extended to do other clever things if allowed.
> Unplug would run
>    mdadm --force-remove /dev/whatever
> which finds any arrays containing the device (or its partitions?),
> fail/removes them, and records the fact with a timestamp.
>
> However if someone has a convincing reason to build this functionality
> into "mdadm --monitor" instead, using libudev, I am willing to listen.
>
> Probably the most important first step is to determine a configuration
> syntax and be sure it is broad enough to cover all needs.
>
> I'm thinking:
>    DOMAIN path=glob-pattern metadata=type hotplug=mode spare-group=name
>
> I explicitly have "path=" in case we find there is a need to identify
> devices some other way - maybe by controller vendor:device or some other
> content-based approach.
> The spare-group name is inherited by any array with devices in this
> domain, as long as that doesn't result in it having two different
> spare-group names.
> I'm not sure if "metadata=" is really needed. If all the arrays that use
> these devices have the same metadata, it would be redundant to list it
> here. If they use different metadata ... then what?
> I guess two different DOMAIN lines could identify the same devices,
> list different metadata types and give them different spare-group
> names. However you cannot support hotplug of bare devices into both ...
>
> It is possible for multiple DOMAIN lines to identify the same device,
> e.g. by having more or less specific patterns. In this case the
> spare-group names are ignored if they conflict, and the hotplug mode used
> is the most permissive.
>
> hotplug modes are:
>    none    - ignore any hotplugged device
>    incr    - normal incremental assembly (the default). If the device has
>              metadata that matches an array, try to add it to the array
>    replace - if the above fails and a device was recently removed from
>              this same path, add this device to the same array(s) that
>              the old device was part of
>    include - if the above fails and the device has no recognisable
>              metadata, add it to any array/container that uses devices in
>              this domain, partitioning first if necessary
>    force   - as above, but ignore any pre-existing metadata
>
> I'm not sure that all those are needed, or that they are the best names.
> Names like
>    ignore, reattach, rebuild, rebuild_spare
> have also been suggested.
>
> It might be useful to have a 'partition=type' flag to specify MBR or
> GPT ??
>
> There, I think that just about covers everything relevant from the
> various conversations.
> Please feel free to disagree or suggest new use cases or explain why this
> would not work or would not be ideal.
> There was a suggestion that more state needed to be stored to support
> auto-rebuild (details of each device so they can be recovered exactly
> after a device is pulled and a replacement added). I'm not convinced of
> this but am happy to hear more explanations.
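
For concreteness, a rough sketch of how these pieces might fit together.
The path glob, metadata value and spare-group name are invented, and
--force-remove is the option proposed above, not something mdadm has
today:

    # mdadm.conf, using the proposed DOMAIN syntax
    DOMAIN path=pci-0000:00:1f.2-scsi-* metadata=1.x hotplug=replace spare-group=chassis0

    # udev rules wiring up the plug/unplug triggers; --incremental exists
    # today, --force-remove would be new
    ACTION=="add",    SUBSYSTEM=="block", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
    ACTION=="remove", SUBSYSTEM=="block", RUN+="/sbin/mdadm --force-remove $env{DEVNAME}"

With hotplug=replace, a disk inserted into a slot whose path matches the
glob, shortly after one was removed from that same path, would be put back
into whatever arrays the old disk was a member of.
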
>
> Thanks,
> NeilBrown

My feeling on the entire subject matter is that this is /not/ an easy
decision. Computers are rarely correct when they guess at what an
administrator wants, and attempting to implement the functionality within
mdadm is prone to many limitations or to re-inventing the wheel. If
mdadm / mdmon is part of the process at all, I think it should be used to
fork an executable (script or otherwise) which invokes the administrative
actions that have been pre-determined. I believe that the default action
should be to do /nothing/. That is the only safe thing to do.

If an administrative framework is desired, that seems to fall under a
larger project goal, one likely better covered by programs more aware of
the overall system state. This route also allows for a range of
scalability. It may be sufficient in an initramfs context to either spawn
a shell or even just wait in a recovery console after the mdadm invocation
returns failure. It might also be desirable to use a very simple reaction
which assumes that any spare of sufficient size which is added should be
allocated to the largest or closest comparable area, based on
pre-determined preferences.

At the same time, I could see the value in mapping actual physical
locations to an array, remembering any missing or failed device layouts,
and re-creating the same layouts on the new device. However, those actions
are a little above the level at which mdadm should be operating.

With both of those viewpoints in mind, I see the following solution. The
most specific action-match is followed. Action-matches should be
restrictable by path wildcard, by simple size comparisons, AND by metadata
state. As a final deciding factor, action-matches should also have an
optional priority value, so that when all else matches, one rule out of a
set is known to run first. The result of matching an action, once again,
should be to run an external program or shell, to allow for maximum
flexibility. I am not at all opposed to adding good default choices for
those actions in either binary or shell script form.
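
To make the shape of that concrete, a purely hypothetical sketch - the
ACTION syntax, the match keys and the handler paths are all invented, only
to illustrate the match-then-exec split:

    # hypothetical "action match" configuration; the most specific match
    # wins, priority breaks ties, and the only built-in behaviour is to
    # exec the named program - so with no rules the default is to do nothing
    ACTION path=pci-0000:00:1f.2-* size>=500G metadata=none priority=10 exec=/etc/mdadm/hotplug-rebuild
    ACTION path=* priority=0 exec=/etc/mdadm/hotplug-log-only

    #!/bin/sh
    # /etc/mdadm/hotplug-log-only - trivial example handler: record the
    # event (device path passed as $1) and take no action, which is the
    # proposed safe default
    logger -t md-hotplug "new device $1 matched no rebuild rule; doing nothing"

Anything smarter - partitioning, re-adding to particular arrays - would
live in site-provided handlers like the hypothetical hotplug-rebuild
script, rather than inside mdadm itself.
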