RFC - device names and mdadm with some reference to udev.

Neil Brown <neilb@xxxxxxx> · Mon, 27 Oct 2008 09:56:12 +1100

Greetings.
 This is a Request For Comments....

 Device naming in mdadm is a bit of a mess.
 We have partitioned devices (mdp) and non-partitioned (md).
 We have names in /dev/md/ (/dev/md/d0) and directly in /dev
    (/dev/md_d0).
 We have support for user-friendly names (/dev/md/home) and for
    "kernel-internal" names (/dev/md0).

 All this can produce extra confusion when udev is brought into the
 picture.  And it can leave lots of litter lying around in /dev if we
 aren't careful (which we aren't).

 I hope to release mdadm-3.0 this year, and maybe that gives me a
 chance to get it "right".  I don't want to break backwards
 compatibility in a big way, but I think I am happy to introduce
 little changes if it means a more consistent model.

 In 2.6.28, partitioned devices (mdp) won't be needed any more as md
 will make use of the "extended partition" functionality recently
 added.  All md devices can be partitioned.  The device number for the
 partitions will be very different to that of the whole device, but
 udev should hide all of that.  So we don't have to worry too much
 about mdp devices.

 So I think the following is how I want things to work.  I am very
 open to comments and suggestions.  Particularly I want to know what
 (if anything) this will break.

 1/ The only device nodes created will be /dev/mdX and /dev/md_dX
    along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
    These will be created by mdadm in accordance with the "--auto"
    flag unless something in mdadm.conf says to leave it to udev.
    In that case, mdadm will create a temporary node
    (/dev/.mdadm.whatever) and remove it once udev has created the
    real thing.
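
    To make the naming in point 1 concrete, here is a minimal Python
    sketch.  node_names() is a hypothetical helper, not anything in
    mdadm; it only spells out which node names would exist for a
    given minor number, assuming partitions are numbered from 1.

```python
def node_names(minor, partitioned=False, partitions=0):
    """Return the /dev node names expected for one array.

    Hypothetical helper for illustration only: it follows the naming
    in point 1 (/dev/mdX or /dev/md_dX, plus a pY suffix for each
    partition) and creates nothing on disk.
    """
    if partitioned:
        base = "/dev/md_d%d" % minor
    else:
        base = "/dev/md%d" % minor
    # Partition nodes are the whole-device name followed by "p" and the number.
    return [base] + ["%sp%d" % (base, p) for p in range(1, partitions + 1)]
```

    For example, node_names(0, partitioned=True, partitions=2) lists
    the whole device first and then its two partitions.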

 2/ There will be various symlinks to these devices.
    a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
         /dev/md/X or /dev/md/dX will be created.
    b/ if udev is configured as it is on Debian,
              /dev/disk/by-id/md-name-XXXX
        and   /dev/disk/by-id/md-uuid-UUUU
       will be created (by udev).
    c/ If there is a 'name' associated with the array then
        /dev/md/name will be created as a link.
    d/ if an explicit device name of /dev/name was given,
        either on a -A, -B, or -C command or in mdadm.conf,
        then the 'name' must match the name of the array,
        and /dev/name will be used as well as /dev/md/name.

 3/ For a 'NAME' to be used, either as md-name-NAME or as /dev/md/NAME,
    we need a high degree of confidence that the array was intended
    for "this" host, or otherwise is not going to conflict with
    an array that is meant for "this" host.
    We get this confidence in a number of ways:
    a/ If the name is listed in /etc/mdadm.conf
       e.g.  ARRAY /dev/md/home UUID=XXXX.....
    b/ If the name was given on the command line
    c/ If the name is stored in the metadata of an array which is
       explicitly identified in mdadm.conf or by the command line.
    d/ If the name is of the form "host:name" and "host" matches
       this host.  We then use just "name".
    e/ If the name is of the form "host:name" and "host" does not
       match this host, we can still assume that "host:name" is
       unique and use that.
    f/ For 0.90 metadata, if the uuid has the host name encoded in it
       then it was intended for 'this' host.

    Thus unsafe names are names extracted from the metadata of arrays
    which are auto-detected, where there is no hint in the metadata
    that the array is built for 'this' host.

    If the NAME is not known to be safe, we can still assemble the
    array, but we use a "random" high minor number, and allow it
    to be found primarily by the by-id/md-uuid-UUUUU... link or some
    other link created based on array content: e.g. disk/by-label/
    Also the array will be assembled "auto-readonly" so no resync etc
    will happen until the array is actually used.
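
    The decision described in point 3 could be sketched roughly as
    below.  This is only an illustration in Python, not mdadm code:
    choose_name() and all of its parameters are hypothetical, and the
    order in which the rules are tried is my assumption (the rules
    themselves do not specify one).

```python
def choose_name(name, homehost, listed_in_conf=False, on_command_line=False,
                array_explicitly_identified=False, uuid_has_homehost=False):
    """Return the name to use for links, or None if the name is unsafe.

    Hypothetical sketch of the rules in point 3.  A None result means
    the array can still be assembled, but with a "random" high minor
    number and without a name-based link.
    """
    host, sep, short = name.partition(":")
    # "host:name" where host is this host: use just "name".
    if sep and host == homehost:
        return short
    # Name vouched for by mdadm.conf or the command line, or taken from
    # the metadata of an array that was explicitly identified.
    if listed_in_conf or on_command_line or array_explicitly_identified:
        return name
    # "host:name" for some other host: assumed unique, so use it whole.
    if sep:
        return name
    # 0.90 metadata with this host's name encoded in the uuid.
    if uuid_has_homehost:
        return name
    # Auto-detected, with no hint that the array belongs to this host.
    return None
```

    With homehost "alpha", an auto-detected array named "alpha:home"
    would get the name "home", "beta:home" would keep its full name,
    and a bare "home" with no other evidence would be treated as
    unsafe.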

    mdadm-3.0 will be able to support "containers" such as a set of
    devices with DDF metadata.  These can then contain a number of
    different arrays.  If the 'container' is known to be local to
    'this' host, then we assume that all contained arrays are too.

    I'm contemplating creating a link based on the metadata type with
    a sequential number. e.g. /dev/md/ddf1 or /dev/md/imsm2.
    I'm not sure if these should be in /dev/md/ or directly in /dev/.
    I'm also not sure if I should leave the creation to udev, and
    whether I should use a small sequential number, or just whatever
    number was allocated as the minor number of the device.

 4/ When we stop an array, mdadm will remove anything from /dev that
    it probably created.
    In particular, it will remove the device node as described in 1,
    any partitions, and any symlinks in /dev or /dev/md which point to
    any of those.  I need to be certain that this won't confuse udev.
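
    As a rough illustration of the clean-up in point 4, the Python
    sketch below finds the symlinks that point at a given set of
    device nodes.  links_to_remove() is a hypothetical helper, not
    mdadm code, and it only reports candidates; it removes nothing
    and says nothing about how udev would react.

```python
import os


def links_to_remove(directories, nodes):
    """List symlinks in `directories` that resolve to one of `nodes`.

    Hypothetical helper: `directories` would be e.g. /dev and /dev/md,
    and `nodes` the device node and partition nodes of the array that
    is being stopped.
    """
    targets = {os.path.realpath(n) for n in nodes}
    found = []
    for d in directories:
        if not os.path.isdir(d):
            continue
        for entry in sorted(os.listdir(d)):
            path = os.path.join(d, entry)
            # Only symlinks are candidates; real nodes are handled separately.
            if os.path.islink(path) and os.path.realpath(path) in targets:
                found.append(path)
    return found
```

    Dangling links, and links that point at some unrelated device,
    are left alone because they do not resolve to any of the nodes.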

 5/ I want to enable assembly without having to give
    an explicit device name, thus requiring mdadm to automatically
    assign one just as it would for auto-assembly.
    In particular, the "ARRAY" line in mdadm.conf will no longer
    require an array name.  That would mean that "-Es" wouldn't need
    to produce an array name (which is not always easy).
    So:
        mdadm -Es > /tmp/mdadm.conf
        mdadm -Asc/tmp/mdadm.conf
    would leave the choice of device name to the "-A" stage which is
    the only time that unique non-predictable names can be chosen.

 6/ I'm thinking that if the array name given to --create or
    --assemble looks as though it identifies a metadata type, by
    having the name of a metadata type followed by some digits,
    e.g. /dev/ddf0 or /dev/md/imsm3
    then we insist that the array have that metadata type.
    That could mean that a future metadata type might conflict with
    a previously valid usage, which would be a bore.
    Maybe if there are trailing digits, then it *must* identify a
    metadata type, or be "mdNN".
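
    The check in point 6 might look something like the Python sketch
    below.  It is an illustration only: implied_metadata() is a
    hypothetical name, and METADATA_TYPES is an assumed list based on
    the two examples above, not the set mdadm would really use.

```python
import os
import re

# Assumed list for illustration; taken from the examples in the text.
METADATA_TYPES = ("ddf", "imsm")


def implied_metadata(device_name):
    """Return the metadata type a device name implies, or None.

    A name implies a type when its last path component is the name of
    a metadata type followed by one or more digits.
    """
    base = os.path.basename(device_name)
    m = re.fullmatch(r"([a-z]+)(\d+)", base)
    if m and m.group(1) in METADATA_TYPES:
        return m.group(1)
    return None
```

    So /dev/ddf0 and /dev/md/imsm3 would each insist on a metadata
    type, while /dev/md0 and /dev/md/home would not.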

Some issues that all of this needs to address:

 1/ People want auto-assembly.  I've always fought against it (we
    don't auto-mount all filesystems do we?).  But it is a losing
    battle.  And on a modern desktop, when you plug in a new drive the
    filesystem is automatically mounted.  So my argument is falling
    apart.

 2/ Auto-assembly of new arrays must not conflict with auto-assembly
    of previously existing arrays, even if the devices comprising the
    new arrays are discovered earlier.  This is what the 'homehost'
    concept is for.  Your array will only get assembled with a
    predictable name if it is known to be attached to 'this' host.

 3/ Auto-assembly needs to handle incremental arrival of devices
    correctly.  There are no easy solutions to this, particularly when
    e.g. ext3 can write to the device even when mounted read-only (for
    journal replay).
    I think the best that I can do for now is assemble things
    'read-auto' to delay any writes as long as possible in the hope
    that all available devices will be connected by then.
    Adding in-memory bitmaps for all degraded arrays to accelerate
    rebuild would help but won't be in 2.6.28.

 4/ Auto-assembly needs to do the right thing on a SAN where multiple
    hosts can each see multiple arrays.  Clearly only one host should
    write to any one array at one time (until I get some
    cluster-awareness going, which I had hoped to work on this year,
    but it doesn't look like I will).
    In this case, I don't think read-auto is enough.  We either need
    to not assemble arrays when they aren't known to belong to us, or we
    need to assemble them read-only and require an explicit
    read-write setting.

    So we need some way to know which devices could be visible to
    other hosts.
    I could have a global flag in mdadm.conf "Options SAN"
    I could have a SAN-DEVICES to match "DEVICES", but as just about
    everything is "/dev/sd*" these days, I don't know if that would
    work.

    Any suggestions concerning this would be welcome.

I'm also wondering if I should include a udev 'rules' file for md in
the mdadm distribution.  Obviously it would be no more than a
recommendation, but it might give me a voice in guiding how udev
interacted with mdadm.

Any thoughts on any of this would be most welcome.

Thanks,
NeilBrown
