Re: [Patch] mdadm ignoring homehost?

On Apr 20, 2009, at 3:23 AM, Neil Brown wrote:
On Friday April 17, dledford@xxxxxxxxxx wrote:
On Apr 16, 2009, at 11:49 PM, Neil Brown wrote:
On Monday April 6, dledford@xxxxxxxxxx wrote:
On Apr 1, 2009, at 6:46 PM, Neil Brown wrote:

This appears to be the difference between a server setup and a desktop
setup.  Server admins want to list things and only have known actions
happen.  Desktop people want things to "just work".  I've had several
people tell me they thought the idea of mdadm.conf was completely out
of date and it should just go away entirely. Not saying I agree, just
letting you know what I get.

:-)

I'm not sure I'm happy with expecting people to do that
(though of course I'm happy to support it).

I really don't expect them to per se. More like it's the *safe* thing
to do.  If you ever have a conflict in names, the one in the file
wins. If you ever have a conflict in names without one of them in the file, then it's whoever got there first. In that sense, mdadm.conf is
just a backup for me.  Well, that and mkinitrd doesn't do incremental
assembly, so it's needed for boot in my case.  But that could be
changed.
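
For what it's worth, the sort of mdadm.conf I mean is nothing fancy,
just ARRAY lines keyed by UUID (the UUIDs below are made up for the
example):

  DEVICE partitions
  ARRAY /dev/md0 UUID=4b5c3f6a:8d2e1a9b:7f0c4d3e:2a1b9c8d
  ARRAY /dev/md/backup UUID=9e8d7c6b:5a4f3e2d:1c0b9a8f:7e6d5c4b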

So the safe thing to do is to create mdadm.conf.  But we all know that
the convenient thing to do is not to create mdadm.conf.

Thus safe and convenient are separate.  This sounds like bad design.

That's life.  It's always more convenient to do the non-safe thing.
The question isn't whether or not they are different, but how much
safety you give up for convenience.

I like it that not creating mdadm.conf is a little bit inconvenient in
that you are more likely to get names with _N suffixes.  It (I hope)
motivates people to become safe, either by making sure homehost works,
or by creating mdadm.conf.

The case that I want to avoid is this:
 You have two machines that each boot off their own md0.
 Late one night machine A dies.  So you get called in, while half
 asleep, to get the data back on line.
 You shut down B, pull the drives out of A and plug them into B and
 then boot B.
 You find that it made a root filesystem from the drives that were in A
 rather than in B.

This could be just inconvenient, or it could be a serious mess.

This is a total non-issue.  It can't happen (at least in Fedora); it's
a 100% impossibility.  The reason is that if you have a / raid array,
it's started by the initrd, and the initrd uses assemble mode and an
mdadm.conf file (you wouldn't be able to boot otherwise; regardless of
whether or not you've moved drives, the / raid array *must* be in
mdadm.conf).  The same is also true of the other machine.  So the only
way for this to happen is if the admin inserted the drives into a
location that was before the existing drives in the BIOS boot order,
in which case neither *we* nor mdadm can do a damn thing about it.
Doing something inconvenient in an attempt to solve a problem that we
*can't* solve does us no good whatsoever.

I don't want people to discover these potential naming conflicts while
trying to recover from a disaster.  I want them to discover them when
initially setting up their array.

Realistically, the admin would need to notice the different host name in this case. If it brought up the wrong root, then at a minimum, the host name should be off. If you are using dhcp host name setup for your servers in this situation, then depending on whether or not the servers are identical, you might actually be just fine running off the other root as the original machine. In any case though, we can't do anything about it. We booted off the wrong drive, so we'll have the wrong mdadm.conf and that mdadm.conf will think all the remote arrays are local.

To achieve that, I should probably make the _N suffix truly random
rather than simply arbitrary.  But I haven't done that.  Yet.


So the various parts of your algorithm which involve heuristics
based on the entries in mdadm.conf - or on the existence of mdadm.conf
itself - are parts that I don't feel comfortable with.

What is left?  Well, the observation that moving an external
multi-drive enclosure between hosts causes confusing naming is valid
and useful.

Someone should be able to create an array on such a device called
'foo' and get '/dev/md/foo' created on any host.
The best thought I have come to so far is to support (and document)
something like
--create --homehost=any
or
--create --homehost=*

with the meaning that the array so created will get preferential
access to its recorded name (i.e. no "_0" suffix).
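
For concreteness, creating a portable array with the proposed flag
would look something like this (--homehost=any is only the suggested
syntax at this point, and the member devices are just examples):

  mdadm --create /dev/md/foo --homehost=any --level=1 \
        --raid-devices=2 /dev/sdb1 /dev/sdc1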

I also wonder if, when mdadm finds an array that is explicitly for
another host, we could use that host name rather than _0 to
disambiguate.  So
--create /dev/md/foo --homehost=bob
when assembled on some other host is called
     /dev/md/foo_bob
That might at least make it more obvious what is happening.

This is probably where you and I disagree.  I don't think you are
disambiguating.  I think you are confounding the common case of no
conflict.  If someone has a non-portable array, like /, they commonly
use something like /dev/md0.  That, you will likely never get a
conflict on.

Except in the above false scenario, when you least want it to happen.
                      ^  There, corrected that for you.

             On the other hand, if someone creates an array to be
mobile, it will likely have a higher number (or it could be 0, but
that implies they aren't using root raid arrays on their machines in
all likelihood).  So, if you make a mobile array, just give it any old
number you can remember other than the normal base numbers used by
non-portable arrays, and voila, no conflicts.  (Note that this is also
why I was in favor of a completely numberless md setup, where device
major:minor do not impact the name of the array at all, and you are
free to create something like /dev/md/root with no device file other
than /dev/md/root, specifically no alias from /dev/md0 to
/dev/md/root...it's much easier to remember names than numbers, and
much easier to create a scheme that avoids conflicts 100% of the
time.)  As it stands though, the current code still won't honor such a
name as the official and canonical name of the array; it insists on
creating a /dev/md# device and then just symlinking the name as though
the /dev/md# device is canonical.  In one of your previous emails you
mentioned something about how bad design decisions get entrenched and
can never be rooted out; I would point to this
;-)

I had forgotten about this...
The kernel supports this.  We just need to make sure it works with
udev and get mdadm to use it.

echo md_foo > /sys/module/md_mod/parameters/new_array
ls -l /dev/md_foo

no numbers at all.

Even in the output of /proc/mdstat?

Maybe we can start using this in 3.1.
But I'm not sure how this relates to the current problem of how to
choose a name based on the contents of the metadata.



You draw a distinction between mobile and non-mobile arrays.  Quite
possibly that is a useful distinction to pursue.

It is the non-mobile arrays that I am particularly concerned about.
If someone plugs in a mobile array I'm happy to give them whatever
name seems like a good idea - conflicts aren't such a problem.

But how can we tell the difference???

Well, we could look in /etc/fstab (unless the same people who think
/etc/mdadm.conf is old fashioned manage to get rid of /etc/fstab as
well).

How about this:
 A name is 'local' if:
   it is associated with the array via mdadm.conf or
   it is associated with this host via 'homehost'
 A name is 'non-mobile' if:
   it is associated with some use in /etc/fstab
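
A rough sketch of that 'non-mobile' test in shell (the fstab matching
here is simplistic and purely illustrative, and the function name is
made up):

  # is the name "$1" non-mobile, i.e. does /etc/fstab reference it
  # as /dev/mdN or /dev/md/NAME in its device field?
  is_non_mobile() {
      grep -Eq "^[[:space:]]*/dev/md/?$1([[:space:]]|\$)" /etc/fstab
  }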

At least in Fedora, this is a useless distinction.  For an array to be
in fstab, it must first be in mdadm.conf.  We only allow non-fstab
arrays to be autoassembled without also being in mdadm.conf, and only
then on hot plug events that happen post-boot.

The Fedora mdadm bring-up sequence goes like this:

1) In initrd, bring up any / raid arrays (supports stacked arrays and
the like) using mdadm -As --run /dev/<device>; this way we support
degraded array bring-up as best we can.

2) In rc.sysinit, we start udev, but our udev incremental assembly
rule checks whether we are still in rc.sysinit and skips incremental
assembly as long as we are (see the rule sketch below).  The start
udev command initiates the only add event we will get on the devices
known at that time, so all those add events for devices already found
by the system get ignored.

3) Later in rc.sysinit, if we have both /sbin/mdadm and
/etc/mdadm.conf, we run mdadm -As --run to bring up all arrays listed
in mdadm.conf.  (Your patch for ignoring an array would be useful
here, and would allow even finer grained control than my
ASSEMBLY/INCREMENTAL settings, although I could see those two
complementing each other in that you could individually stop assembly
of just select arrays, or turn off all assembly, so I see value in
both options.)

4) Once we leave rc.sysinit, we have started all listed md raid
arrays, *and* we have mounted the local filesystems in fstab.  Only
now does udev incremental assembly start working, and since we've
already processed all the add events for devices present at boot, it
only attempts to assemble things plugged in after this point in time.
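
The rule itself is roughly the following single line (a sketch from
memory, not the verbatim Fedora rule; the flag file name is an
assumption about what rc.sysinit creates and removes):

  # skip incremental assembly while the rc.sysinit flag file exists
  SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", TEST!="/dev/.in_sysinit", RUN+="/sbin/mdadm -I $env{DEVNAME}"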

I should note that this method of splitting mdadm.conf assembly from
udev incremental assembly also allows me to start all the non-hotplug
devices with the --run option, while in my udev rule I only start
arrays when they are complete, never when degraded.  This way, if you
hot plug an incomplete array, we don't do anything automatically.  We
limit the automatic, make-it-"just work" actions to things that are
fully there, but accept a degraded state on stuff we need to boot.
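
Concretely, that's just the difference between the two invocations
(device names here are only examples):

  mdadm -I /dev/sdc1         # hot plug path: only runs the array once
                             # all expected members are present
  mdadm -As --run /dev/md0   # boot path: assemble from mdadm.conf and
                             # start it even if degraded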

Now, this might raise the question of "what if I put a hot plug array into my mdadm.conf, will that stop me from booting?" The answer is no. The mdadm -As --run command will start all available arrays listed in mdadm.conf. If you list mobile arrays in there, but they aren't plugged in, then they will get happily ignored (and if they are there, they'll get brought up, which would be necessary since udev won't process them later). In fact, if *none* of the arrays get started, the mdadm failure to start arrays will not stop the boot sequence. It won't be until you get to attempting to mount an array that isn't running that rc.sysinit will kick you out to a fix filesystem prompt. So, you are free to list arrays in mdadm.conf that you don't need for bootup, but you are required to list arrays you do need for bootup.

Then, if a name is either 'local' or not 'non-mobile', we feel free to
use it as it stands; otherwise we add a _N suffix.
I think this is fairly close to using 'my' rules for things listed in
/etc/fstab, and 'your' rules for everything else.

This is tempting, but feels like it might be a bit fragile.
Does anything other than /etc/fstab depend on device names to
find things that are stored on devices?

One fragility would appear when running "mdadm -As" in an initrd.
You might not have an /etc/fstab at all, so everything might get
assembled using the wrong set of rules.

Maybe there is a safe way to detect "in initrd" and impose the
conservative rules in that case.


Note that 0.90 metadata does contain homehost information to some
extent.  When homehost is set, the last few bytes of the uuid are set
from a hash of the homehost name.  That makes it possible to test if a
0.90 array was created for 'this' host, but not to find out what host
it was created for.  So the above expedient won't work for 0.90
arrays, but the rest of the homehost concept (including any possible
'homehost=any' option) does.

You note that arrays with no homehost are treated as foreign, which is
not always a good thing.  In 3.0, homehost is no longer optional.
If it is not explicitly set, it will default to `uname -n`.  So newly
created arrays will not suffer from this problem.  Arrays created with
mdadm 2.x do.  They can be 'upgraded' with
  --assemble --update=homehost
which is a suggestion that should be put in the man page.
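
For a specific array that would be something like (the member devices
are just examples):

  mdadm --assemble /dev/md0 --update=homehost /dev/sda1 /dev/sdb1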

This is a bad idea, and just reinforces my thought that we shouldn't
be paying attention to homehost.  Among the most important cases are
machines that are booted up, installed, have their raid arrays created
during install, then are shut down and moved, likely changing dhcp
hostnames in the process.  Now all your homehosts belong to some
hostname on some IT guy's install network instead of on your final
network.  At install time, it's actually fairly common that the
hostname is not yet set, especially at raid array creation time.

But it should be fairly straightforward for the IT guys to arrange
that an mdadm.conf gets created which records the UUID of the array.
If the UUID is in mdadm.conf, you don't need homehost.
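
mdadm will even generate those lines for you, e.g.:

  # append ARRAY lines (including UUIDs) for all running arrays
  mdadm --detail --scan >> /etc/mdadm.conf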

OK, then see above for why, at least in Fedora, homehost has no value.

Your idea of allowing the names "/dev/md0" and "md0" to connect with
the minor number '0' in the same way that the name "0" does is a good
one.  I have implemented that.

I think I am leaning towards 'homehost=any' rather than 'homehost=*'
and will implement that. (No one would have a computer called 'any'
would they?).

Thanks again for your input.

No problem.


Maybe a summary is in order.
We have:

A - arrays that clearly belong on 'this' machine.  Either they are
    unambiguously listed in mdadm.conf, or they contain homehost
    information that ties them to this computer.
B - arrays that explicitly list another host in their metadata
C - arrays that don't explicitly list a host.

and

a - device names that are explicitly recorded, e.g. in /etc/fstab
b - device names that are not explicitly used and so are only
    interesting to people.

We have:

1 - boot time, when we want to be cautious about not assembling
   the wrong thing
2 - normal run time when we have mounted all the really important
   filesystems and  we can be less cautious.

and we have:

i  - cases when we want to explicitly not assemble certain arrays,
     such as SAN environments
ii - cases when we want to assemble anything that appears

And various combinations that different people feel strongly about.
And the question is:  can we actually please all the people all the
time?

I think that if we can make a reliable and meaningful distinction
between 1 and 2, and between a and b,  and if we assemble only A in
case 1, and never assemble 'a' which is not 'A', and if we support
disabling of autoassembly for everything, or specific metadata types,
or specific arrays, in mdadm.conf - then we come pretty close.

Does anyone have thoughts on the 1 vs 2 distinction?  Or the a vs b
distinction?

I'm not sure that the B vs C distinction is of any value, but I
thought I would mention it for completeness.

IMO, with my name selection patch, with the option to list an array as
ignore, and with the option to turn either assembly or incremental
modes completely off as I suggested, done in combination with a boot
sequence like we now use in Fedora, you have this problem solved.  The
only niggling thing might be if mdadm -As --run without a device
specifier will pick up arrays via the DEVICE partitions line that
don't exist as ARRAY lines, but even if it does, my name patch means
that any ARRAY lines will supersede any randomly found devices, so the
expected name will go to the expected place and not elsewhere.  (I did
verify that the array that needs to have its normal name need not be
up and running for the interloper array to be kicked to another
name...it gets kicked on the fact that it doesn't match the array in
mdadm.conf, not on the presence of the matching array.)  (Note: the
bootup sequence I listed above is our new sequence as of F11; older
versions of Fedora were similar, but didn't limit incremental assembly
to only outside of rc.sysinit, and so there were some bugs in actual
usage.)


--

Doug Ledford <dledford@xxxxxxxxxx>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband





