Re: ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux

Hans de Goede <hdegoede@xxxxxxxxxx> · Thu, 26 Nov 2009 10:31:18 +0100

Hi Doug,

That is a lot of information. Let me try to summarize it;
please let me know if I've missed anything:

1) The default chunk size for raid4/5/6 is changing. This should
   not be a problem, as we do not specify a chunk size when creating
   new arrays

2) The default bitmap chunk size changed, again not a problem as
   we don't use bitmaps in anaconda atm

3) We need to stop creating arrays without a bitmap: we should use a
   bitmap by default, except when the array will be used for /boot or swap.

   Questions:
   1) What commandline option should we pass to "mdadm --create" to
      achieve this?

4) We need to start specifying a superblock version, and preferably
   version 1.1

5) Specifying a superblock version of 1.1 will render systems
   non-bootable. I assume this only applies to systems which have
   a raid1 /boot, so I guess that we need to specify a superblock
   version of 1.1 except when the raid set will be used for /boot,
   where we should keep using 0.9

   Questions:
   1) Is the above correct ?

6) When creating 1.1 superblock sets we need to pass in:
   --homehost=<hostname>
   --name=<devicename>
   -e{1.0,1.1,1.2}

   Questions:
   1) Currently when creating a set, we do for example:
      mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

      What would this look like with the new mdadm? In particular, what would
      happen to the /dev/md0 argument?

   If we can still specify which minor to use when creating a new array, even though
   that minor may change after the first reboot, then the number of changes needed
   to the installer is minimal and we can likely do this without problems for RHEL-6.

Regards,

Hans

On 11/26/2009 03:59 AM, Doug Ledford wrote:
Please keep me on the Cc: as I'm not on this list.

Upstream recently released mdadm-3.1.1, which I intend to include in
Fedora soon.  It finally updates three default settings that should have
been updated a long time ago.

The default chunk size for raid4/5/6 is now 512K.  Anaconda needs to be
updated to either leave the default alone or use 512K itself.  In the
past it has passed in 256K, but extensive performance testing shows that
512K is indeed the sweet spot on pretty much any SATA device.  Since
SATA disks are the overwhelming majority of what we run on today, their
sweet spot should be our default.

It updates the default bitmap chunk to be at least 65536K when using an
internal bitmap.  Performance tests showed as much as a 10% performance
penalty for the old default bitmap chunk (8192K).  The new bitmap chunk
reduces that performance penalty (although we don't have solid numbers
on how much...I'll work on that).  However, we've never used a bitmap by
default on any arrays we create.  That needs to change.  The simple
logic is this: no bitmap on /boot or any swap partitions, use a bitmap
on anything else.  If we need a bitmap chunk other than the default,
I'll follow up here.

It updates the default superblock format from the old, antiquated,
deprecated version 0.90 superblock that we should have quit using years
ago to version 1.1.  This is the real kicker.  Since anaconda has never
actively set the superblock metadata version (even though we should have
been using 1.1 long ago), it's now going to have to start.  The reason
is that unless you upgrade machines to use an md-raid-aware boot loader
(such as grub2 on x86; I have no idea what would work on non-x86
arches), version 1.1 superblocks will render all installs unbootable.
More importantly though, unless the anaconda team decides to blindly set
all superblocks back to the old version 0.90 format, this change
necessitates more than just a change to controlling which version of 1.x
superblock we use on any given array, but also a change to how we create
and name arrays in general.  Version 0.90 superblocks are from back in
the day when we thought it was smart/reasonable to name arrays by number
and to mount scsi devices in fstab by their /dev/ entry.  That day is
long gone, dead and buried.  We switched filesystems to mount
by label so they are immune to device number changes and similarly
version 1.x superblocks totally do away with the preferred-minor field
in the superblock.  Instead, they have a homehost and name field that
are used to control device *naming*, not numbering, and in a properly
running version 1.x superblock system, the device numbers are not
guaranteed to be static from boot to boot (although they usually are).
This doesn't appear to be much of a problem for dracut, but as an example,
I'm attaching the mkinitrd patch I have to apply to an F11 system after
every mkinitrd update in order to get initrd images that mount by name
properly.

So, those are the major differences.  Switching to any of the version
1.x superblocks necessitates that anaconda pass a few arguments that it
hasn't in the past.  Right now, these are the things anaconda is going
to need to start passing in on any mdadm create commands (I don't
believe it currently does, but I haven't checked and could be wrong):

--homehost=<hostname>
--name=<devicename>
-e{1.0,1.1,1.2}

In addition, we should start passing the bitmap option as I outlined above.
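
To make the combination of options concrete, here is a minimal sketch of
how an installer might assemble the create command.  The helper
build_create_argv and its parameters are hypothetical, not anaconda code.
It also assumes that /boot stays on a 0.90 superblock so that a boot
loader without md raid support can still read it, which is one possible
policy rather than a settled decision.  The mdadm options themselves
(--metadata, --homehost, --name, --bitmap=internal) are the long forms
documented in the mdadm man page.

```python
from typing import List, Optional


def build_create_argv(
    device: str,
    level: str,
    members: List[str],
    homehost: str,
    name: str,
    mountpoint: Optional[str] = None,
    is_swap: bool = False,
) -> List[str]:
    """Assemble an argv list for "mdadm --create" (hypothetical helper)."""
    on_boot = mountpoint == "/boot"
    # Assumption: keep 0.90 on /boot for boot loaders that are not md aware.
    metadata = "0.90" if on_boot else "1.1"
    argv = [
        "mdadm", "--create", device, "--run",
        "--level=" + level,
        "--raid-devices=" + str(len(members)),
        "--metadata=" + metadata,
        "--homehost=" + homehost,
    ]
    # The name field only exists in version 1.x superblocks.
    if metadata != "0.90":
        argv.append("--name=" + name)
    # Policy from this mail: no bitmap on /boot or swap, bitmap elsewhere.
    if not (on_boot or is_swap):
        argv.append("--bitmap=internal")
    return argv + members


if __name__ == "__main__":
    print(" ".join(build_create_argv(
        "/dev/md/root", "1", ["/dev/sda2", "/dev/sdb2"],
        homehost="workstation1", name="root", mountpoint="/")))
```

Returning a list rather than a single string keeps the result suitable
for subprocess.run without any shell quoting.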

We will also likely need to set the HOMEHOST entry in mdadm.conf and
possibly the AUTO entry in mdadm.conf as well.

And this brings me to a different point.  Hans asked me to comment on
bz537329.  I would suggest people look at my comments there for some
additional explanation of why ideas like trying to make things work
without mdadm.conf are probably a bad idea.

So here are a few additional things that I think are worth taking into
consideration.

If an array is listed in mdadm.conf, then *every* item on the array line
must match the array or else it will fail to start.  This means that an
ARRAY line listing attributes that mdadm --grow can change may cause
the array to not be found on the next reboot after such a change.
Therefore, it would be best if each new ARRAY line we write includes
nothing besides the name of the array, the metadata version, and the
UUID.
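
As an illustration of such a minimal line, here is a small sketch that
formats an ARRAY entry from just those three items.  The function name
array_line is made up for this example, and the UUID shown is a
placeholder, not a real array.

```python
def array_line(device: str, metadata: str, uuid: str) -> str:
    """Format an mdadm.conf ARRAY line with only name, metadata and UUID."""
    # Deliberately omit level=, num-devices= and similar keywords, since
    # those describe attributes that "mdadm --grow" can change later.
    return "ARRAY {} metadata={} UUID={}".format(device, metadata, uuid)


if __name__ == "__main__":
    # Placeholder UUID for illustration only.
    print(array_line("/dev/md/root", "1.1",
                     "00000000:00000000:00000000:00000000"))
```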

If an array is listed in mdadm.conf, then both the --homehost and --name
settings will be overridden by the name in the mdadm.conf file, so do
not depend on either having an effect for arrays listed in mdadm.conf.

However, homehost and name are both used heavily any time the array is
not listed in mdadm.conf so setting them correctly is still important.
There are a number of common scenarios that make this important: when
you are carrying an array from machine to machine (like an external
drive tower, or raid1 usb flash drive, etc.), when an array is visible
to multiple hosts (like arrays built over SAN devices), or when you've
built a machine to replace an existing machine and you temporarily
install the drives from the machine being replaced in the new machine to
copy data across, in which case you are starting both your new array and
the old array on the same machine.  They are also relied upon heavily in
order to attempt to satisfy those people that think the md raid stack
should work without any mdadm.conf file at all.  And there is a special
case exception in the name field that is used to attempt to preserve
backward compatibility.  The intersection of all these attempts to satisfy
various needs is tricky.  Here's how names are determined:

1) If the array is identified in mdadm.conf, the name from the ARRAY
   line is used.
2) Else, if HOMEHOST has been set in the config:
	a) If the array uses a version 0.90 superblock, check whether the
	   HOMEHOST has been encoded in the UUID via hash.  If so, treat
	   as local; if not, treat as foreign.
	b) For version 1.x superblocks, check the homehost in the
	   superblock against the set homehost.  If they match, treat as
	   local; else if the homehost in the superblock is not empty,
	   treat as named foreign; else treat as foreign.
3) Else:
	a) For version 0.90 superblocks, treat the array as foreign.
	b) For version 1.x superblocks, if the homehost in the superblock
	   is set, treat as named foreign; else treat as foreign.

In case #1, the name as it appears in the file is used.  In the remaining
cases, local means to attempt to create the array with the requested
number (in the case of 0.90 superblocks) or requested name (in the case
of version 1.x superblocks).  Foreign means that the array will be
started with the requested name + a suffix.  For example, version 0.90
superblock with preferred-minor of 0 would get created with a random
*actual* minor number and the name /dev/md0_0 or md0_1 if md0_0 already
exists, etc.  A version 1.x superblock with the name root would get
created as /dev/md/root_0.  Named foreign is used whenever a version 1.x
superblock can't be identified as local but it has a valid homehost
entry in the superblock.  The format attempted is /dev/md/homehost:name so
that if you were to mount an array from workstation2:root on
workstation1, it would be /dev/md/workstation2:root.
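
The rules above for arrays that are not listed in mdadm.conf can be
sketched roughly as follows.  This is only my reading of the description
in this mail, not mdadm's actual code: the function names, the
"named-foreign" label, and the uuid_matches_homehost flag (standing in
for the hash check on 0.90 UUIDs) are all made up for the example, and I
am assuming that 1.x foreign names pick the next free suffix the same
way the 0.90 ones do.

```python
from typing import Optional, Set


def classify(version: str, sb_homehost: str = "", conf_homehost: str = "",
             uuid_matches_homehost: bool = False) -> str:
    """Return "local", "foreign" or "named-foreign" for an unlisted array."""
    legacy = version.startswith("0.")
    if conf_homehost:
        if legacy:
            # 0.90 has no homehost field; it is hashed into the UUID.
            return "local" if uuid_matches_homehost else "foreign"
        if sb_homehost == conf_homehost:
            return "local"
        return "named-foreign" if sb_homehost else "foreign"
    if legacy:
        return "foreign"
    return "named-foreign" if sb_homehost else "foreign"


def device_path(kind: str, base: str, sb_homehost: str = "",
                name: str = "", existing: Optional[Set[str]] = None) -> str:
    """Pick the device name; base is e.g. /dev/md0 or /dev/md/root."""
    if kind == "local":
        return base
    if kind == "named-foreign":
        return "/dev/md/{}:{}".format(sb_homehost, name)
    # Foreign: append the first numeric suffix that is not already taken.
    taken = existing or set()
    suffix = 0
    while "{}_{}".format(base, suffix) in taken:
        suffix += 1
    return "{}_{}".format(base, suffix)


if __name__ == "__main__":
    kind = classify("1.1", sb_homehost="workstation2",
                    conf_homehost="workstation1")
    print(kind, device_path(kind, "/dev/md/root", "workstation2", "root"))
```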

There is a special exception for version 1.x superblock arrays.  If the
name field of the superblock contains a specially formatted name, then
it will be treated as a request to create the device with a given minor
number and name identical to an old version 0.90 superblock array.
Those special case names are:
	a) a bare number (e.g., 0)
	b) a bare name using standard number format (e.g., md0 or md_d0)
	c) a full name using standard number format (e.g., /dev/md0 or /dev/md_d0)
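
A rough sketch of recognizing those three forms, for anyone who needs to
do the same check in the installer.  The function name legacy_minor and
the regular expression are mine, written from the list above rather than
taken from mdadm.

```python
import re
from typing import Optional

# Matches "0", "md0", "md_d0", "/dev/md0" and "/dev/md_d0".
_LEGACY_NAME = re.compile(r"^(?:(?:/dev/)?(?:md_d|md))?(\d+)$")


def legacy_minor(name: str) -> Optional[int]:
    """Return the requested minor number, or None for an ordinary name."""
    match = _LEGACY_NAME.match(name)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for candidate in ("0", "md0", "/dev/md_d0", "root"):
        print(candidate, legacy_minor(candidate))
```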

If an array uses a name instead of a number, then the named entry
created in /dev/md/ will be a symlink to a random numeric md device in
/dev/.  For example, /dev/md/root, since it's the first device started
and since we start grabbing md devices at 127 and counting backwards
when starting named devices, will almost always point to /dev/md127.
The /dev/md127 file will be the real device file while the entries in
/dev/md/ are always symlinks.  This keeps things consistent with our
/sys/block entry, which will be md127, and our entry in /proc/mdstat,
which will also be md127, because the current /sys/block setup does not
allow /sys/block/md/root, only md<number>.
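
The counting-backwards behaviour for named devices can be pictured with
the following simplified sketch.  It only models the "start at 127 and
go down" rule described in this mail; the function name next_free_minor
is invented for the example and the real allocation is done elsewhere,
not by code like this.

```python
from typing import Set


def next_free_minor(in_use: Set[int], start: int = 127) -> int:
    """Return the highest unused minor number at or below start."""
    for minor in range(start, -1, -1):
        if minor not in in_use:
            return minor
    raise RuntimeError("no free md minor at or below %d" % start)


if __name__ == "__main__":
    used: Set[int] = set()
    for name in ("root", "home"):
        minor = next_free_minor(used)
        used.add(minor)
        # /dev/md/<name> would be a symlink to /dev/md<minor>.
        print("/dev/md/%s -> /dev/md%d" % (name, minor))
```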

_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/anaconda-devel-list
