Re: ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux

On 11/26/2009 04:31 AM, Hans de Goede wrote:
> Hi Doug,
> 
> That is a lot of information in there, let me try to summarize it
> and please let me know if I've missed anything:
> 
> 1) The default chunksize for raid4/5/6 is changing, this should
>    not be a problem as we do not specify a chunksize when creating
>    new arrays

I thought we did specify a chunksize.  Oh well, that just means our
default raid array performance will improve dramatically.  The old
default of 64k was horrible for performance relative to the new 512k
default.

Chunk      4 disks on MB        5 disks on MB        4 disks on PM
           write     read       write     read       write     read
64K        509.373   388.870    403.947   370.963    103.743    61.127
512K       502.123   498.510    460.817   487.720    113.897   111.980

MB = Motherboard ports
PM = single eSATA port to a port multiplier
Note: going from 4 disks to 5 disks on this one machine resulted in a
performance drop, which likely indicates bus saturation between the
memory subsystem and the southbridge; 5 disks simply exceeded the
southbridge's capacity.

> 2) The default bitmap chunk size changed, again not a problem as
>    we don't use bitmaps in anaconda atm
> 
> 3) We need to change the not using of a bitmap, we should use a bitmap
>    by default except when the array will be used for /boot or swap.

Correct.  The typical /boot array is too small to worry about; it can
usually be resynced in its entirety in a matter of seconds.  Swap
partitions shouldn't use a bitmap because we don't want the overhead of
sync operations on the swap subsystem, especially since its data is
generally transient.  Other filesystems, especially once you get to
10GB or larger, can benefit from the bitmap in the event of an improper
shutdown.

>    Questions:
>    1) What commandline option should we pass to "mdadm --create" to
>       achieve this?

--bitmap={none,internal}

In the future if we opt for something other than the default bitmap
chunk, then when the above is internal, we would also pass:

--bitmap-chunk=<chunksize in KB, default is 65536>
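
So, concretely, a create command for a typical data array might look
something like this (device names are purely illustrative, and
--bitmap-chunk only matters if we decide to override the default):

mdadm --create /dev/md/home --run --level=5 --raid-devices=4 \
      --bitmap=internal --bitmap-chunk=65536 \
      /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

For /boot and swap arrays we would pass --bitmap=none instead (or just
omit the option, since no bitmap is still the default at create time).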

> 4) We need to start specifying a superblock version, and preferably
>    version 1.1

No, we *must* start specifying a superblock version or else we will no
longer be able to boot our machines after a clean install.  The new
default is 1.1, and I'm perfectly happy to use that as the default, but
as far as I'm aware, the only boot loader that can use a 1.1 superblock
based raid1 /boot partition is grub2, so all the other arches would not
be able to boot and we would have to forcibly upgrade all systems using
grub to grub2.

> 5) Specifying a superblock version of 1.1 will render systems non
>    bootable, I assume this only applies to systems which have
>    a raid1 /boot, so I guess that we need to specify a superblock
>    version of 1.1, except when the raid set will be used for /boot,
>    where we should keep using 0.9
> 
>    Questions:
>    1) Is the above correct ?

No, not quite.  You can use superblock version 1.0 on /boot and grub
will then work.  Both version 0.90 and version 1.0 superblocks are at
the end of the device and do not confuse boot loaders.  Here's a summary
of superblock format differences:

Version 0.90:
	Stored at end of device
	Has no homehost field in the superblock but most recent versions of
mdadm would hash the name of the machine and use that for half of the
UUID, which provided a pseudo homehost entry
	Limited to 27 constituent devices
	Has no name field in the superblock
	Has a preferred-minor field in the superblock
	Does not contain sufficient information to distinguish between a
superblock at the end of a whole device and a superblock at the end of a
single partition that spans the whole device (i.e., if you create a
single partition that uses the whole drive and place a version 0.90
superblock on it, you can pass either the whole disk or the partition to
an mdadm assemble command, and mdadm cannot tell from the information in
the superblock whether you have passed in the right device).

Common to all version 1.x superblocks:
	Has homehost and name fields (actually, one field with a max length of
32 chars)
	Full UUID is randomly generated, none of it hashed, so more bits of
randomness in the UUID
	No limit to number of constituent devices
	Has no preferred-minor field in the superblock, but can be emulated by
use of appropriate entry in name field

Version 1.0:
	Located at end of device where version 0.90 superblocks are also located
	Contains sufficient information to differentiate between being a
superblock for the whole device or just a partition on the device

Version 1.1:
	Located at very beginning of device.  If placed on a whole disk device,
occupies the same space as the MBR and partition table and does not
leave room for them.  The data is offset to start after the superblock,
so the normal device cannot be used to access the data, only the md
device.

Version 1.2:
	Located at beginning of device + 4K.  This offset leaves the first 4K
free for the MBR and partition table.  It can, however, cause confusing
situations when used on whole disk devices: you are able to partition
the device, but the entire device is the raid device, so the partition
is meaningless even if present.  It does allow for booting off of these
devices (theoretically; I don't think anyone is doing so, and I suspect
even grub2 would need more work to make this operational).
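
To put the above in command form: only /boot needs the older 1.0
format, which grub can read at the end of the device; everything else
can use the new 1.1 default.  Roughly (the partitions here are just for
illustration):

mdadm --create /dev/md/boot -e 1.0 --level=1 --raid-devices=2 \
      /dev/sda1 /dev/sdb1
mdadm --create /dev/md/root -e 1.1 --level=1 --raid-devices=2 \
      /dev/sda2 /dev/sdb2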

> 6) When creating 1.1 superblock sets we need to pass in:
>    --homehost=<hostname>
>    --name=<devicename>
>    -e{1.0,1.1,1.2}
> 
>    Questions
>    1) Currently when creating a set, we do for example:
>       mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sda1
> /dev/sdb1
> 
>       What would this look like with the new mdadm, esp, what would
> happen to the
>       /dev/md0 argument ?

The /dev/md0 argument is arbitrary.  It could be /dev/md0, it could be
/dev/md/foobar.  However, if we insist on sticking with the old numbered
device files, then we should also make sure that the --name field we
pass in uses the special format that tells mdadm we want numbered
devices.  In this case, --name=0 would be appropriate.
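
In other words, the create command from your question would translate
into something roughly like this (with -e picked per the superblock
discussion above, and <hostname> filled in as usual):

mdadm --create /dev/md0 --run --level=1 --raid-devices=2 \
      -e 1.1 --homehost=<hostname> --name=0 /dev/sda1 /dev/sdb1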

But this actually ignores a real situation that some of us use to get
around the brokenness of anaconda for many releases now.  I typically
start any install by first burning the install image to CD, then booting
into rescue mode, then hand running fdisk on all my disks to get the
layout I want, then hand creating md raid arrays with the options I
want, then hand creating filesystems or swap spaces on those arrays with
the options I want.  Then I reboot into install mode on the same CD, and
when it gets to the disk layout, I specify custom layout and simply use
all the filesystems and md raid devices I created previously.

However, even if I use version 1.x superblocks, and even if I use named
md raid arrays, anaconda always insists on ignoring the names I've given
them and assigning them numbers.  Of course, the numbers don't
necessarily match up to the order in which I created the arrays, so I
have to guess at which numbered array corresponds to which named array
(unless there are obvious hints like different sizes, but the last time
I did this I had 7 arrays that were all the same size, each intended to
be a root filesystem for a different version of either RHEL or Fedora).
Then, once the install is complete, I have to go back into rescue mode,
remount the root filesystem, hand edit mdadm.conf to use names instead
of numbers, remake the initrd images (now dracut images), and change any
fstab entries; only then can I finally use the names.

Really, it's *very* annoying that this minor number dependence in
anaconda has gone on so long.  It was outdated 7 or 8 Fedora releases
ago.

>    If we can still specify which minor to use when creating a new array,
> even though
>    that minor may change after the first reboot, then the amount of
> changes needed
>    to the installer are minimal and we can likely do this without
> problems for RHEL-6.

I don't understand.  Please enlighten me as to these requirements on
minor numbers in the installer.  After all, it's not like there isn't a
simple means of naming these things:

If md raid device used for lvm pv, name it /dev/md/pv-#
If md raid device used for swap, name it /dev/md/swap-#
If md raid device used for /, name it /dev/md/root
If md raid device used for any other data partition, name it
/dev/md/<basename of mount point>
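
Sketched out as create commands (partitions and hostname again purely
illustrative), that scheme would look roughly like:

mdadm --create /dev/md/root --run -e 1.1 --homehost=<hostname> \
      --name=root --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm --create /dev/md/pv-0 --run -e 1.1 --homehost=<hostname> \
      --name=pv-0 --level=5 --raid-devices=4 \
      /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3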

And it's not like anaconda doesn't already have that information
available when it's creating filesystem labels, so I'm curious why it's
so hard to use names instead of numbers for arrays in anaconda?

> Regards,
> 
> Hans
> 
> 
> 
> 
> 
> 
> On 11/26/2009 03:59 AM, Doug Ledford wrote:
>> Please keep me on the Cc: as I'm not on this list.
>>
>> Upstream recently released mdadm-3.1.1, which I intend to include in
>> Fedora soon.  It finally updates three default settings that should have
>> been updated a long time ago.
>>
>> The default chunk size for raid4/5/6 is now 512K.  Anaconda needs to be
>> updated to either leave the default alone or use 512K itself.  In the
>> past it has passed in 256K, but extensive performance testing shows that
>> 512K is indeed the sweet spot on pretty much any SATA device, and since
>> SATA is the overwhelming majority of the disks we run on today, its
>> sweet spot should be our default.
>>
>> It updates the default bitmap chunk to be at least 65536K when using an
>> internal bitmap.  Performance tests showed as much as a 10% performance
>> penalty for the old default bitmap chunk (8192K).  The new bitmap chunk
>> reduces that performance penalty (although we don't have solid numbers
>> on how much...I'll work on that).  However, we've never used a bitmap by
>> default on any arrays we create.  That needs to change.  The simple
>> logic is this: no bitmap on /boot or any swap partitions, use a bitmap
>> on anything else.  If we need a bitmap chunk other than the default,
>> I'll follow up here.
>>
>> It updates the default superblock format from the old, antiquated,
>> deprecated version 0.90 superblock that we should have quit using years
>> ago to version 1.1.  This is the real kicker.  Since anaconda has never
>> actively set the superblock metadata version (even though we should have
>> been using 1.1 long ago), it's now going to have to start.  The reason
>> is that unless you upgrade machines to use an md raid aware boot loader,
>> such as grub2 on x86 (I have no idea what would work on non-x86 arches),
>> version 1.1 superblocks will render all installs unbootable.
>> More importantly though, unless the anaconda team decides to blindly set
>> all superblocks back to the old version 0.90 format, this change
>> necessitates more than just a change to controlling which version of 1.x
>> superblock we use on any given array, but also a change to how we create
>> and name arrays in general.  Version 0.90 superblocks are from back in
>> the day when we thought it was smart/reasonable to name arrays by number
>> and to mount scsi devices in fstab by their /dev/ entry.  That day is
>> long since gone, dead and buried.  We switched filesystems to mount
>> by label so they are immune to device number changes and similarly
>> version 1.x superblocks totally do away with the preferred-minor field
>> in the superblock.  Instead, they have a homehost and name field that
>> are used to control device *naming*, not numbering, and in a properly
>> running version 1.x superblock system, the device numbers are not
>> guaranteed to be static from boot to boot (although they usually are).
>> This doesn't appear to be much problem for dracut, but as an example,
>> I'm attaching the mkinitrd patch I have to apply to an F11 system after
>> every mkinitrd update in order to get initrd images that mount by name
>> properly.
>>
>> So, those are the major differences.  Switching to any of the version
>> 1.x superblocks necessitates that anaconda pass a few arguments that it
>> hasn't in the past.  Right now, these are the things anaconda is going
>> to need to start passing in on any mdadm create commands (that I don't
>> currently believe it does, but I haven't checked and could be wrong):
>>
>> --homehost=<hostname>
>> --name=<devicename>
>> -e{1.0,1.1,1.2}
>>
>> In addition, we should start passing the bitmap option as I outlined
>> above.
>>
>> We will also likely need to set the HOMEHOST entry in mdadm.conf and
>> possibly the AUTO entry in mdadm.conf as well.
>>
>> And this brings me to a different point.  Hans asked me to comment on
>> bz537329.  I would suggest people look at my comments there for some
>> additional explanation of why ideas like trying to make things work
>> without mdadm.conf are probably a bad idea.
>>
>> So here are a few additional things that I think are worth taking into
>> consideration.
>>
>> If an array is listed in mdadm.conf, then *every* item on the array line
>> must match the array or else it will fail to start.  This means that
>> ARRAY lines that list properties that can later be changed with
>> mdadm --grow can result in the array failing to be found on the next
>> reboot.  Therefore, it would be best if each new ARRAY line
>> we write includes nothing besides the name of the array, the metadata
>> version, and the UUID.
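>>
>> For example, a minimal ARRAY line of that form might look something
>> like this (the UUID here is obviously made up):
>>
>> ARRAY /dev/md/root metadata=1.1 UUID=4bf1b18d:fe67a0cd:92a21305:d0d23b72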
>>
>> If an array is listed in mdadm.conf, then both the --homehost and --name
>> settings will be overridden by the name in the mdadm.conf file, so do
>> not depend on either having an effect for arrays listed in mdadm.conf.
>>
>> However, homehost and name are both used heavily any time the array is
>> not listed in mdadm.conf so setting them correctly is still important.
>> There are a number of common scenarios that make this important: you are
>> carrying an array from machine to machine (like an external drive tower,
>> or raid1 usb flash drive, etc.), when an array is visible to multiple
>> hosts (like arrays built over SAN devices), or when you've built a
>> machine to replace an existing machine and you temporarily install the
>> drives from the machine being replaced in the new machine to copy data
>> across in which case you are starting both your new array and the old
>> array on the same machine.  They are also relied upon heavily in order
>> to attempt to satisfy those people that think the md raid stack should
>> work without any mdadm.conf file at all.  And there is a special case
>> exception in the name field that is used to attempt to preserve back
>> compatibility.  The intersection of all these attempts to satisfy
>> various needs is tricky.  Here's how names are determined:
>>
>> 1) If the array is identified in mdadm.conf, the name from the ARRAY
>> line is used.
>> 2) If HOMEHOST has been set in the config
>>     a) If the array uses a version 0.90 superblock, check to see if the
>> HOMEHOST has been encoded in the UUID via hash.  If not, treat as
>> foreign, if so, treat as local.
>>     b) For version 1.x superblocks check the homehost in the superblock
>> against the set homehost.  If they match, treat as local, else if the
>> homehost in the superblock is not empty treat as named foreign else
>> treat as foreign.
>> 3) else
>>     a) for version 0.90 superblocks treat the array as foreign.
>>     b) for 1.x if homehost is set then named foreign else foreign.
>>
>> In case #1, the name as it's in the file is used.  In the remainder of
>> cases, local means to attempt to create the array with the requested
>> number (in the case of 0.90 superblocks) or requested name (in the case
>> of version 1.x superblocks).  Foreign means that the array will be
>> started with the requested name + a suffix.  For example, version 0.90
>> superblock with preferred-minor of 0 would get created with a random
>> *actual* minor number and the name /dev/md0_0 or md0_1 if md0_0 already
>> exists, etc.  A version 1.x superblock with the name root would get
>> created as /dev/md/root_0.  Named foreign is used whenever a version 1.x
>> superblock can't be identified as local but it has a valid homehost
>> entry in the superblock.  The format attempt is /dev/md/homehost:name so
>> that if you were to mount an array from workstation2:root on
>> workstation1, it would be /dev/md/workstation2:root.
>>
>> There is a special exception for version 1.x superblock arrays.  If the
>> name field of the superblock contains a specially formatted name, then
>> it will be treated as a request to create the device with a given minor
>> number and name identical to an old version 0.90 superblock array.
>> Those special case names are:
>>     a) a bare number (aka, 0)
>>     b) a bare name using standard number format (aka, md0 or md_d0)
>>     c) a full name using standard number format (aka, /dev/md0 or
>> /dev/md_d0)
>>
>> If an array uses a name instead of a number, then the named entry
>> created in /dev/md/ will be a symlink to a random numeric md device in
>> /dev/.  For example, /dev/md/root, since it's the first device started
>> and since we start grabbing md devices at 127 and counting backwards
>> when starting named devices, will almost always point to /dev/md127.
>> The /dev/md127 file will be the real device file while the entries in
>> /dev/md/ are always symlinks.  This is in order to be consistent with
>> the fact that our /sys/block entry will be md127 and our entry in
>> /proc/mdstat will also be md127.  This is because the current /sys/block
>> setup does not allow /sys/block/md/root, only md<number>.
>>
>>
>>
>>


-- 
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/anaconda-devel-list
