Re: ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux

Hi,

On 11/28/2009 02:02 AM, Doug Ledford wrote:
> On 11/26/2009 04:31 AM, Hans de Goede wrote:
>> Hi Doug,
>>
>> That is a lot of information; let me try to summarize it, and please
>> let me know if I've missed anything:
>>
>> 1) The default chunksize for raid4/5/6 is changing; this should
>>    not be a problem as we do not specify a chunksize when creating
>>    new arrays

> I thought we did specify a chunksize.  Oh well, that just means our
> default raid array performance will improve dramatically.  The old
> default of 64k was horrible for performance relative to the new 512k
> default.

>  Chunk   4 disks on MB       5 disks on MB       4 disks on PM
>          write     read      write     read      write     read
>  64K     509.373   388.870   403.947   370.963   103.743   61.127
>  512K    502.123   498.510   460.817   487.720   113.897   111.980

> MB = Motherboard ports
> PM = single eSATA port to a port multiplier
> Note: going from 4 disks to 5 disks on this one machine resulted in a
> performance drop, which is a likely indicator that there were bus
> saturation issues between the memory subsystem and the southbridge and
> that 5 disks simply oversaturated the southbridge's capacity.
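
(Just as a note to self: if we ever do want to pin the chunk size
explicitly rather than rely on the new default, I assume it would simply
be a matter of adding --chunk to our create call, something like this
untested sketch with made-up device names:

   # hypothetical example: pin the chunk size to 512 KB explicitly
   mdadm --create /dev/md0 --run --level=5 --raid-devices=4 \
       --chunk=512 /dev/sd[abcd]1

where --chunk takes the size in KB.)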

>> 2) The default bitmap chunk size changed, again not a problem as
>>    we don't use bitmaps in anaconda atm
>>
>> 3) We need to change our policy of not using a bitmap: we should use
>>    a bitmap by default, except when the array will be used for /boot
>>    or swap.

> Correct.  The typical /boot array is too small to worry about; it can
> usually be resynced in its entirety in a matter of seconds.  Swap
> partitions shouldn't use a bitmap because we don't want the overhead of
> sync operations on the swap subsystem, especially since its data is,
> generally speaking, transient.  Other filesystems, especially once you
> get to 10GB or larger, can benefit from the bitmap in the event of an
> improper shutdown.

>>    Questions:
>>    1) What command-line option should we pass to "mdadm --create" to
>>       achieve this?

> --bitmap={none,internal}
>
> In the future, if we opt for something other than the default bitmap
> chunk, then when the above is internal we would also pass:
>
> --bitmap-chunk=<chunk size in KB, default is 65536>


Ok, I'll try to write a patch for this next week (this week I have some
parted stuff that needs doing).
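
To make sure I read this correctly: for a regular data array the create
call would then become something along these lines (an untested sketch
with made-up partitions, based on our current invocation):

   # data array: enable an internal write-intent bitmap
   mdadm --create /dev/md0 --run --level=1 --raid-devices=2 \
       --bitmap=internal /dev/sda2 /dev/sdb2

and for /boot and swap arrays we would explicitly pass --bitmap=none.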

>> 4) We need to start specifying a superblock version, and preferably
>>    version 1.1

> No, we *must* start specifying a superblock version or else we will no
> longer be able to boot our machines after a clean install.  The new
> default is 1.1, and I'm perfectly happy to use that as the default, but
> as far as I'm aware, the only boot loader that can use a 1.1 superblock
> based raid1 /boot partition is grub2, so all the other arches would not
> be able to boot and we would have to forcibly upgrade all systems using
> grub to grub2.

>> 5) Specifying a superblock version of 1.1 will render systems
>>    non-bootable. I assume this only applies to systems which have
>>    a raid1 /boot, so I guess that we need to specify a superblock
>>    version of 1.1, except when the raid set will be used for /boot,
>>    where we should keep using 0.9
>>
>>    Questions:
>>    1) Is the above correct?

> No, not quite.  You can use superblock version 1.0 on /boot and grub
> will then work.  Both version 0.90 and version 1.0 superblocks are at
> the end of the device and do not confuse boot loaders.  Here's a summary
> of superblock format differences:


Ok, so for /boot we must specify a superblock version; should we use 1.0
or 0.9?  (I assume 1.0, but confirmation of that would be good.)
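
Assuming the answer is 1.0, I'd expect the /boot case to end up looking
roughly like this (again an untested sketch with made-up partitions):

   # /boot: 1.0 superblock at the end of the device, no bitmap
   mdadm --create /dev/md0 --run --level=1 --raid-devices=2 \
       --metadata=1.0 --bitmap=none /dev/sda1 /dev/sdb1

i.e. the superblock stays at the end of the device so grub keeps
working, and no bitmap since it is /boot.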

<snip>


>> 6) When creating 1.1 superblock sets we need to pass in:
>>    --homehost=<hostname>
>>    --name=<devicename>
>>    -e{1.0,1.1,1.2}

>>    Questions:
>>    1) Currently when creating a set, we do for example:
>>       mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
>>
>>       What would this look like with the new mdadm, and especially,
>>       what would happen to the /dev/md0 argument?

> The /dev/md0 argument is arbitrary.  It could be /dev/md0, it could be
> /dev/md/foobar.  However, if we insist on sticking with the old numbered
> device files, then we should also do our best to make sure that the
> --name field we pass in is in the special format needed to get mdadm to
> automatically assume we want numbered devices.  In this case, --name=0
> would be appropriate.
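
So if we stick with numbered devices for now, I gather the full call
would end up looking roughly like this (my reading of the above, not
tested, partitions made up):

   # numbered device; --name=0 is the special format that tells mdadm
   # we want a numbered device
   mdadm --create /dev/md0 --run --level=1 --raid-devices=2 \
       -e1.1 --homehost=<hostname> --name=0 --bitmap=internal \
       /dev/sda2 /dev/sdb2

Please correct me if I've misread the --name part.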

> But this actually ignores a real situation that some of us use to get
> around the brokenness of anaconda for many releases now.  I typically
> start any install by first burning the install image to CD, then booting
> into rescue mode, then hand running fdisk on all my disks to get the
> layout I want, then hand creating md raid arrays with the options I
> want, then hand creating filesystems on those arrays or swap spaces on
> those arrays with the options I want.  Then I reboot in the install mode
> on the same CD, and when it gets to the disk layout, I specify custom
> layout and then I simply use all the filesystems and md raid devices I
> created previously.
>
> However, even if I use version 1.x superblocks, and even if I use named
> md raid arrays, anaconda always insists on ignoring the names I've given
> them and assigning them numbers.  Of course, the numbers don't
> necessarily match up to the order in which I created them, so I have to
> guess at which numbered array corresponds to which named array (unless
> there are obvious hints like different sizes, but in the last instance I
> was doing this I had 7 arrays that were all the same size, each intended
> to be a root filesystem for a different version of either RHEL or
> Fedora).
>
> Then, once the install is all complete, I have to go back into rescue
> mode, remount the root filesystem, hand edit the mdadm.conf to use names
> instead of numbers, remake the initrd images (now dracut images), change
> any fstab entries, and then I can finally use the names.  Really, it's
> *very* annoying that this minor number dependence in anaconda has gone
> on so long.  It was outdated 7 or 8 Fedora releases ago.


Then you should have asked us to change this 7 or 8 releases ago;
changing this so close to RHEL-6 is just not going to happen.

>>    If we can still specify which minor to use when creating a new
>>    array, even though that minor may change after the first reboot,
>>    then the changes needed to the installer are minimal and we can
>>    likely do this without problems for RHEL-6.

> I don't understand.  Please enlighten me as to these requirements on
> minor numbers in the installer.  After all, it's not like there isn't a
> simple means of naming these things:
>
> If md raid device used for lvm pv, name it /dev/md/pv-#
> If md raid device used for swap, name it /dev/md/swap-#
> If md raid device used for /, name it /dev/md/root
> If md raid device used for any other data partition, name it
> /dev/md/<basename of mount point>
>
> And it's not like anaconda doesn't already have that information
> available when it's creating filesystem labels, so I'm curious why it's
> so hard to use names instead of numbers for arrays in anaconda?


It is not that hard, but currently all mdraid code inside anaconda is
based on the assumption that arrays are identified by their minor, and
changing this takes time, time we do not have before RHEL-6.

So fixing this will have to wait until Fedora 14, I'm afraid.
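
(For the record, once we do make that switch, I would expect our create
calls to follow the naming scheme you describe above. A rough, untested
sketch with made-up partitions:

   # root filesystem array, named instead of numbered
   mdadm --create /dev/md/root --run --level=1 --raid-devices=2 \
       -e1.1 --homehost=<hostname> --name=root --bitmap=internal \
       /dev/sda2 /dev/sdb2
   # swap array: no bitmap, per the policy above
   mdadm --create /dev/md/swap-0 --run --level=1 --raid-devices=2 \
       -e1.1 --homehost=<hostname> --name=swap-0 --bitmap=none \
       /dev/sda3 /dev/sdb3

But as said, that is Fedora 14 material.)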

Regards,

Hans

p.s.

Can you please reply to bug 537329 one more time? I've tried to explain
why I think that we can simplify mdraid activation in the proposed way
despite your objections. If you insist on keeping things as is, that is
fine too; I'll then come up with a separate solution for the Intel BIOS
RAID problems the current activation setup causes.

_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/anaconda-devel-list
