Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

David Zeuthen <david@xxxxxxxx> · Tue, 14 Jul 2009 11:00:32 -0400

On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote:
> On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
> > Hi,
> > On 07/14/2009 03:39 PM, Doug Ledford wrote:
> >> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
> >>> Hi,
> >>>
> >>> As you probably know I'm working on making Fedora 12 use mdraid
> >>> instead of dmraid for Intel BIOS-RAID setups.
> >>>
> >>> The installer (anaconda) part is mostly done (needs more testing)
> >>> and now I'm looking at implementing support for this in dracut
> >>> (the new mkinitrd for Fedora 12).
> >>>
> >>> So I've been testing how this works for both imsm mdraid sets
> >>> and native mdraid metadata sets, in both cases using a 2 disk
> >>> mirror, so that the set can also be brought up in degraded mode.
> >>>
> >>> Currently the udev rules use incremental assembly like this:
> >>> mdadm -I /dev/mdraid-member
> >>
> >> Hmmm...does dracut use udev during initramfs time?
> >
> > Yes, it uses udev for everything, making discovery of / consistent
> > with the discovery of other storage devices.
> 
> I'm not sure I like or agree with that philosophy.  I absolutely  
> *don't* want my / filesystem or raid device treated like some plug in,  
> temporary, roaming raid device.  They *aren't* the same, not in terms  
> of importance to the running of the machine and not in terms of  
> reliability requirements.  By using mdadm -A in the mkinitrd calls, I  
> was able to put in an mdadm.conf file and limit what arrays get  
> started to arrays found non-ambiguously in that mdadm.conf file and  
> identified by UUID.  When you switch to incremental assembly for root,  
> you risk the possibility of name space collisions and non- 
> deterministic bring up of your / array.

I'm concerned about this too. To be more specific, I'm concerned about
both automatically assembling things like RAID arrays / LVM logical
volumes and also automounting devices [1].

Anyway, my point with all this is that maybe we are going about things
the wrong way in the initramfs. My understanding is that dracut roughly
works this way (please let me know if this is wrong):

 1. when generating the initramfs image, we leave information in
    the kernel command-line about the root filesystem - typically
    the UUID - e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344

 2. when the initramfs starts, we trigger all uevents and wait for
    things to settle

 3. Autoassembly / magic:

    - If we see e.g. md components, we activate them via udev rules
    - If we see e.g. LUKS devices, we unlock them (by interacting with
      the user asking for the passphrase) via udev rules.
    - Ditto for e.g. LVM

 4. if we see the rootfs (matching on e.g. the UUID passed on the
    kernel command line) we create the /dev/root symlink

 5. when the system has settled (e.g. no more uevents) we mount
    /dev/root and transition to non-early user space. If there
    is no /dev/root link, we bail out

Now, my beef is with step 3 above. I think it is way too optimistic to
just auto-assemble / unlock etc. everything. E.g. we end up doing a lot
of work not related to the rootfs that is better done in non-early user
space.

Instead, just like we specify the UUID for rootfs on the command-line,
we need to leave some instructions to the initramfs logic on _exactly_
what things should be autoassembled / unlocked / etc. in order to find
the rootfs. So the kernel command-line wouldn't really be "just" the
UUID of rootfs; it would be a whole recipe of actions to do. E.g.

 ROOTFS=UUID=1234          \ # this is the UUID of my rootfs
 MD_ASSEMBLE=UUID=4567     \ # assemble MD array with UUID 4567
 LUKS_UNLOCK=UUID=89ab       # unlock LUKS device with UUID 89ab

which would work for e.g. cases where rootfs is on a LUKS device which
is on a MD array. In other words, we'd need a whole "recipe" passed to
the initramfs (the mkinitrd tool would generate this recipe), not just
the UUID of the rootfs.
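
To make the idea a bit more concrete, here is a rough sketch (Python,
purely for illustration) of how the initramfs side could pull such a
recipe out of the kernel command line. The parameter names ROOTFS,
MD_ASSEMBLE and LUKS_UNLOCK are just the made-up ones from my example
above, and parse_recipe / is_wanted are hypothetical helpers; none of
this is something dracut or mdadm implements today.

```python
import shlex

# Hypothetical recipe keys, taken from the example above.
RECIPE_KEYS = ("ROOTFS", "MD_ASSEMBLE", "LUKS_UNLOCK")


def parse_recipe(cmdline):
    """Return recipe steps as (action, id_type, value) tuples,
    in the order they appear on the command line."""
    steps = []
    for token in shlex.split(cmdline):
        key, sep, rest = token.partition("=")
        if not sep or key not in RECIPE_KEYS:
            continue  # unrelated kernel parameter such as "ro" or "quiet"
        id_type, sep, value = rest.partition("=")
        if not sep or not value:
            raise ValueError(f"malformed recipe entry: {token!r}")
        steps.append((key, id_type, value))
    return steps


def is_wanted(steps, action, uuid):
    """True only if the recipe explicitly asks for this action on this UUID."""
    return (action, "UUID", uuid) in steps
```

A udev-triggered helper could then call is_wanted() before assembling
or unlocking anything, so that an MD array or LUKS device that is not
named in the recipe is simply left alone until non-early user space.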

Coincidentally, if we had something like this and the format of the
"recipe" was documented somewhere, it would be easy to e.g. implement
"rescue" functionality as described here

http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html

since graphical disk utilities would just find /etc/grub.conf (or
similar), read the recipe and then start assembling/unlocking bits and
mount them as appropriate in /mnt/rescue/.
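
For the rescue case, the reading side could be as small as the sketch
below (again Python, again only an illustration). It assumes a legacy
GRUB style config where each boot entry has a "kernel <image> <args>"
line, and it reuses the made-up ROOTFS / MD_ASSEMBLE / LUKS_UNLOCK
names; kernel_args and recipe_from_args are hypothetical helpers, not
an existing API.

```python
# Hypothetical recipe keys, as in the example recipe earlier in this mail.
RECIPE_KEYS = ("ROOTFS", "MD_ASSEMBLE", "LUKS_UNLOCK")


def kernel_args(grub_conf_text):
    """Return the argument list of every 'kernel' line in the config text."""
    result = []
    for line in grub_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "kernel":
            result.append(fields[2:])  # fields[1] is the kernel image path
    return result


def recipe_from_args(args, keys=RECIPE_KEYS):
    """Keep only the recipe entries, as a {key: 'UUID=...'} mapping."""
    recipe = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if sep and key in keys:
            recipe[key] = value
    return recipe


SAMPLE = """\
default=0
title Fedora
    root (hd0,0)
    kernel /vmlinuz ro ROOTFS=UUID=1234 MD_ASSEMBLE=UUID=4567 LUKS_UNLOCK=UUID=89ab
    initrd /initrd.img
"""

for args in kernel_args(SAMPLE):
    print(recipe_from_args(args))
```

A disk utility would still have to do the actual assembling, unlocking
and mounting itself; the point is only that it would not have to guess
which devices belong to the rootfs.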

Actually this is very close to what Doug is asking for when he says
(paraphrased) "just include mdadm.conf instead of this magic". The key
difference, however, is that the user _won't_ have to use mdadm.conf or
care about config files - it's all taken care of by the mkinitrd binary
when building the recipe. That is a good thing: one less config file to
worry about.

Thanks for considering, and sorry for the long mail,
David

[1] : As some background information, I've spent a good chunk of my
life, five years or so, dealing with end users complaining about how
plain block devices got automounted when they were plugged in. FWIW, the
complaints range from the nonsensical (irritated users: "these
desktop kids shall not decide how UNIX works") to actual bugs where the
on-disk contents were mis-detected and either something wrong got
automounted or we failed to automount at all.

If I've learned anything it's that you need to be very very careful here
- unlike Windows and other operating systems with such capabilities,
Linux is.. different.. mostly because we support so many different ways
to put a file system through things like md and dm. And you need to make
it very easy to turn things like this off.
