Re: [RFC PATCH 2/3] raid: external and internal metadata support

Dan Williams <dan.j.williams@xxxxxxxxx> · Fri, 6 Feb 2009 12:19:50 -0700

On Fri, Feb 6, 2009 at 11:21 AM, Bill Nottingham <notting@xxxxxxxxxx> wrote:
> Dan Williams (dan.j.williams@xxxxxxxxx) said:
>> > - kernel tells userspace "hey, i unmounted this"
>> > (userspace freaks out because the filesystem the daemon is running on
>> >  just went away)
>> mdmon does not know or care if the *filesystem* is read-only.  It is
>> reading and writing /proc, /sys, and the raw disk devices.
>
> Your daemon has to be running from somewhere. That tends to be
> reduced to the initramfs that you've already deleted and switchrooted
> away from (in which case, good luck on upgrades of your userspace tools -
> you're stuck with the version in the initramfs you booted from, even
> if your later userspace tools end up using some later protocol).

Actually no, you're not necessarily stuck with the mdmon from boot.  In
a pinch you could "mdmon /proc/mdstat /".  Worst case you need to
re-dracut and reboot, but that is already more flexible than the
approach where metadata is handled in kernel space.

> You can't run it from the rootfs, because that's a chicken/egg
> scenario (or you'll switch to it, and then be unable to mark it
> clean, because you can't mark the array r/o, because the filesystem
> is r/w, which you can't undo because the daemon is running on
> it...)

Array r/o is a separate issue from the raid metadata clean bit, see below.

> Long-lived daemons running from the initramfs aren't really good.
I agree, and I initially looked for ways to wait until the rootfs was
available before launching mdmon... then I hit the xfs journal
recovery case.

> We don't run udev that way.
At first glance it looks like plymouth is run this way, but I am
probably mistaken.  From dracut/init:
    [ -x /bin/plymouth ] && /bin/plymouth --newroot=$NEWROOT

One might say "just set the dirty bit, terminate, and wait for the
mdmon in the rootfs to take over".  The problem is that a disk could
fail in this window, and this event needs to be handled before the
kernel does anything else to the array.

>
>> > - userspace frobs bit in superblock to say 'This array is CLEAN!'
>> ...not in this scenario no.
>
> Then when is the clean bit set?
The clean bit can be set as soon as the parity data is in sync with
the data on the other drives.  We typically wait for some period of
write-inactivity to avoid needlessly touching the metadata after every
write.
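
To make the idle-timeout idea concrete, here is a minimal Python sketch
of the policy.  It is not mdmon's actual code: the class name, the
idle_delay_ms parameter, and the counters are hypothetical, and the
in_sync flag simply stands in for "parity matches the data".

```python
from dataclasses import dataclass


@dataclass
class CleanBitTracker:
    """Toy model of a debounced clean/dirty bit (hypothetical names)."""

    idle_delay_ms: int          # assumed: idle time required before marking clean
    dirty: bool = False
    last_write_ms: int = 0
    metadata_updates: int = 0   # how many times the on-disk metadata was touched

    def on_write(self, now_ms: int) -> None:
        # The first write after a clean period must flip the bit to dirty.
        # Later writes in the same burst do not touch the metadata again.
        if not self.dirty:
            self.dirty = True
            self.metadata_updates += 1
        self.last_write_ms = now_ms

    def on_tick(self, now_ms: int, in_sync: bool = True) -> None:
        # Mark clean only when parity is in sync AND writes have been idle
        # for at least idle_delay_ms.
        idle_for = now_ms - self.last_write_ms
        if self.dirty and in_sync and idle_for >= self.idle_delay_ms:
            self.dirty = False
            self.metadata_updates += 1


tracker = CleanBitTracker(idle_delay_ms=200)
for t in (0, 50, 100):          # a burst of three writes
    tracker.on_write(t)
tracker.on_tick(150)            # only 50 ms idle: stays dirty
still_dirty = tracker.dirty
tracker.on_tick(350)            # 250 ms idle: marked clean
```

In this model a burst of three writes costs two metadata updates (one
to go dirty, one to go clean) rather than one per write.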

>> > So, why isn't the ext* journal or filesystem unclean flag
>> > handled via a userspace file monitoring daemon, then?
>>
>> I'm not trying to be obtuse, but because it isn't.  Put another way,
>> consider what extra tools the initramfs would need if we wanted to
>> support an ntfs-3g rootfs.
>
> You're asking for *the exact same thing*... just RAID specific.
>

The key difference is that there are performance reasons for
handling filesystem metadata in the kernel.  Raid metadata events are
always in the slow path.

--
Dan