[ I was going to reply to this earlier, but the Red Sox and good weather got in the way this weekend. ;-]

>>>>> "Michael" == Michael Tokarev <mjt@xxxxxxxxxx> writes:

Michael> I've been doing sysadmin work for about 15 or 20 years.

Welcome to the club! It's a fun career, always something new to learn.

>> If you are going to mirror an existing filesystem, then by definition
>> you have a second disk or partition available for the purpose. So you
>> would merely set up the new RAID1, in degraded mode, using the new
>> partition as the base. Then you copy the data over to the new RAID1
>> device, change your boot setup, and reboot.

Michael> And you have to copy the data twice as a result, instead of
Michael> copying it only once to the second disk.

So? Why is this such a big deal? As I see it, there are two separate ways to set up RAID1 for an OS.

1. The mirror is built ahead of time and you install onto the mirror. And twice as much data gets written, half to each disk. *grin*

2. You are encapsulating an existing OS install and you need to reboot from the un-mirrored OS into the mirrored setup. So yes, you do have to copy the data from the original to the mirror, reboot, then resync back onto the original disk which has been added into the RAID set.

Neither case is really that big a deal. And with the RAID superblock at the front of the disk, you don't have to worry about mixing up which disk is which. It's not fun when you boot one disk, thinking it's the RAID disk, but end up booting the original disk.

>> As Doug says, and I agree strongly, you DO NOT want to have the
>> possibility of confusion and data loss, especially on bootup. And

Michael> There are different points of view, and different settings,
Michael> etc. For example, I once dealt with a Linux user who was
Michael> unable to use his disk partition, because his system (it was
Michael> Red Hat if I remember correctly) recognized some LVM volume on
Michael> his disk (it was previously used with Windows) and tried to
Michael> automatically activate it, thus making it "busy". What I'm
Michael> talking about here is that any automatic activation of
Michael> anything should be done with extreme care, using smart logic
Michael> in the startup scripts if at all.

Ah... but you can also deactivate LVM volumes if you like.

Michael> Doug's example - in my opinion anyway - shows wrong tools
Michael> or bad logic in the startup sequence, not a general flaw in
Michael> superblock location.

I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk), then you need to be even more careful about how you poke at things. This is especially true when you use the full disk for the mirror, because then you don't have the partition table to base some initial guesstimates on.

Since there is an explicit Linux RAID partition type, as well as an explicit Linux filesystem partition type (the filesystem is then decoded from the first Nk of the partition), you have a modicum of safety. If ext3 has its superblock in the first 4k of the disk, but you've set up the disk to use RAID1 with the RAID superblock at the end of the disk, you now need to be careful about how the disk is detected and then mounted. To the ext3 detection logic it looks like an ext3 filesystem; to the RAID detection logic it looks like a RAID component. Which is correct? Which is wrong? How do you tell programmatically?
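To make the "how do you tell programmatically?" question concrete, here is roughly what the generic probing tools give you (the device name is just a placeholder); each one only answers for the signatures it knows about, and nothing arbitrates between them:

  # Ask the generic filesystem prober what it thinks is on the
  # partition.  blkid picks ONE answer based on its own priority
  # rules, so an ext3 superblock at the front of a component whose
  # RAID superblock is at the end gets reported as plain ext3.
  blkid /dev/sdb1

  # Ask the md tools the same question.  --examine only looks for an
  # md superblock and will happily report the very same partition as
  # a RAID member.
  mdadm --examine /dev/sdb1

If both calls succeed, you have exactly the ambiguity described above, and it's whatever logic sits in the startup scripts that decides which answer wins.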
That's why I think all superblocks should be in the SAME location on the disk and/or partition, if used. It keeps down problems like this.

Michael> Another example is ext[234]fs - it does not touch the first
Michael> 512 bytes of the device, so if there was an msdos filesystem
Michael> there before, it will be recognized as such by many tools,
Michael> and an attempt to mount it automatically will lead to at
Michael> least scary output and nothing mounted, or to fsck doing
Michael> fatal things to it in the worst scenario. Sure, the first
Michael> 512 bytes should just be cleared.. but that's another topic.

I would argue that ext[234] should be clearing those 512 bytes. Why aren't they cleared?

Michael> Speaking of cases where it was really helpful to have the
Michael> ability to mount individual raid components directly without
Michael> the raid level - most of them were due to one or another
Michael> operator error, usually together with bugs and/or omissions
Michael> in software. I don't remember the exact scenarios anymore
Michael> (the last time was more than 2 years ago). Most of the time
Michael> it was one or another sort of system recovery.

In this case you're only talking about RAID1 mirrors; no other RAID configuration fits this scenario. And while this might look helpful, I would strongly argue that it's not, because it's a special case of the RAID code and can lead to all kinds of bugs and problems if it's not exercised properly.

Michael> In almost all machines I maintain, there's a raid1 for the
Michael> root filesystem built of all the drives (be it 2 or 4 or even
Michael> 6 of them) - the key point is to be able to boot off any of
Michael> them in case some cable/drive/controller rearrangement has to
Michael> be done. The root filesystem is quite small (256 or 512 MB
Michael> here), and it's not too dynamic either -- so it's not a big
Michael> deal to waste space on it.

Sure, I agree 100% here that mirroring your root (/) partition across two or more disks is good.

Michael> Problems occur - obviously - when something goes wrong. And
Michael> most of the time the issues we had happened at a remote site,
Michael> where there was no experienced operator/sysadmin handy.

That's why I like Rackable's RMM: I get full serial console access to the BIOS and the system. Very handy. Or get a PC Weasel PCI card and install it into such systems, or a remote KVM, etc. If you have to support remote systems, you need the infrastructure to properly support them.

Michael> For example, when one drive was almost dead and mdadm tried
Michael> to bring the array up, the machine just hung for an unknown
Michael> amount of time. An inexperienced operator was there. Instead
Michael> of trying to teach him how to pass a parameter to the
Michael> initramfs to stop it from trying to assemble the root array
Michael> and then assemble it manually, I told him to pass
Michael> "root=/dev/sda1" to the kernel.

I really don't like this, because you've now broken the RAID from the underside and it's no longer clear which mirror half is the clean one.

Michael> Root mounts read-only, so it should be a safe thing to do - I
Michael> only needed the root fs and a minimal set of services (which
Michael> are even in the initramfs), just for it to boot up to SOME
Michael> state where I can log in remotely and fix things later.
Michael> (No, I didn't want to remove the drive yet, I wanted to
Michael> examine it first, and it turned out to be a good idea because
Michael> the hang was happening only at the beginning of it, and while
Michael> we tried to install the replacement and fill it up with data,
Michael> an unreadable sector was found on another drive, so this old
Michael> but not-yet-removed drive was really handy.)

Heh. I can see that, but I honestly think this points back to a problem LVM/MD have with failing disks, and that is they don't time out properly when one half of the mirror is having problems.

Michael> Another situation - after some weird crash I had to examine
Michael> the filesystems found on both components - I wanted to look
Michael> at the filesystems and compare them, WITHOUT messing up the
Michael> raid superblocks (later on I wrote a tiny program to
Michael> save/restore 0.90 superblocks), and without attempting any
Michael> reconstruction. In fact, this very case - examining the
Michael> contents - is something I've done many times for one reason
Michael> or another. There's just no need to involve the raid layer
Michael> here at all, but it doesn't disturb things either (in some
Michael> cases anyway).

This is where a netboot would be my preferred setup, or a LiveCD. Booting off an OS half like this might seem like a good way to quickly work around the problem, but I feel like you're breaking the assumptions of the RAID setup and it's going to bite you worse one day.

Michael> Yet another - many times we had to copy an old system to a
Michael> new one - the new machine boots with 3 drives in it, 2 new,
Michael> and the 3rd (the boot one) from the old machine. I boot it
Michael> off the non-raided config on the 3rd drive (using only the
Michael> halves of the md devices), create new arrays on the 2 new
Michael> drives (note - had I started raid on the 3rd drive, there'd
Michael> be a problem with md device numbering -- for consistency I
Michael> number all the partitions and raid arrays similarly on all
Michael> machines), and copy the data over. There's no need to do the
Michael> complex procedure of adding components to the existing raid
Michael> arrays, dropping the old drive from them and resizing the
Michael> stuff - because of the latter step (and because there's no
Michael> need to resync in the first place - the 2 new drives are
Michael> new, hence I use --no-resync because they're filled with
Michael> zeros anyway).

Michael> Another case - we had to copy a large amount of data from one
Michael> machine to another, from a raid array. I just pulled out the
Michael> disk (bitmaps=yes, and I remounted the filesystem read-only),
Michael> inserted it into another machine, mounted it - without raid -
Michael> there, and did the copy. The superblock was preserved, and
Michael> when I returned the drive, everything was ok.

Michael> And so on. There have been countless cases like that, some of
Michael> which I've forgotten already.

Michael> Well. I know about the loop device, which has an "offset=XXX"
Michael> parameter, so one can actually see and use the "internals" of
Michael> a raid1 component, even if the superblock is at the
Michael> beginning. But see above, the very first case - go tell that
Michael> operator how to do it all ;)

That's the real solution to me, using the loop device with an offset! I keep forgetting about this too. And I think it's a key thing.
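Since the loop-with-offset trick keeps coming up, here's a minimal sketch of it (device names and the offset are placeholders; the point is only the mechanics). With 0.90/1.0 metadata the superblock sits at the end of the component, so no offset is needed at all; it's only when the metadata is at the front (the 1.1/1.2 flavours) that you have to skip past it:

  # 0.90/1.0 metadata: the filesystem starts at offset 0 of the
  # component, so a read-only mount of the bare partition works.
  mount -o ro /dev/sdb1 /mnt

  # 1.1/1.2 metadata: find out where the data actually starts.
  # mdadm --examine reports a "Data Offset" in 512-byte sectors.
  mdadm --examine /dev/sdb1

  # Map the data area through a read-only loop device and mount that.
  # Replace OFFSET_SECTORS with the value --examine reported.
  losetup -r -o $((OFFSET_SECTORS * 512)) /dev/loop0 /dev/sdb1
  mount -o ro /dev/loop0 /mnt

(mount -o ro,loop,offset=N /dev/sdb1 /mnt should get you the same thing in one step, which I suspect is the "offset=XXX" parameter Michael is referring to.)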
>> This leads to the heart of my initial post on this matter, that the
>> confusion of having four different variations of RAID superblocks is
>> bad. We should deprecate them down to just two: the old 0.90 format,
>> and the new 1.x format at the start of the RAID volume.

Michael> It's confusing for sure. But see: the 0.90 format is the most
Michael> commonly used one, and most importantly it's historical - it
Michael> has been here for many years and many systems are using it. I
Michael> don't want to run into a situation where, some years later, I
Michael> need to grab data from my old disk and can't, because the
Michael> 0.90 format isn't supported anymore.

I never said that we should drop *read* support for the 0.90 format; I was just suggesting that we make the 1.x format with the superblock in the first 4k of the partition the default from now on.

Michael> 0.90 has some real limitations (like 26 components at max,
Michael> etc.), hence the 1.x format appeared. And the various
Michael> flavours of the 1.x format are all useful too. For example,
Michael> if you're concerned about the safety of your data due to
Michael> defects(*) in your startup scripts, use whichever 1.x format
Michael> puts the metadata at the beginning.

That's just it, I think ;) In my mind, the 1.x format that puts the superblock in the first 4k of the partition should be the default, and the only version of 1.x used going forward.

Michael> (*) Note: software like libvolume-id (part of udev) is able
Michael> to recognize parts of raid 0.90 arrays just fine.
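For what it's worth, you can already ask for a specific superblock flavour when creating an array, so the argument is really only about the default. A rough sketch (hypothetical device names; 1.1 puts the superblock at the very start of each member, 1.2 puts it 4k in):

  # Create a mirror with 1.x metadata near the start of each member,
  # instead of the current 0.90 default at the end.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --metadata=1.2 /dev/sda1 /dev/sdb1

  # Check which format an existing member carries - look at the
  # "Version" line in the output.
  mdadm --examine /dev/sdb1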