Re: Time to deprecate old RAID formats?

John Stoffel wrote:

>>>>>> "Michael" == Michael Tokarev <mjt@xxxxxxxxxx> writes:

>>> If you are going to mirror an existing filesystem, then by definition
>>> you have a second disk or partition available for the purpose.  So you
>>> would merely setup the new RAID1, in degraded mode, using the new
>>> partition as the base.  Then you copy the data over to the new RAID1
>>> device, change your boot setup, and reboot.
> 
> Michael> And you have to copy the data twice as a result, instead of
> Michael> copying it only once to the second disk.
> 
> So?  Why is this such a big deal?  As I see it, there are two separate
> ways to setup a RAID1 setup, on an OS.
[..]
That was just a tiny nitpick, so to say, about a particular way to
convert an existing system into raid1 - not something that's done every
day anyway.  Still, doubling the time to copy your terabyte-sized
drive is something to consider.
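
For reference, a minimal sketch of that conversion (device names are made
up: assume /dev/sda1 holds the current filesystem and /dev/sdb1 is the new,
empty partition):

  # degraded raid1 with only the new partition; "missing" keeps the
  # second slot open
  mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
  mkfs.ext3 /dev/md0
  mount /dev/md0 /mnt/new
  # first copy: existing data onto the degraded array
  cp -ax / /mnt/new
  # ... adjust fstab and the bootloader, reboot onto /dev/md0 ...
  # adding the old partition afterwards triggers a full resync -
  # effectively the second copy of the same data
  mdadm /dev/md0 --add /dev/sda1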

[]
> Michael> automatically activate it, thus making it "busy".  What I'm
> Michael> talking about here is that any automatic activation of
> Michael> anything should be done with extreme care, using smart logic
> Michael> in the startup scripts if at all.
> 
> Ah... but you can also de-active LVM partitions as well if you like.  

Yes, especially for a newbie user who has just installed linux on his PC,
only to find that he can't use his disk.. ;)  That was a real situation -
I helped someone who had never heard of LVM and had done little of anything
with filesystems/disks before.

> Michael> The Doug's example - in my opinion anyway - shows wrong tools
> Michael> or bad logic in the startup sequence, not a general flaw in
> Michael> superblock location.
> 
> I don't agree completely.  I think the superblock location is a key
> issue, because if you have a superblock location which moves depending
> the filesystem or LVM you use to look at the partition (or full disk)
> then you need to be even more careful about how to poke at things.

Superblock location does not depend on the filesystem.  Raid exports
only the "inside" space, excluding superblocks, to the next level
(filesystem or whatever else).

> This is really true when you use the full disk for the mirror, because
> then you don't have the partition table to base some initial
> guestimates on.  Since there is an explicit Linux RAID partition type,
> as well as an explicit linux filesystem (filesystem is then decoded
> from the first Nk of the partition), you have a modicum of safety.

Speaking of whole disks - first, don't do that (for reasons better suited
to another topic), and second, using the whole disk or partitions
makes no real difference whatsoever to the topic being discussed.

There's just no need for the guesswork, except for the first install
(to automatically recognize existing devices, and to use them after
confirmation), and maybe for rescue systems, which again is a different
topic.

In any case, for a tool that does guesswork (like libvolume-id, to
create /dev/ symlinks), it's as easy to look at the end of the device
as at the beginning or at any other fixed place - since the tool has
to know the superblock format, it knows the superblock location as well.

Maybe "manual guesswork", based on hexdump of first several kilobytes
of data, is a bit more difficult in case where superblock is located
at the end.  But if one has to analyze hexdump, he doesn't care about
raid anymore.
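
For the curious, the 0.90 superblock sits in a 64k-aligned reserved area at
the end of the device (only 4k of it actually used), so the "look at the end"
part is just a little arithmetic.  A sketch, with /dev/sdb1 as an example
component:

  SIZE=$(blockdev --getsize64 /dev/sdb1)   # device size in bytes
  # last 64 KiB-aligned block: (size rounded down to 64 KiB) - 64 KiB
  SKIP=$(( SIZE / 65536 - 1 ))
  # a 0.90 superblock starts with the magic 0xa92b4efc (stored in host
  # byte order, so fc 4e 2b a9 on a little-endian box)
  dd if=/dev/sdb1 bs=65536 skip=$SKIP count=1 2>/dev/null | hexdump -C | head

The same dd, pointed the other way (of=/dev/sdb1, seek=$SKIP), is essentially
the superblock save/restore trick mentioned further down.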

> If ext3 has the superblock in the first 4k of the disk, but you've
> setup the disk to use RAID1 with the LVM superblock at the end of the
> disk, you now need to be careful about how the disk is detected and
> then mounted.

See above.  For tools, it's trivial to distinguish a component of a
raid volume from the volume itself by looking for the superblock at whatever
location.  That includes stuff like mkfs, which - as mdadm does - may
warn one about previous filesystem/volume information on the device
in question.
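
E.g., assuming a raid1 /dev/md0 built from /dev/sdb1 (exact messages differ
between versions):

  mdadm --examine /dev/sdb1    # finds an md superblock: it's a component
  mdadm --examine /dev/md0     # no md superblock: it's the volume itself
  tune2fs -l /dev/md0          # ...and the ext2/ext3 superblock shows up here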

> Michael> Speaking of cases where it was really helpful to have an
> Michael> ability to mount individual raid components directly without
> Michael> the raid level - most of them were due to one or another
> Michael> operator error, usually together with bugs and/or omissions
> Michael> in software.  I don't remember exact scenarios anymore (last
> Michael> time it was more than 2 years ago).  Most of the time it was
> Michael> one or another sort of system recovery.
> 
> In this case, you're only talking about RAID1 mirrors, no other RAID
> configuration fits this scenario.  And while this might look to be

Definitely.  However, linear can - to some extent - be used partially,
though with much less usefulness.

However, raid1 is a much more common setup than anything else - IMHO anyway.
It's the cheapest and most reliable thing for an average user anyway -
it's cheaper to get 2 large drives than, say, 3 somewhat smaller drives.
Yes, raid1 "wastes" 1/2 the space, compared with, say, raid5 on top of 3
drives (only 1/3 wasted), but 3 smallish drives still cost more than
2 larger drives.

> helpful, I would strongly argue that it's not, because it's a special
> case of the RAID code and can lead to all kinds of bugs and problems
> if it's not exercised properly. 

I'd say the key here is "if not exercised properly", not the superblock
location...

But we're now discussing a somewhat different thing.  It's historical: unix
has always allowed `rm -rf /' (well, almost - there used to be cases when
it wasn't possible to remove the file of a running executable - EBUSY).
Windows, for example, does not allow one to do such evil things, and a lot
of other similar stuff.  And for some strange reason I find unix to be
much more flexible and useful... oh well.  The question here is whether
an OS should think for/instead of the user or not.  Sure, tools
should not be dumb, and mdadm is probably a nice example of an intelligent
tool (I mean mdadm --create, which looks at the devices and warns you if
it thinks there may be something sensible in there).  But what's nice about
it is that when necessary, I'm able to do whatever I want (and I actually
used `rm -rf /' once, for good.. but that was more for fun than for real,
removing an old test install while running it).

> Michael> Problem occurs - obviously - when something goes wrong.  And
> Michael> most of the time issues we had happened on a remote site,
> Michael> where there was no experienced operator/sysadmin handy.
> 
> That's why I like Rackable's RMM, I get full serial console access to
> the BIOS and the system.  Very handy.  Or get a PC Weasel PCI card and
> install it into such systems, or a remote KVM, etc.  If you have to
> support remote systems, you need the infrastructure to properly
> support it.  

It isn't always possible.  The situation we have here is a lot of tiny
remote offices scattered around - in the city and around it, some up to
100Km away.  There's a single machine in each, which handles communication
tasks too - by means of a second ethernet card, or a dialup modem when good
connectivity isn't an option for whatever reason.  Maybe it'd be a
good idea to install tiny routers at each location, e.g.
Linksys WRTs ($80 or so each), but it doesn't really buy much --
server downtime is very small (this setup has been working at ~60 places
for about 10 years, and only 3 or 4 times have we needed to go on-site
to fix things).

> Michael> For example, when one drive was almost dead, and mdadm tried
> Michael> to bring the array up, the machine just hung for an unknown
> Michael> amount of time.  An inexperienced operator was there.  Instead of
> Michael> trying to teach him how to pass parameter to the initramfs to
> Michael> stop trying to assemble root array and next assembling it
> Michael> manually, I told him to pass "root=/dev/sda1" to the kernel.
> 
> I really don't like this, because you've now broken the RAID from the
> underside and it's not clear which is the clean mirror half now.  

Not clear?  Why, for God's sake??

Well.  One can speculate here about the instability of device nodes and
other such things - that after the next reboot sda may suddenly swap
its device node with sdb...  But that's irrelevant.  Even if the fs
is mounted read-write and one component of the raid has been modified behind
the md code's back, I can trivially fix things after the fact, even if after
the next reboot the md device comes back up out of all the components
without noticing that they are no longer the same.  Yes, things will be
badly broken if I change the filesystem significantly while it's mounted
on a component of the raid.  But the thing is that I know what's going on,
and I will ensure things end up ok.
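
The fixup itself is a couple of mdadm commands (a sketch; /dev/sda1 being
the half that was actually used, /dev/sdb1 the now-stale one):

  # assemble the array degraded, from the half that was used
  mdadm --assemble --run /dev/md0 /dev/sda1
  # put the stale half back; it gets rewritten by a full resync
  mdadm /dev/md0 --add /dev/sdb1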

> Michael> Root mounts read-only, so it should be a safe thing to do - I
> Michael> only needed root fs and minimal set of services (which are
> Michael> even in initramfs) just for it to boot up to SOME state where
> Michael> I can log in remotely and fix things later.  (no I didn't
> Michael> want to remove the drive yet, I wanted to examine it first,
> Michael> and it turned out to be a good idea because the hang was
> Michael> happening only at the beginning of it, and while we tried to
> Michael> install replacement and fill it up with data, there was an
> Michael> unreadable sector found on another drive, so this old but not
> Michael> removed drive was really handy).
> 
> Heh. I can see that but I honestly think this points back to a problem
> LVM/MD have with failing disks, and that is they don't time out
> properly when one half of the mirror is having problems.  

Yes - see above.  "Things work when everything works.  But problems
occur when something doesn't work as intended - be it hardware,
software bugs or operator errors" - something like that.

But the thing is - the bug - let's assume it was an error-handling bug -
prevented the system from operating correctly.  The system provided some
business-critical tasks, and it had to be brought up.  I had a very simple
way to do it, to bring it up to a state where people were able to work with
it as usual, and THEN to look at bugs/patches/whatever.  Without even
going on-site: after a 15-minute phone talk, it was running again within
20 minutes.

> Michael> Another situation - after some weird crash I had to examine
> Michael> the filesystems found on both components - I want to look
> Michael> at the filesystems and compare them, WITHOUT messing up
> Michael> with raid superblocks (later on I wrote a tiny program to
> Michael> save/restore 0.90 superblocks), and without attempting any
> Michael> reconstruction.  In fact, this very case - examining
> Michael> the contents - is something I've been doing many times for
> Michael> one or another reason.  There's just no need to involve
> Michael> raid layer here at all, but it doesn't disturb things either
> Michael> (in some cases anyway).
> 
> This is where a netboot would be my preferred setup, or a LiveCD.
> Booting off an OS half like this might seem like a good way to quickly
> work around the problem, but I feel like you're breaking the
> assumptions of the RAID setup and it's going to bite you worse one
> day. 

"Be careful when you use force".  You can cut your finger or even
kill yourself or someone else with a sharp knife, but it doesn't mean
we should forbid knifes.

> Michael> Well.  I know about a loop device which has "offset=XXX"
> Michael> parameter, so one can actually see and use the "internals"
> Michael> component of a raid1 array, even if the superblock is at the
> Michael> beginning.  But see above, the very first case - go tell to
> Michael> that operator how to do it all ;)
> 
> That's the real solution to me, using the loop device with an offset!
> I keep forgetting about this too.  And I think it's a key thing. 

Till this very thread, I didn't think about putting loop.ko into
my typical initramfs image... ;)
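
Something like this, for a component whose (1.x-at-the-front) superblock
pushes the data away from offset zero - with 0.90/1.0 the data starts at
offset 0 and a plain read-only mount of the component is enough.  The
2048-sector offset is only an example; take the real value from the
"Data Offset" line of mdadm --examine:

  mdadm --examine /dev/sdb1 | grep -i 'data offset'
  # expose just the data area, read-only, via a loop device
  losetup -r -o $((2048 * 512)) /dev/loop0 /dev/sdb1
  mount -o ro /dev/loop0 /mnt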

[]
> I never said that we should drop *read* support for 0.90 format, I was
> just suggesting that we make the 1.0 format with the superblock in the
> first 4k of the partition be the default from now on.

That's probably ok.  Not really sure what it buys us for real (again,
looking at the superblock at the end of the device, like libvolume-id does,
is trivial - but sure, not every "block device guesser" out there
has implemented this logic yet).  That is to say: there are still bugs in
other components (startup scripts, those guessers, ..) which may
confuse something so that your components get used improperly, and
placing the superblock at the beginning prevents some of them from
triggering.  Ditto for a user doing evil things by mistake - with the
superblock in front it becomes FAR less risky for the user -- who
uses force -- to do whatever he wants to, be it accidentally or
for real.

By the way.  The ability to mount/use a component device of a raid1
array independently of raid was a key point when I first used it.
The code (circa 1998) was new and buggy, it wasn't part of the
official kernel, and sure, I was afraid to use it.  But since I
knew I could always back off trivially by just removing the md layer and
one drive, I went ahead and deployed this stuff instead of an expensive
hardware raid solution.  This in turn allowed us to complete the
project, instead of rolling it back due to lack of money...

> Michael> 0.90 has some real limitations (like 26 components at max
> Michael> etc), hence 1.x format appeared.  And various flavours of 1.x
> Michael> format are all useful too.  For example, if you're concerned
> Michael> about safety of your data due to defects(*) in your startup
> Michael> scripts, -- use whatever 1.x format which puts the metadata
> Michael> at the beginning.  That's just it, I think ;)
> 
> This should be the default in my mind, having the new RAID 1.x format
> with the first 4k of the partition being the default and only version
> of 1.x that's used going forward.  

Ok, but at least 1.0 -- v.1 with the superblock at the end -- should stay
supported.  I for one will use it when I come across limitations/whatever
of the 0.90 format. ;)
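
In mdadm terms that's just the --metadata (-e) switch at --create time;
a sketch with made-up device names:

  # 1.0: superblock at the end, data still starts at sector 0,
  # so a component stays directly readable, as with 0.90
  mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  # 1.1: superblock at the very start of the device
  # (components are no longer directly mountable)
  mdadm --create /dev/md1 --metadata=1.1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1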

Thanks for the reply!

/mjt
