Peter T. Breuer wrote: []
Let's focus on the personal machine of mine for now since it uses Linux software RAID and therefore on-topic here. It has /boot on a small RAID-1,
This is always a VERY bad idea. /boot and /root want to be on as simple and uncomplicated a system as possible. Moreover, they never change, so what is the point of having a real time mirror for them? It would be sufficient to copy them every day (which is what I do) at file system level to another partition, if you want a spare copy for emergencies.
Raid1 (mirror) is the most "trivial" raid level out there, especially keeping in mind that the underlying devices -- all of them -- contain (or should, in theory -- modulo the "50% chance of any difference being unnoticed" etc) an exact copy of the filesystem. Also, root (and /boot -- I for one keep both /boot and root in a single small filesystem) do change -- not that often, but often enough that the "newaliases problem" (when you "forgot" to back it up after a change) happens from time to time.
After several years of experience with a lot of systems (and a lot of various disk failure scenarios too: when you have many systems, you have good chances to see a failure ;), I now use a very simple and (so far) reliable approach, which I explained here on this list before. You have several (we use 2, 3 or 4) disks which are the same (or almost: e.g. some 36Gb disks are really 35Gb or 37Gb; in case they differ, the "extra" space on the larger disk isn't used); root and /boot are on a small raid1 partition which is mirrored on *every* disk; swap is on raid1; the rest (/usr, /home, /var etc) is on raid5 arrays (maybe also raid0 for some "scratch" space). This way you have "equal" drives, and *any* drive, including the boot one, may fail at any time and the system will continue working as if all were working, including reboot (except for a (very rare in fact) failure scenario where the boot disk has a failed MBR or other sectors required to boot, but "the rest" of that disk is working, in which case you'll need physical presence to bring the machine up). All the drives are "symmetrical", usage patterns for all drives are the same, and due to the use of raid arrays, load is spread among them quite nicely. You're free to reorder the drives in any way you want, to replace any of them (maybe rearranging the rest if you're replacing the boot drive) and so on.
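For illustration, a rough sketch of how such a layout might be put together with mdadm on a hypothetical 3-disk box (all device names and sizes here are made up for the example, not from my actual systems; raidtools/raidtab would do just as well):

    # sd[abc]1 = small root+/boot partitions, sd[abc]2 = swap, sd[abc]3 = the rest
    mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
    mdadm --create /dev/md1 --level=1 --raid-devices=3 /dev/sda2 /dev/sdb2 /dev/sdc2
    mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sda3 /dev/sdb3 /dev/sdc3
    mke2fs -j /dev/md0                   # root (and /boot) on the small raid1
    mkswap /dev/md1 && swapon /dev/md1   # swap on raid1
    mke2fs -j /dev/md2                   # /usr, /home, /var etc (or split further)

Plus installing the boot loader into the MBR of *each* disk, so that any of them can actually boot the box.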
Yes, the root fs does not change often, and yes it is small enough (I use 1Gb, or 512Mb, or even 256Mb for the root fs - not a big deal to allocate that space on every one of 2 or 3 or 4 or 5 disks). So it isn't really relevant how fast the filesystem is on writes, and hence it's ok to place it on a raid1 composed of 5 components. The stuff just works, it is very simple to administer/support, and does all the "backups" automatically. In case of some problem (yes, I dislike any additional layers for critical system components, as any layer may fail to start during boot etc), you can easily bring the system up by booting off the underlying root-raid partition to repair the system -- all the utilities are there. Moreover, you can boot from one disk (without raid) and try to repair the root fs on another drive (if things are really screwed up), and when you're done, bring the raid up on that repaired partition and add the other drives to the array.
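That repair path looks roughly like this (a sketch only, with hypothetical device names; it relies on the old 0.90 md superblock living at the *end* of the component, so the filesystem on a raid1 member is directly accessible):

    e2fsck /dev/sdb1                             # repair the fs right on one raid1 component
    mdadm --assemble --run /dev/md0 /dev/sdb1    # start the array degraded on that component
    mdadm /dev/md0 --add /dev/sda1               # re-add the other mirrors; they resync from it
    mdadm /dev/md0 --add /dev/sdc1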
To summarize: having /boot and root on raid1 is a very *good* idea. ;) It has saved our data a lot of times in the past few years already.
If you're worried about "silent data corruption" due to different data being read from different components of the raid array.. Well, first of all, we have never seen that yet (we have quite a good "testcase") (and no, I'm not saying it's impossible of course). On a rarely-changed filesystem, with real drives which do no silent remapping of unreadable blocks to a new place with the data on them becoming all-0s, without drives with uncontrollable write caching (quite common for IDE drives) and things like that, and with real memory (ECC I mean), where you *know* what you're writing to each disk (yes, there's also another possible cause of a problem: software errors aka bugs ;), that case of different data on different drives becomes quite.. rare. In order to be really sure, one can mount -o remount,ro / and just compare all components of the root raid, periodically. When there are more than 2 components in the array, it should be easy to determine which drive is "lying" in case of any difference. I do a similar procedure on my systems during boot.
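A minimal version of that check might look like this (hypothetical device names; note the 0.90 md superblock sits in the last 64-128Kb of each component and differs per device, so compare only the data area, not the whole partition):

    mount -o remount,ro /
    for d in /dev/sda1 /dev/sdb1 /dev/sdc1; do
        dd if=$d bs=1024k count=250 2>/dev/null | md5sum   # adjust count to stop before the superblock
    done
    mount -o remount,rw /

With 3 or more components, the single checksum that differs points at the lying drive.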
There is nowhere that is not software RAID to put the journals, so
Well, you can make somewhere. You only require an 8MB (one cylinder) partition.
Note SCSI disks in Linux only support up to 14 partitions, which sometimes isn't sufficient even without additional partitions for journals. When you have a large number of disks (so having that "fully-symmetrical" layout as I described above becomes impractical), you can use one set of drives for data and another set of drives for the journal of that data. When you only have 4 (or fewer) drives...
And yes, I'm aware of mdp devices (partitions inside the raid arrays).. but that's just another layer "which may fail": if a raid5 array won't start, I can at least reconstruct the filesystem image by reading chunks of data from the appropriate places on all the drives and try to recover that image; with any additional structure inside the array (and the lack of "loopP" aka partitioned loop devices) it becomes more and more tricky to recover any data (from this point of view, raid1 is the nicest raid level ;)
Again: instead of using a partition for the journal, use (another?) raid array. This way, the system will keep working if the drive which contains the journal fails. Note the above about swap: on all my systems, swap is also on raid (raid1 in this case). At first glance, that may look like nonsense: having swap on raid. But we had enough cases where, due to a failed drive, swap became corrupt (unreadable really), and the system went haywire, *damaging* other data which was unaffected by the disk failure! With swap on raid1, the system continues working if any drive fails, which is good. (Older kernels, esp. the 2.2.* series, had several problems with swap on raid, but that has been fixed now; there were other bugs fixed too (incl. bugs in ext3fs), so there should be no such damage to other data due to unreadable swap.. hopefully. But I can't trust my systems anymore after seeing (2 times in 4 years) what can happen to the data...)
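A sketch of the journal-on-raid1 variant (again, hypothetical device names; the journal device block size has to match the data filesystem, and the fs should be unmounted while switching journals):

    mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5
    mke2fs -b 4096 -O journal_dev /dev/md3   # make the small raid1 an external ext3 journal
    tune2fs -O ^has_journal /dev/md2         # drop the internal journal of the data fs, if any
    tune2fs -J device=/dev/md3 /dev/md2      # attach the raid1-backed journal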
[]
And I also want to "re-reply" to your first message in this thread, where I was saying that "it's nonsense that raid does not preserve write ordering". Of course I meant not write ordering but working write barriers (as Neil pointed out, the md subsystem does not implement write barriers directly, but the concept is "emulated" by the linux block subsystem). Write barriers should be sufficient to implement journalling safely.
/mjt