On Mon, Sep 15, 2014 at 08:12:40 CEST, Ross Boylan wrote:
> There seem to be problems below the dm-crypt layer. Both LVM volume
> groups report
>
>   Incorrect metadata area header checksum
>
> Since the architecture is, starting at the top and working down,
>
>   file system
>   crypto (sometimes)
>   LVM volume            <- LUKS at this level
>   LVM volume group      <- corruption at this level
>   partition of the RAID virtual disk
>   RAID
>   physical disk partition
>
> The RAID at least still seems intact.
>
> I'm at a loss about what could have corrupted both volume groups, on
> separate physical disks. Things did seem to start going bad after the
> snapshot filled (6:30), though it was an hour later I got the first
> file system error.

As it worked initially and then stopped working after data was written,
I suspect the containers are overlapping, possibly because there is
some misalignment between containers on different layers, so that data
written into one container corrupts another one.

If I see this correctly, you have

  1. Partition
  2. RAID
  3. LVM
  4. LUKS

That is decidedly too many layers. KISS is not even in the building
anymore with that. I know, the distro likely gave you something like
this, but really it is a symptom of a failed engineering mind-set that
keeps stacking up complexity until things fail.

> Maybe the crash a couple of days ago corrupted some key
> operating system files.

That should not happen, even with an overly complex set-up of RAID, LVM
and LUKS. All of them keep their metadata on disk, and it is usually
either not rewritten at all or rewritten atomically. And if you use
user-space RAID assembly (metadata 1.0, 1.1 or 1.2), that should either
work or fail, but not result in corruption. (Unless somebody did
something really, really stupid, like storing the RAID geometry in a
file and then enforcing it...)

> At any rate, if the volume groups are bad I suspect I'm toast and need
> to go to remote backups. There is some stuff about recovering the VG
> header on the net, but even if that succeeds it would be hard to trust
> the rest of the file systems.

Given the complexity, I don't think you can reasonably make sure you
have repaired things even if you find the errors. I would recommend a
complete rebuild and, if possible, without LVM. The only real benefit
of LVM here is things like dynamic resizing, but as your experience
shows, that is more of a theoretical benefit anyway. I suspect it fails
more often than not and is only useful when you need to resize online,
but also have the time and budget to go through several test runs on an
identical test machine beforehand to make sure it works, and then the
time to make really, really sure it has worked by analyzing things in
detail after each test run...

Really, partition->RAID->LUKS and partition->RAID should be quite
enough. I have used that for 12 years with excellent reliability,
including in a cluster set-up.

I would also recommend going with the old superblock 0.90 format for
the RAID and kernel-level autodetection (partition type 0xfd). That
increases reliability further, as there is no dependency on user-space
software or configuration for RAID assembly. It also has the benefit
that any kernel with RAID auto-assembly can assemble the RAIDs, for
example one from a rescue CD.
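If you want to test the overlap theory before you wipe anything, you
can dump where each layer thinks its data starts and compare the
offsets by hand. A rough sketch; the device names below are only
placeholders, substitute your actual RAID members, arrays, PVs and LVs:

  # RAID: superblock version and layout of each member, and of the array
  mdadm --examine /dev/sda1
  mdadm --detail /dev/md0

  # LVM: metadata area size and where the first physical extent starts
  pvs -o pv_name,pv_mda_size,pe_start,pv_size

  # LUKS: payload offset (in 512-byte sectors) inside the LV
  cryptsetup luksDump /dev/mapper/vg0-somelv

If one layer's payload starts inside an area another layer also claims,
you have your culprit.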
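If you nevertheless want to try the VG header recovery you found before
giving up on the data, LVM keeps plain-text copies of the metadata
under /etc/lvm/backup and /etc/lvm/archive, and vgcfgrestore can write
them back. A rough sketch (the VG name and device are examples, and
even if it succeeds I would not trust the result much):

  # check the on-disk PV metadata for consistency
  pvck -v /dev/md0p1

  # list the metadata backups/archives LVM still has for this VG
  vgcfgrestore --list vg0

  # write the default backup from /etc/lvm/backup back to the PVs;
  # this modifies the metadata areas, so only do it on an intact RAID
  vgcfgrestore vg0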
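For the rebuild, a minimal sketch of what I mean by
partition->RAID->LUKS with 0.90 superblocks and kernel autodetection.
Device names, RAID level, the cipher defaults and the file system are
only examples, adjust them to your layout:

  # mark both partitions as "Linux raid autodetect" (MBR type fd),
  # e.g. in fdisk: t, fd, w  (for /dev/sda1 and /dev/sdb1)

  # RAID1 with the old 0.90 superblock so the kernel can auto-assemble
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --metadata=0.90 /dev/sda1 /dev/sdb1

  # LUKS directly on the array, file system directly in the container
  cryptsetup luksFormat /dev/md0
  cryptsetup luksOpen /dev/md0 cryptdata
  mkfs.ext4 /dev/mapper/cryptdata

The plain partition->RAID variant is the same minus the two cryptsetup
steps, with the file system made directly on /dev/md0.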
Arno
--
Arno Wagner, Dr. sc. techn., Dipl. Inform., Email: arno@xxxxxxxxxxx
GnuPG: ID: CB5D9718  FP: 12D6 C03B 1B30 33BB 13CF B774 E35C 5FA1 CB5D 9718
----
A good decision is based on knowledge and not on numbers. -- Plato

If it's in the news, don't worry about it. The very definition of
"news" is "something that hardly ever happens." -- Bruce Schneier