On Mon, Sep 15, 2014 at 08:12:40 CEST, Ross Boylan wrote:
> There seem to be problems below the dm-crypt layer. Both LVM volume
> groups report
>
>   Incorrect metadata area header checksum
>
> Since the architecture is, starting at the top and working down,
>
>   file system
>   crypto (sometimes)
>   LVM volume            <- LUKS at this level
>   LVM volume group      <- corruption at this level
>   partition of the RAID virtual disk
>   RAID
>   physical disk partition
>
> The RAID at least still seems intact.
>
> I'm at a loss about what could have corrupted both volume groups, on
> separate physical disks. Things did seem to start going bad after the
> snapshot filled (6:30), though it was an hour later I got the first
> file system error.

As it worked initially and then stopped working after data was written,
I suspect the containers are overlapping, possibly because there is
some misalignment between containers on different layers, so that data
written into one container corrupts another one.

If I see this correctly, you have

  1. Partition
  2. RAID
  3. LVM
  4. LUKS

That is decidedly too many layers. KISS is not even in the building
anymore with that. I know, the distro likely gave you something like
this, but really it is a symptom of a failed engineering mind-set that
keeps stacking up complexity until things fail.

> Maybe the crash a couple of days ago corrupted some key
> operating system files.

That should not happen, even with an overly complex set-up of RAID, LVM
and LUKS. All of them keep their metadata on disk, and it is usually
either not rewritten at all or rewritten atomically. And if you use
user-space RAID assembly (metadata 1.0, 1.1 or 1.2), that should either
work or fail, but not result in corruption. (Unless somebody did
something really, really stupid, like storing the RAID geometry in a
file and then enforcing it...)

> At any rate, if the volume groups are bad I suspect I'm toast and need
> to go to remote backups. There is some stuff about recovering the VG
> header on the net, but even if that succeeds it would be hard to trust
> the rest of the file systems.

Given the complexity, I don't think you can reasonably make sure you
have repaired things even if you find the errors. I would recommend a
complete rebuild and, if possible, without LVM. The only real benefit
of LVM here is things like dynamic resizing, but as your experience
shows, that is more of a theoretical benefit anyway. I suspect it fails
more often than not and is only useful when you need to resize online,
but also have the time and budget to go through several test runs on an
identical test machine beforehand to make sure it works, and then the
time to make really, really sure it has worked by analyzing things in
detail after each test run...

Really, partition->RAID->LUKS and partition->RAID should be quite
enough. I have used that for 12 years with excellent reliability,
including in a cluster set-up.

I would also recommend going with the old superblock 0.90 format for
the RAID and kernel-level autodetection (partition type 0xfd). That
increases reliability further, as there is no dependency on user-space
software or configuration for RAID assembly. It also has the benefit
that any kernel with RAID auto-assembly can assemble the RAIDs, for
example one from a rescue CD.
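If you want to test the overlap theory before you wipe anything, you
can dump where each layer thinks its data starts and compare the
offsets by hand. A rough sketch; the device names below are only
placeholders, substitute your actual RAID members, arrays, PVs and LVs:

  # RAID: superblock version and layout of each member, and of the array
  mdadm --examine /dev/sda1
  mdadm --detail /dev/md0

  # LVM: metadata area size and where the first physical extent starts
  pvs -o pv_name,pv_mda_size,pe_start,pv_size

  # LUKS: payload offset (in 512-byte sectors) inside the LV
  cryptsetup luksDump /dev/mapper/vg0-somelv

If one layer's payload starts inside an area another layer also claims,
you have your culprit.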
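If you nevertheless want to try the VG header recovery you found before
giving up on the data, LVM keeps plain-text copies of the metadata
under /etc/lvm/backup and /etc/lvm/archive, and vgcfgrestore can write
them back. A rough sketch (the VG name and device are examples, and
even if it succeeds I would not trust the result much):

  # check the on-disk PV metadata for consistency
  pvck -v /dev/md0p1

  # list the metadata backups/archives LVM still has for this VG
  vgcfgrestore --list vg0

  # write the default backup from /etc/lvm/backup back to the PVs;
  # this modifies the metadata areas, so only do it on an intact RAID
  vgcfgrestore vg0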
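For the rebuild, a minimal sketch of what I mean by
partition->RAID->LUKS with 0.90 superblocks and kernel autodetection.
Device names, RAID level, the cipher defaults and the file system are
only examples, adjust them to your layout:

  # mark both partitions as "Linux raid autodetect" (MBR type fd),
  # e.g. in fdisk: t, fd, w  (for /dev/sda1 and /dev/sdb1)

  # RAID1 with the old 0.90 superblock so the kernel can auto-assemble
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --metadata=0.90 /dev/sda1 /dev/sdb1

  # LUKS directly on the array, file system directly in the container
  cryptsetup luksFormat /dev/md0
  cryptsetup luksOpen /dev/md0 cryptdata
  mkfs.ext4 /dev/mapper/cryptdata

The plain partition->RAID variant is the same minus the two cryptsetup
steps, with the file system made directly on /dev/md0.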
Arno
--
Arno Wagner, Dr. sc. techn., Dipl. Inform., Email: arno@xxxxxxxxxxx
GnuPG: ID: CB5D9718  FP: 12D6 C03B 1B30 33BB 13CF B774 E35C 5FA1 CB5D 9718
----
A good decision is based on knowledge and not on numbers. -- Plato

If it's in the news, don't worry about it. The very definition of
"news" is "something that hardly ever happens." -- Bruce Schneier