On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> wrote:
> Well, I probably wasn't clear enough. I talked about a crashed FS, but I was
> talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
> only one) has PROBABLY crashed in the past, causing corruption in ceph data
> on this node, and then the subsequent crash of other nodes.
>
> RIGHT now btrfs on this node is OK. I can access the filesystem without
> errors.

But the LevelDB isn't. Its contents got corrupted, somehow, somewhere, and it really is up to the LevelDB library to tolerate those errors; we only use a simple get/put interface, and LevelDB is triggering an internal error.

> One node had a problem with btrfs, leading first to a kernel problem, probably
> corruption (on disk / in memory maybe?), and ultimately to a kernel oops.
> Before that ultimate kernel oops, bad data was transmitted to other
> (sane) nodes, leading to ceph-osd crashes on those nodes.

The LevelDB binary contents are not transferred over to other nodes; this kind of corruption would not spread over the Ceph clustering mechanisms. It's more likely that you have 4 independently corrupted LevelDBs. Something in the workload Ceph runs makes that corruption quite likely.

The information here isn't enough to say whether the cause of the corruption is btrfs or LevelDB, but the recovery needs to be handled by LevelDB, and upstream is working on making it more robust:

http://code.google.com/p/leveldb/issues/detail?id=97
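
To make the "tolerate those errors behind a get/put interface" point a bit more concrete, here is a rough Python sketch. It is a toy in-memory store, not LevelDB's API or on-disk format; the names ChecksummedStore, CorruptRecord, flip_bit and demo are made up for illustration. Each value is stored with a CRC32, a damaged record is reported on read, and the other keys stay readable.

```python
import struct
import zlib


class CorruptRecord(Exception):
    """Raised when a stored record fails its checksum (hypothetical name)."""


class ChecksummedStore:
    """Toy in-memory get/put store. NOT LevelDB's API or file format."""

    def __init__(self):
        self._records = {}  # key (bytes) -> framed record (bytearray)

    def put(self, key, value):
        # Frame = 4-byte little-endian CRC32 of the value, then the value.
        crc = zlib.crc32(value) & 0xFFFFFFFF
        self._records[key] = bytearray(struct.pack("<I", crc) + value)

    def get(self, key):
        record = bytes(self._records[key])
        (stored_crc,) = struct.unpack("<I", record[:4])
        value = record[4:]
        # Recompute the checksum on every read; a mismatch means damage.
        if zlib.crc32(value) & 0xFFFFFFFF != stored_crc:
            raise CorruptRecord(key)
        return value

    def flip_bit(self, key, offset=0):
        """Simulate local storage damage by flipping one bit of the value."""
        self._records[key][4 + offset] ^= 0x01


def demo():
    store = ChecksummedStore()
    store.put(b"a", b"alpha")
    store.put(b"b", b"beta")
    store.flip_bit(b"a")  # damage one record only

    results = {}
    for key in (b"a", b"b"):
        try:
            results[key] = store.get(key)
        except CorruptRecord:
            # Tolerate the error: flag the bad key, keep serving the rest.
            results[key] = None
    return results
```

Calling demo() gives None for the damaged key and the original value for the intact one. The design point is only that detection and recovery sit inside the store, below the get/put calls, which is where the handling has to happen in the LevelDB case as well.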