On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:

> On 03/07/2012 at 23:38, Tommi Virtanen wrote:
> > On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> wrote:
> > > In the case I could repair, do you think a crashed FS as it is right
> > > now would be valuable to you, for future reference, since I saw you
> > > can't reproduce the problem? I can make an archive (or a btrfs dump?),
> > > but it will be quite big.
> >
> > At this point, it's more about the upstream developers (of btrfs etc.)
> > than us; we're on good terms with them but not experts on the on-disk
> > format(s). You might want to send an email to the relevant mailing
> > lists before wiping the disks.
>
> Well, I probably wasn't clear enough. I talked about a crashed FS, but I
> was talking about Ceph. The underlying FS (btrfs in that case) of one
> node (and only one) has PROBABLY crashed in the past, causing corruption
> in the Ceph data on that node, and then the subsequent crash of other
> nodes.
>
> RIGHT now btrfs on this node is OK. I can access the filesystem without
> errors.
>
> For the moment, of the 8 nodes, 4 refuse to restart.
> One of those 4 was the crashed node; the 3 others didn't have problems
> with the underlying FS as far as I can tell.
>
> So I think the scenario is:
>
> One node had a problem with btrfs, leading first to a kernel problem,
> probably corruption (on disk / in memory maybe?), and ultimately to a
> kernel oops. Before that final kernel oops, bad data was transmitted to
> other (sane) nodes, leading to ceph-osd crashes on those nodes.

I don't think that's actually possible — the OSDs all do quite a lot of
interpretation between what they get off the wire and what goes on disk.
What you've got here are 4 corrupted LevelDB databases, and we pretty much
can't produce that kind of corruption through the interfaces we have. :/

> If you think this scenario is highly improbable in real life (that is,
> btrfs will probably be fixed for good, and then corruption can't
> happen), that's OK.
>
> But I wonder if this scenario can be triggered by other problems, with
> bad data transmitted to other, sane nodes (a power outage, an
> out-of-memory condition, a full disk, for example).
>
> That's why I offered you a crashed Ceph volume image (I shouldn't have
> talked about a crashed FS, sorry for the confusion).

I appreciate the offer, but I don't think this will help much — it's
on-disk state managed by somebody else, not our own logical state, that
has broken. If we could figure out how that state got broken, that would
be good, but a "Ceph image" won't really help in doing so.

I wonder if maybe there's a confounding factor here — are all your nodes
similar to each other, or are they running on different kinds of hardware?
How did you do your Ceph upgrades? And what does ceph -s display when the
cluster is running as best it can?
-Greg
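
PS: if you want a quick sanity check of whether those LevelDB stores are
even readable, something along the lines of the sketch below should tell
you. This is only a rough illustration, assuming the plyvel Python
bindings for LevelDB; the default path is just an example of where an
OSD's omap store may live on your nodes.

#!/usr/bin/env python
# Rough readability check for an OSD's LevelDB (omap) store.
# Assumes the plyvel bindings (pip install plyvel); the default path below
# is only illustrative -- point the script at the omap directory of one of
# the OSDs that refuses to start.
import sys
import plyvel

path = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/ceph/osd/ceph-0/current/omap"

try:
    db = plyvel.DB(path, create_if_missing=False)
except Exception as e:
    print("could not open %s: %s" % (path, e))
    sys.exit(1)

count = 0
try:
    # Walk every key/value pair; a corrupted store typically fails partway
    # through iteration rather than at open time.
    for key, value in db.iterator():
        count += 1
except Exception as e:
    print("iteration failed after %d keys: %s" % (count, e))
    sys.exit(1)
finally:
    db.close()

print("read %d keys without errors" % count)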