On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> wrote:
> On 04/07/2012 18:21, Gregory Farnum wrote:
>
>> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
>>>
>>> On 03/07/2012 23:38, Tommi Virtanen wrote:
>>>>
>>>> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> In the case I could repair, do you think a crashed FS as it is right
>>>>> now is valuable to you, for future reference, since I saw you can't
>>>>> reproduce the problem? I can make an archive (or a btrfs dump?), but
>>>>> it will be quite big.
>>>>
>>>> At this point, it's more about the upstream developers (of btrfs etc)
>>>> than us; we're on good terms with them but not experts on the on-disk
>>>> format(s). You might want to send an email to the relevant mailing
>>>> lists before wiping the disks.
>>>
>>> Well, I probably wasn't clear enough. I talked about a crashed FS, but
>>> I was talking about Ceph. The underlying FS (btrfs in this case) of one
>>> node (and only one) PROBABLY crashed in the past, causing corruption of
>>> the Ceph data on that node, and then the subsequent crash of other
>>> nodes. RIGHT now btrfs on this node is OK: I can access the filesystem
>>> without errors.
>>> At the moment, 4 of the 8 nodes refuse to restart.
>>> 1 of the 4 was the crashed node; the 3 others didn't have problems
>>> with the underlying fs as far as I can tell.
>>> So I think the scenario is:
>>> One node had a problem with btrfs, leading first to kernel trouble,
>>> probably corruption (on disk / in memory maybe?), and ultimately to a
>>> kernel oops. Before that final kernel oops, bad data was transmitted
>>> to other (sane) nodes, leading to ceph-osd crashes on those nodes.
>>
>> I don't think that's actually possible — the OSDs all do quite a lot of
>> interpretation between what they get off the wire and what goes on disk.
>> What you've got here are 4 corrupted LevelDB databases, and we pretty
>> much can't do that through the interfaces we have. :/
>
> OK, so as all nodes were identical, I probably hit a btrfs bug (like an
> erroneous out-of-space) at more or less the same time. And when 1 osd was
> out,
>
>>> If you think this scenario is highly improbable in real life (that is,
>>> btrfs will probably be fixed for good, and then corruption can't
>>> happen), that's OK.
>>> But I wonder whether this scenario can be triggered by other problems,
>>> with bad data transmitted to other sane nodes (power outage, out of
>>> memory condition, disk full... for example).
>>> That's why I offered you a crashed Ceph volume image (I shouldn't have
>>> talked about a crashed fs, sorry for the confusion).
>>
>> I appreciate the offer, but I don't think this will help much — it's a
>> disk state managed by somebody else, not our logical state, which has
>> broken. If we could figure out how that state got broken that'd be good,
>> but a "ceph image" won't really help in doing so.
>
> OK, no problem. I'll restart from scratch, freshly formatted.
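If you do feel like poking at one of the broken stores before you reformat,
here is a rough, untested sketch of how I'd check whether LevelDB itself can
still open it. It uses the plyvel Python bindings, and it assumes the
FileStore omap DB lives under <osd_data>/current/omap; both of those are my
assumptions, not something blessed by us, so please only run it against a
copy of the directory, never the live osd data dir:

#!/usr/bin/env python
# Rough sketch, not a supported tool: try to open (and, failing that,
# repair) a *copy* of an OSD's omap LevelDB, just to see how far it gets.
# Assumptions: the FileStore omap DB is at <osd_data>/current/omap, and the
# plyvel LevelDB bindings are installed.
import sys
import plyvel

# default path is a guess based on your osd.1 data dir; pass your copy's path
path = sys.argv[1] if len(sys.argv) > 1 else '/CEPH/data/osd.1/current/omap'

try:
    db = plyvel.DB(path)                 # open the existing store read-write
except Exception as e:
    print('open failed: %s' % e)
    print('attempting repair_db() on the copy...')
    plyvel.repair_db(path)               # rewrites the DB files; copy only!
    db = plyvel.DB(path)

count = 0
for key, value in db.iterator():         # walk whatever keys remain readable
    count += 1
print('readable keys: %d' % count)
db.close()

Even if a repair pass gets the store to open, I wouldn't trust that OSD for
anything beyond inspection; reformatting it, as you're planning, is still the
right call.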
>> I wonder if maybe there's a confounding factor here — are all your nodes
>> similar to each other,
>
> Yes. I designed the cluster that way. All nodes are identical hardware
> (PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to
> storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).

Oh, interesting. Are the broken nodes all on the same set of arrays?

>> or are they running on different kinds of hardware? How did you do your
>> Ceph upgrades? What does ceph -s display when the cluster is running as
>> best it can?
>
> Ceph was running 0.47.2 at the time (Debian package for ceph). After the
> crash I couldn't restart all the nodes. I tried 0.47.3 and now 0.48
> without success.
>
> Nothing particular for upgrades; for the moment Ceph is broken, so it was
> just apt-get upgrade with the new version.
>
> ceph -s shows this:
>
> root@label5:~# ceph -s
>    health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering;
> 32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck
> stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded
> (10.729%); 1814/1245570 unfound (0.146%)
>    monmap e1: 3 mons at
> {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
> election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
>    osdmap e2404: 8 osds: 3 up, 3 in
>    pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5
> active+recovering+remapped, 32 active+clean+replay, 11
> active+recovering+degraded, 25 active+remapped, 710 down+peering, 222
> active+degraded, 7 stale+active+recovering+degraded, 61
> stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering,
> 8 stale+down+remapped+peering, 9 active+recovering; 4786 GB data,
> 7495 GB used, 7280 GB / 15360 GB avail; 267286/2491140 degraded
> (10.729%); 1814/1245570 unfound (0.146%)
>    mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby

Okay, that looks about how I'd expect if half your OSDs are down.

> BTW, after the 0.48 upgrade there was a disk format conversion. 1 of the
> 4 surviving OSDs didn't complete it:
>
> 2012-07-04 10:13:27.291541 7f8711099780 -1 filestore(/CEPH/data/osd.1)
> FileStore::mount : stale version stamp detected: 2. Proceeding, do_update
> is set, performing disk format upgrade.
> 2012-07-04 10:13:27.291618 7f8711099780  0 filestore(/CEPH/data/osd.1)
> mount found snaps <3744666,3746725>
>
> Then nothing happens for hours; iotop shows constant disk usage:
>
>  6069 be/4 root   0.00 B/s  32.09 M/s  0.00 % 19.08 % ceph-osd -i 1
> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf
>
> strace shows lots of syscalls like this:
>
> [pid  6069] pread(25, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4101, 94950) = 4101
> [pid  6069] pread(23, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4107, 49678) = 4107
> [pid  6069] pread(36, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4110, 99797) = 4110
> [pid  6069] pread(37, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4105, 8211) = 4105
> [pid  6069] pread(25, "\0C\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4121, 99051) = 4121
> [pid  6069] pread(36, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4173, 103907) = 4173
> [pid  6069] pread(37, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4169, 12316) = 4169
> [pid  6069] pread(37, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4130, 16485) = 4130
> [pid  6069] pread(36, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4129, 108080) = 4129

Sam, does this look like something of ours to you?
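In the meantime, one cheap way to tell whether that conversion is actually
making forward progress rather than spinning is to watch whether the files
in the omap directory keep being rewritten. The same caveats as above apply:
I'm assuming the FileStore omap DB is at <osd_data>/current/omap, and that
steady churn in its files means the upgrade is still moving; this is only a
sketch, not a supported tool:

#!/usr/bin/env python
# Rough sketch: report once a minute whether files under the omap dir are
# being rewritten while the ceph-osd format conversion runs. The default
# path below is a guess based on the /CEPH/data/osd.1 data dir in your log.
import os
import sys
import time

omap = sys.argv[1] if len(sys.argv) > 1 else '/CEPH/data/osd.1/current/omap'

def snapshot(path):
    # map file name -> (size, mtime) for every regular file in the dir
    files = {}
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            st = os.stat(full)
            files[name] = (st.st_size, st.st_mtime)
    return files

prev = snapshot(omap)
while True:
    time.sleep(60)
    cur = snapshot(omap)
    changed = [n for n in cur if cur[n] != prev.get(n)]
    total = sum(size for size, _ in cur.values())
    print('%s: %d files, %d bytes total, %d changed in the last minute'
          % (time.strftime('%H:%M:%S'), len(cur), total, len(changed)))
    prev = cur

If nothing changes for several intervals while ceph-osd is still doing
32 MB/s of reads, that would at least tell us the time is going somewhere
other than the omap store.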