You can try searching the archives and tracker.ceph.com for hints about
repairing these issues, but your disk stores have definitely been corrupted
and it's likely to be an adventure. I'd recommend examining your local
storage stack underneath Ceph and figuring out which part was ignoring
barriers.
-Greg
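One way to start that investigation, as a minimal sketch (this assumes SATA
disks and ext4/xfs OSD filesystems; /dev/sda is a placeholder for the actual
data disk):

    # Is the drive's volatile write cache enabled? A volatile cache combined
    # with disabled barriers can corrupt on-disk stores on power loss.
    hdparm -W /dev/sda

    # Were any of the OSD filesystems mounted with barriers disabled?
    grep -E 'nobarrier|barrier=0' /proc/mounts

If a RAID controller sits in between, its cache settings (write-back with no
battery backup, for instance) are worth checking with the vendor's tool as
well.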
On Fri, Feb 20, 2015 at 10:39 AM, Jeff <jeff@xxxxxxxxxxxxxxxxxxx> wrote:
> Should I infer from the silence that there is no way to recover from the
> "FAILED assert(last_e.version.version < e.version.version)" errors?
>
> Thanks,
> Jeff
>
> ----- Forwarded message from Jeff <jeff@xxxxxxxxxxxxxxxxxxx> -----
>
> Date: Tue, 17 Feb 2015 09:16:33 -0500
> From: Jeff <jeff@xxxxxxxxxxxxxxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Power failure recovery woes
>
> Some additional information/questions:
>
> Here is the output of "ceph osd tree".
>
> Some of the "down" OSDs are actually running, but are still marked "down".
> For example, osd.1:
>
>   root 30158 8.6 12.7 1542860 781288 ? Ssl 07:47 4:40
>   /usr/bin/ceph-osd --cluster=ceph -i 0 -f
>
> Is there any way to get the cluster to recognize them as being up? osd.1
> has the "FAILED assert(last_e.version.version < e.version.version)" errors.
>
> Thanks,
> Jeff
>
> # id    weight  type name       up/down reweight
> -1      10.22   root default
> -2      2.72            host ceph1
> 0       0.91                    osd.0   up      1
> 1       0.91                    osd.1   down    0
> 2       0.9                     osd.2   down    0
> -3      1.82            host ceph2
> 3       0.91                    osd.3   down    0
> 4       0.91                    osd.4   down    0
> -4      2.04            host ceph3
> 5       0.68                    osd.5   up      1
> 6       0.68                    osd.6   up      1
> 7       0.68                    osd.7   up      1
> 8       0.68                    osd.8   down    0
> -5      1.82            host ceph4
> 9       0.91                    osd.9   up      1
> 10      0.91                    osd.10  down    0
> -6      1.82            host ceph5
> 11      0.91                    osd.11  up      1
> 12      0.91                    osd.12  up      1
>
> On 2/17/2015 8:28 AM, Jeff wrote:
>>
>> -------- Original Message --------
>> Subject: Re: Power failure recovery woes
>> Date: 2015-02-17 04:23
>> From: Udo Lembke <ulembke@xxxxxxxxxxxx>
>> To: Jeff <jeff@xxxxxxxxxxxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
>>
>> Hi Jeff,
>> is the osd /var/lib/ceph/osd/ceph-2 mounted?
>>
>> If not, does it help if you mount the osd and then start it with
>>   service ceph start osd.2
>> ?
>>
>> Udo
>>
>> Am 17.02.2015 09:54, schrieb Jeff:
>>> Hi,
>>>
>>> We had a nasty power failure yesterday, and even with UPSes our small
>>> (5-node, 12-OSD) cluster is having problems recovering.
>>>
>>> We are running ceph 0.87.
>>>
>>> Three of our OSDs are down consistently (others stop and are
>>> restartable, but the cluster is so slow that almost everything we do
>>> times out).
>>>
>>> We are seeing errors like this on the OSDs that never run:
>>>
>>>     ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1)
>>>     Operation not permitted
>>>
>>> We are seeing errors like these on the OSDs that run some of the time:
>>>
>>>     osd/PGLog.cc: 844: FAILED assert(last_e.version.version <
>>>     e.version.version)
>>>     common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide
>>>     timeout")
>>>
>>> Does anyone have any suggestions on how to recover our cluster?
>>>
>>> Thanks!
>>> Jeff
>
> ----- End forwarded message -----
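For Udo's suggestion above, a minimal sketch of checking for an unmounted
OSD store (the device path /dev/sdb1 is a placeholder; the real partition
can be found with lsblk or blkid):

    # Is the OSD's data directory actually a mountpoint?
    mountpoint /var/lib/ceph/osd/ceph-2

    # If not, mount the data partition (placeholder device) and retry
    mount /dev/sdb1 /var/lib/ceph/osd/ceph-2
    service ceph start osd.2

Note this only addresses the "error converting store ... Operation not
permitted" case; the OSDs hitting the PGLog assert are a different problem,
the store corruption Greg describes above.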