You might check whether http://tracker.ceph.com/issues/13060 applies to
your cluster. If so, upgrading to 0.94.4 should fix it.

*Don't* reset your OSD journal. That is never the answer and is
basically the same as trashing the OSD in question.
-Greg

On Tue, Oct 27, 2015 at 9:59 AM, Laurent GUERBY <laurent@xxxxxxxxxx> wrote:
> Hi,
>
> After a host failure (and two disks failing within 8 hours),
> one of our OSDs failed to start after boot with the following error:
>
> 0> 2015-10-26 08:15:59.923059 7f67f0cb2900 -1 osd/PG.cc: In function
> 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
> ceph::bufferlist*)' thread 7f67f0cb2900 time 2015-10-26 08:15:59.922041
> osd/PG.cc: 2856: FAILED assert(values.size() == 1)
>
> Full log attached here:
>
> http://tracker.ceph.com/issues/13594
>
> As noted, this is similar to
>
> http://tracker.ceph.com/issues/4855
>
> which was closed as "cannot reproduce".
>
> After a second host failure we got a second
> OSD with the same error (we tried multiple times to restart it), which is
> scary since our cluster is not that big and recovery
> takes a very long time.
>
> We'd like to restart these OSDs. Maybe the
> start error is linked to the journal?
> Would it be safe to reset the journal with:
>
> ceph-osd --mkjournal -i OSDNUM
>
> Thanks in advance for any help,
>
> Sincerely,
>
> Laurent
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com