Our 0.94.2 OSDs are not restarting: osd/PG.cc: 2856: FAILED assert(values.size() == 1)

Laurent GUERBY <laurent@xxxxxxxxxx> · Tue, 27 Oct 2015 17:59:38 +0100

Hi,

After a host failure (and two disks failing within 8 hours),
one of our OSDs failed to start after boot with the following error:

0> 2015-10-26 08:15:59.923059 7f67f0cb2900 -1 osd/PG.cc: In function
'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
ceph::bufferlist*)' thread 7f67f0cb2900 time 2015-10-26 08:15:59.922041
osd/PG.cc: 2856: FAILED assert(values.size() == 1)

Full log attached here:

http://tracker.ceph.com/issues/13594

As noted, this is similar to

http://tracker.ceph.com/issues/4855

which was closed as "cannot reproduce".

After a second host failure, a second OSD now fails
with the same error (we tried restarting it multiple times).
This is worrying, since our cluster is not that big and recovery
takes a very long time.

We'd like to restart these OSDs. Could the
start error be linked to the journal?
Would it be safe to reset the journal with:

ceph-osd --mkjournal -i OSDNUM

Thanks in advance for any help,

Sincerely,

Laurent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com