Re: Our 0.94.2 OSD are not restarting : osd/PG.cc: 2856: FAILED assert(values.size() == 1)

Gregory Farnum <gfarnum@xxxxxxxxxx> · Tue, 27 Oct 2015 10:31:40 -0700

You might see if http://tracker.ceph.com/issues/13060 could apply to
your cluster. If so, upgrading to 0.94.4 should fix it.
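
A minimal sketch of the version check implied above, assuming the daemon reports its version as a string containing a dotted triple such as "ceph version 0.94.2". The helper names (parse_ceph_version, needs_upgrade) are hypothetical and not part of any Ceph tool. It only compares release numbers within the 0.94.x series; it does not tell you whether issue 13060 is actually what your OSD is hitting.

```python
import re

# Release named in this thread as carrying the fix (0.94.4).
FIXED_IN = (0, 94, 4)


def parse_ceph_version(text):
    """Extract (major, minor, patch) from a string like 'ceph version 0.94.2'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(text, fixed_in=FIXED_IN):
    """Return True if the reported version is older than the fixed release."""
    return parse_ceph_version(text) < fixed_in


if __name__ == "__main__":
    # The version string here is an assumed example, not captured output.
    print(needs_upgrade("ceph version 0.94.2"))
```

Tuples are compared element by element, so 0.94.10 would correctly sort after 0.94.4, which a plain string comparison would get wrong.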

*Don't* reset your OSD journal. That is never the answer and is
basically the same as trashing the OSD in question.
-Greg

On Tue, Oct 27, 2015 at 9:59 AM, Laurent GUERBY <laurent@xxxxxxxxxx> wrote:
> Hi,
>
> After a host failure (and two disks failing within 8 hours),
> one of our OSDs failed to start after boot with the following error:
>
> 0> 2015-10-26 08:15:59.923059 7f67f0cb2900 -1 osd/PG.cc: In function
> 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
> ceph::bufferlist*)' thread 7f67f0cb2900 time 2015-10-26 08:15:59.922041
> osd/PG.cc: 2856: FAILED assert(values.size() == 1)
>
> Full log attached here:
>
> http://tracker.ceph.com/issues/13594
>
> As noted this is similar to
>
> http://tracker.ceph.com/issues/4855
>
> which was closed as "cannot reproduce".
>
> After a second host failure we got a second
> OSD with the same error (we tried multiple times to restart), which is
> scary since our cluster is not that big and recovery
> takes a very long time.
>
> We'd like to restart these OSDs. Maybe the
> start error is linked to the journal?
> Would it be safe to reset the journal with:
>
> ceph-osd --mkjournal -i OSDNUM
>
> Thanks in advance for any help,
>
> Sincerely,
>
> Laurent
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com