Re: One OSD always dying


 



Hrm, at first glance that looks like the on-disk state got corrupted
somehow. If it's only one OSD which has this issue, I'd turn it off
and mark it out. Then if the cluster recovers properly, wipe it and
put it back in as a new OSD.
-Greg
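
A minimal sketch of the commands that procedure involves, assuming
Ubuntu/upstart (as the "init: ceph-osd (ceph/3)" lines further down suggest)
and that osd.3 is the failing one; confirm the id with "ceph osd tree" first:

    # stop the crashing daemon on its host
    stop ceph-osd id=3
    # mark it out so its data is re-replicated to the other OSDs
    ceph osd out 3
    # watch recovery and wait for HEALTH_OK
    ceph -w
    # once the cluster is healthy, remove the old OSD entirely
    ceph osd crush remove osd.3
    ceph auth del osd.3
    ceph osd rm 3
    # then wipe the disk and add it back as a brand-new OSD (e.g. via ceph-deploy)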

On Wed, Jan 15, 2014 at 1:49 AM, Rottmann, Jonas (centron GmbH)
<J.Rottmann@xxxxxxxxxx> wrote:
> Hi,
>
>
>
> I now did an upgrade to dumpling (ceph version 0.67.5
> (a60ac9194718083a4b6a225fc17cad6096c69bd1)), but the OSD still fails at
> startup with a stack trace.
>
>
>
> Here's the trace:
>
>
>
> http://paste.ubuntu.com/6755307/
>
>
>
> If you need any more info, I will provide it. Can someone please help?
>
>
>
> Thanks
>
>
>
> From: ceph-users-bounces@xxxxxxxxxxxxxx
> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Rottmann, Jonas
> (centron GmbH)
> Sent: Monday, 30 December 2013 09:30
> To: 'Andrei Mikhailovsky'
>
>
> Cc: ceph-users@xxxxxxxx
> Subject: Re: One OSD always dying
>
>
>
> Hi Andrei,
>
>
>
> This is the first time I'm running into this. How do I fix it? Upgrading with a
> cluster that is not fully healthy doesn't seem like a great idea.
>
>
>
> After fixing it I will perform the upgrade ASAP.
>
>
>
> Thanks for your help so far.
>
>
>
> From: Andrei Mikhailovsky [mailto:andrei@xxxxxxxxxx]
> Sent: Sunday, 29 December 2013 09:40
> To: Rottmann, Jonas (centron GmbH)
> Cc: ceph-users@xxxxxxxx
> Subject: Re: One OSD always dying
>
>
>
>
>
> Jonas,
>
> I've seen this happening on a weekly basis when I was running the 0.61 branch
> as well; however, after switching to the 0.67 branch it stopped. Perhaps you
> should try upgrading.
>
> Andrei
>
>
>
> ________________________________
>
> From: "Jonas Rottmann (centron GmbH)" <J.Rottmann@xxxxxxxxxx>
> To: "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> Sent: Saturday, 28 December, 2013 9:48:12 AM
> Subject: One OSD always dying
>
> Hi,
>
>
>
> One of my OSDs is dying all the time. I rebooted every node one after another
> and made sure they all have the same kernel version and glibc.
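>
> A quick sanity check across the nodes might look like this (the hostnames
> node1..node3 are placeholders, not from the original report):
>
>     for h in node1 node2 node3; do
>         ssh $h 'uname -r; ldd --version | head -1; ceph --version'
>     done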
>
>
>
> I’m using ceph version 0.61.9 (7440dcd135750839fa0f00263f80722ff6f51e90).
>
>
>
> Dmesg only shows:
>
>
>
> [ 5745.366041] init: ceph-osd (ceph/3) main process (2510) killed by ABRT
> signal
>
> [ 5745.366235] init: ceph-osd (ceph/3) main process ended, respawning
>
> [ 5763.824298] init: ceph-osd (ceph/3) main process (2991) killed by SEGV
> signal
>
>
>
> Basically every time this shows up in the logs:
>
>
>
> 2013-12-28 06:35:08.489431 7fc9eccd5700 -1 osd/ReplicatedPG.cc: In function
> 'ReplicatedPG::RepGather* ReplicatedPG::trim_object(const hobject_t&)'
> thread 7fc9eccd5700 time 2013-12-28 06:35:08.487862
>
> osd/ReplicatedPG.cc: 1379: FAILED assert(0)
>
>
>
> If you need more info I will send it. Please help! The whole cluster isn't
> working properly because of this…
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




