Re: OSDs crashing

Created: https://tracker.ceph.com/issues/36250

On Tue, Sep 25, 2018 at 9:08 PM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> On Tue, Sep 25, 2018 at 11:31 PM Josh Haft <paccrap@xxxxxxxxx> wrote:
> >
> > Hi cephers,
> >
> > I have a cluster of 7 storage nodes with 12 drives each and the OSD
> > processes are regularly crashing. All 84 have crashed at least once in
> > the past two days. Cluster is Luminous 12.2.2 on CentOS 7.4.1708,
> > kernel version 3.10.0-693.el7.x86_64. I rebooted one of the OSD nodes
> > to see if that cleared up the issue, but it did not. This problem has
> > been going on for about a month now, but it was much less frequent
> > initially - I'd see a crash once every few days or so. I took a look
> > through the mailing list and bug reports, but wasn't able to find
> > anything resembling this problem.
> >
> > I am running a second cluster - also 12.2.2, CentOS 7.4.1708, and
> > kernel version 3.10.0-693.el7.x86_64 - but I do not see the issue
> > there.
> >
> > Log messages always look similar to the following, and I've pulled out
> > the backtrace from a core dump as well. The aborting thread always
> > appears to be a msgr-worker thread.
> >
>
> <SNIP>
>
> > #7  0x00007f9e731a3a36 in __cxxabiv1::__terminate (handler=<optimized
> > out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
> > #8  0x00007f9e731a3a63 in std::terminate () at
> > ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
> > #9  0x00007f9e731fa345 in std::(anonymous
> > namespace)::execute_native_thread_routine (__p=<optimized out>) at
> > ../../../../../libstdc++-v3/src/c++11/thread.cc:92
>
> That backtrace corresponds to this code executing:
>
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/thread.cc;h=0351f19e042b0701ba3c2597ecec87144fd631d5;hb=cf82a597b0d189857acb34a08725762c4f5afb50#l76
>
> So the problem is that we are generating an exception when our thread
> runs; we should probably catch it before it gets this far, but that's
> another story...
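>
> For illustration (a minimal sketch, not Ceph code): any exception that
> escapes a std::thread's function object reaches libstdc++'s
> execute_native_thread_routine, which calls std::terminate() and aborts
> the process, which is exactly the pattern in your backtrace.
>
>   #include <stdexcept>
>   #include <thread>
>
>   int main() {
>     // The lambda stands in for the msgr-worker loop; the exception text
>     // mimics the one the OSD hit. Nothing inside the thread catches it,
>     // so libstdc++ calls std::terminate() and the process aborts.
>     std::thread t([] {
>       throw std::runtime_error("entity_addr_t marker != 1");
>     });
>     t.join(); // the worker's uncaught exception aborts the process first
>   }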
>
> The exception is "buffer::malformed_input: entity_addr_t marker != 1"
> and there is some precedent for this
> (https://tracker.ceph.com/issues/21660,
> https://tracker.ceph.com/issues/24819), but I don't think either of
> those is your issue.
>
> We generated that exception because we encountered an ill-formed
> entity_addr_t whilst decoding a message.
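>
> Roughly what happens during decode (an illustrative sketch with made-up
> helper names, not the actual entity_addr_t code): the encoded address
> begins with a marker byte, and if the byte read off the wire isn't the
> expected value the decoder throws buffer::malformed_input with exactly
> the message you're seeing.
>
>   #include <cstdint>
>   #include <stdexcept>
>
>   // Hypothetical stand-in for a bufferlist iterator.
>   struct byte_stream {
>     const uint8_t* p;
>     uint8_t get_u8() { return *p++; }
>   };
>
>   void decode_addr(byte_stream& bl) {
>     uint8_t marker = bl.get_u8();
>     if (marker != 1)  // ill-formed or unexpected encoding
>       throw std::runtime_error("entity_addr_t marker != 1");
>     // ... decode the address type, nonce and sockaddr ...
>   }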
>
> Could you open a tracker for this issue and upload the entire log from
> a crash, preferably with "debug ms >= 5"? Be careful, as this will
> create very large log files. You can use ceph-post-file to upload
> large compressed files.
>
> Let me know the tracker ID here once you've created it.
>
> P.S. This is likely fixed in a later version of Luminous since you
> seem to be the only one hitting it. Either that or there is something
> unusual about your environment.
>
> >
> > Has anyone else seen this? Any suggestions on how to proceed? I do
> > intend to upgrade to Mimic but would prefer to do it when the cluster
> > is stable.
> >
> > Thanks for your help.
> > Josh
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


