Re: OSDs crashing

Brad Hubbard <bhubbard@xxxxxxxxxx> · Wed, 26 Sep 2018 12:08:12 +1000

On Tue, Sep 25, 2018 at 11:31 PM Josh Haft <paccrap@xxxxxxxxx> wrote:
>
> Hi cephers,
>
> I have a cluster of 7 storage nodes with 12 drives each and the OSD
> processes are regularly crashing. All 84 have crashed at least once in
> the past two days. Cluster is Luminous 12.2.2 on CentOS 7.4.1708,
> kernel version 3.10.0-693.el7.x86_64. I rebooted one of the OSD nodes
> to see if that cleared up the issue, but it did not. This problem has
> been going on for about a month now, but it was much less frequent
> initially - I'd see a crash once every few days or so. I took a look
> through the mailing list and bug reports, but wasn't able to find
> anything resembling this problem.
>
> I am running a second cluster - also 12.2.2, CentOS 7.4.1708, and
> kernel version 3.10.0-693.el7.x86_64 - but I do not see the issue
> there.
>
> Log messages always look similar to the following, and I've pulled out
> the back trace from a core dump as well. The aborting thread always
> looks to be msgr-worker.
>

<SNIP>

> #7  0x00007f9e731a3a36 in __cxxabiv1::__terminate (handler=<optimized
> out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
> #8  0x00007f9e731a3a63 in std::terminate () at
> ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
> #9  0x00007f9e731fa345 in std::(anonymous
> namespace)::execute_native_thread_routine (__p=<optimized out>) at
> ../../../../../libstdc++-v3/src/c++11/thread.cc:92

That is this code executing.

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/thread.cc;h=0351f19e042b0701ba3c2597ecec87144fd631d5;hb=cf82a597b0d189857acb34a08725762c4f5afb50#l76

So the problem is that an exception is thrown when our thread runs and
nothing catches it, so std::terminate gets called. We should probably
catch that before it gets to here, but that's another story...
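
To illustrate the general mechanism (this is not Ceph's messenger
code): an exception that escapes a std::thread's entry function ends
up in std::terminate, which matches frames #7 to #9 above. A minimal
sketch of catching it at the top of the thread instead, where
run_guarded is a hypothetical helper name of my own:

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper, not part of Ceph.
// Runs `work` on a new thread and returns the message of any exception it
// threw, or an empty string if it finished normally. Without the try/catch,
// an exception escaping the thread function would reach std::terminate().
std::string run_guarded(const std::function<void()>& work) {
    std::string error;
    std::thread t([&work, &error] {
        try {
            work();
        } catch (const std::exception& e) {
            error = e.what();
        } catch (...) {
            error = "unknown exception";
        }
    });
    t.join();  // join() synchronizes, so reading `error` afterwards is safe
    return error;
}
```

Calling run_guarded with a function that throws std::runtime_error
returns the exception's what() string rather than aborting the
process.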

The exception is "buffer::malformed_input: entity_addr_t marker != 1"
and there is some precedent for this
(https://tracker.ceph.com/issues/21660,
https://tracker.ceph.com/issues/24819) but I don't think they are your
issue.

We generated that exception because we encountered an ill-formed
entity_addr_t whilst decoding a message.
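
As a rough picture of that kind of check, here is a toy decoder that
validates a leading marker byte and throws if it is wrong. Everything
in it is an assumption for illustration: toy_addr, decode_toy_addr,
the local malformed_input type and the 5-byte layout are made up and
are not Ceph's real entity_addr_t encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an exception like buffer::malformed_input.
struct malformed_input : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical, simplified record: 1 marker byte followed by a
// 4-byte little-endian nonce. NOT the real Ceph wire format.
struct toy_addr {
    std::uint32_t nonce = 0;
};

toy_addr decode_toy_addr(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < 5) {
        throw malformed_input("toy_addr: buffer too short");
    }
    if (buf[0] != 1) {
        // Corrupt or unexpected bytes on the wire end up here.
        throw malformed_input("toy_addr marker != 1");
    }
    toy_addr a;
    for (std::size_t i = 0; i < 4; ++i) {
        a.nonce |= static_cast<std::uint32_t>(buf[1 + i]) << (8 * i);
    }
    return a;
}
```

If nothing on the calling thread catches malformed_input, the process
aborts in the way the backtrace shows.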

Could you open a tracker for this issue and upload the entire log from
a crash, preferably with "debug ms >= 5"? Be careful, as this will
create very large log files. You can use ceph-post-file to upload
large compressed files.

Let me know the tracker ID here once you've created it.

P.S. This is likely fixed in a later version of Luminous since you
seem to be the only one hitting it. Either that or there is something
unusual about your environment.

>
> Has anyone else seen this? Any suggestions on how to proceed? I do
> intend to upgrade to Mimic but would prefer to do it when the cluster
> is stable.
>
> Thanks for your help.
> Josh
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Cheers,
Brad