Created: https://tracker.ceph.com/issues/36250

On Tue, Sep 25, 2018 at 9:08 PM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> On Tue, Sep 25, 2018 at 11:31 PM Josh Haft <paccrap@xxxxxxxxx> wrote:
> >
> > Hi cephers,
> >
> > I have a cluster of 7 storage nodes with 12 drives each and the OSD
> > processes are regularly crashing. All 84 have crashed at least once in
> > the past two days. Cluster is Luminous 12.2.2 on CentOS 7.4.1708,
> > kernel version 3.10.0-693.el7.x86_64. I rebooted one of the OSD nodes
> > to see if that cleared up the issue, but it did not. This problem has
> > been going on for about a month now, but it was much less frequent
> > initially - I'd see a crash once every few days or so. I took a look
> > through the mailing list and bug reports, but wasn't able to find
> > anything resembling this problem.
> >
> > I am running a second cluster - also 12.2.2, CentOS 7.4.1708, and
> > kernel version 3.10.0-693.el7.x86_64 - but I do not see the issue
> > there.
> >
> > Log messages always look similar to the following, and I've pulled out
> > the back trace from a core dump as well. The aborting thread always
> > looks to be msgr-worker.
> >
>
> <SNIP>
>
> > #7 0x00007f9e731a3a36 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
> > #8 0x00007f9e731a3a63 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
> > #9 0x00007f9e731fa345 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:92
>
> That is this code executing.
>
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/src/c%2B%2B11/thread.cc;h=0351f19e042b0701ba3c2597ecec87144fd631d5;hb=cf82a597b0d189857acb34a08725762c4f5afb50#l76
>
> So the problem is we are generating an exception when our thread gets
> run, we should probably catch that before it gets to here but that's
> another story...
>
> The exception is "buffer::malformed_input: entity_addr_t marker != 1"
> and there is some precedent for this
> (https://tracker.ceph.com/issues/21660,
> https://tracker.ceph.com/issues/24819) but I don't think they are your
> issue.
>
> We generated that exception because we encountered an ill-formed
> entity_addr_t whilst decoding a message.
>
> Could you open a tracker for this issue and upload the entire log from
> a crash, preferably with "debug ms >= 5" but be careful as this will
> create very large log files. You can use ceph-post-file to upload
> large compressed files.
>
> Let me know the tracker ID here once you've created it.
>
> P.S. This is likely fixed in a later version of Luminous since you
> seem to be the only one hitting it. Either that or there is something
> unusual about your environment.
>
> >
> > Has anyone else seen this? Any suggestions on how to proceed? I do
> > intend to upgrade to Mimic but would prefer to do it when the cluster
> > is stable.
> >
> > Thanks for your help.
> > Josh
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Cheers,
> Brad
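
For anyone reading this thread later, the failure mode described above (a decode exception leaving the thread's entry function, so the C++ runtime calls std::terminate and the process aborts) can be sketched roughly as follows. This is a toy illustration and not Ceph's code: malformed_input, decode_marker and decode_on_worker are hypothetical names, and the one-byte "marker" check only stands in for the real entity_addr_t decoding, which is more involved. It shows the "catch it before it gets there" idea, with the catch placed inside the thread function.

```cpp
#include <cstdint>
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a decode error; not Ceph's buffer::malformed_input.
struct malformed_input : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical toy decoder: the first byte is a marker that must equal 1.
std::uint8_t decode_marker(const std::vector<std::uint8_t>& buf) {
    if (buf.empty() || buf[0] != 1) {
        throw malformed_input("entity_addr_t marker != 1");
    }
    return buf[0];
}

// Runs the decoder on a worker thread. The try/catch inside the thread's
// entry function keeps the exception from leaving the thread; if it
// escaped, the runtime would call std::terminate and abort the process.
// Returns an empty string on success, or the exception message on failure.
std::string decode_on_worker(const std::vector<std::uint8_t>& buf) {
    std::string error;
    std::thread worker([&buf, &error] {
        try {
            decode_marker(buf);
        } catch (const std::exception& e) {
            error = e.what();  // record the failure instead of aborting
        }
    });
    // join() synchronizes with the worker finishing, so reading `error`
    // afterwards is safe without extra locking.
    worker.join();
    return error;
}
```

Calling decode_on_worker with a buffer whose first byte is not 1 returns the error text to the caller rather than taking the process down. Whether a daemon should log and drop the connection or fail in some other way at that point is a separate design question that this sketch does not try to answer.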