Hi cephers,

I have a cluster of 7 storage nodes with 12 drives each, and the OSD processes are regularly crashing. All 84 have crashed at least once in the past two days.

The cluster is Luminous 12.2.2 on CentOS 7.4.1708, kernel 3.10.0-693.el7.x86_64. I rebooted one of the OSD nodes to see if that cleared up the issue, but it did not. This problem has been going on for about a month now, though it was much less frequent initially - I'd see a crash once every few days or so. I looked through the mailing list and the bug tracker but wasn't able to find anything resembling this problem. I also run a second cluster - likewise 12.2.2, CentOS 7.4.1708, kernel 3.10.0-693.el7.x86_64 - and I do not see the issue there.

The log messages always look similar to the following, and I've pulled a backtrace out of a core dump as well. The aborting thread always appears to be a msgr-worker.

Sep 25 00:31:09 sn02 ceph-osd[26077]: terminate called after throwing an instance of 'ceph::buffer::malformed_input'
Sep 25 00:31:09 sn02 ceph-osd[26077]:   what():  buffer::malformed_input: entity_addr_t marker != 1
Sep 25 00:31:09 sn02 ceph-osd[26077]: *** Caught signal (Aborted) **
Sep 25 00:31:09 sn02 ceph-osd[26077]: in thread 7fd7d5200700 thread_name:msgr-worker-2
Sep 25 00:31:09 sn02 ceph-osd[26077]: ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
Sep 25 00:31:09 sn02 ceph-osd[26077]: 1: (()+0xa339e1) [0x56310fbe39e1]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 2: (()+0xf5e0) [0x7fd7d8ae25e0]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 3: (gsignal()+0x37) [0x7fd7d7b0b1f7]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 4: (abort()+0x148) [0x7fd7d7b0c8e8]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fd7d8411ac5]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 6: (()+0x5ea36) [0x7fd7d840fa36]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 7: (()+0x5ea63) [0x7fd7d840fa63]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 8: (()+0xb5345) [0x7fd7d8466345]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 9: (()+0x7e25) [0x7fd7d8adae25]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 10: (clone()+0x6d) [0x7fd7d7bce34d]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 2018-09-25 00:31:09.849285 7fd7d5200700 -1 *** Caught signal (Aborted) **
 in thread 7fd7d5200700 thread_name:msgr-worker-2

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0xa339e1) [0x56310fbe39e1]
 2: (()+0xf5e0) [0x7fd7d8ae25e0]
 3: (gsignal()+0x37) [0x7fd7d7b0b1f7]
 4: (abort()+0x148) [0x7fd7d7b0c8e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fd7d8411ac5]
 6: (()+0x5ea36) [0x7fd7d840fa36]
 7: (()+0x5ea63) [0x7fd7d840fa63]
 8: (()+0xb5345) [0x7fd7d8466345]
 9: (()+0x7e25) [0x7fd7d8adae25]
 10: (clone()+0x6d) [0x7fd7d7bce34d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
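In case it's useful, this is roughly how I pulled the backtrace below out of the core (the core file path is just an example - adjust for wherever your core_pattern/abrt setup drops cores; I have the ceph debuginfo packages installed, which is why the frames resolve to source lines):

    # core file name below is an example, not the actual path on my node
    $ gdb /usr/bin/ceph-osd /var/lib/ceph/core.ceph-osd.26077
    (gdb) set pagination off
    (gdb) thread apply all bt    # dump every thread; the aborting one is the msgr-worker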
#0  0x00007f9e738764ab in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x000055925e1edab6 in reraise_fatal (signum=6) at /usr/src/debug/ceph-12.2.2/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.2.2/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007f9e7289f1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5  0x00007f9e728a08e8 in __GI_abort () at abort.c:90
#6  0x00007f9e731a5ac5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#7  0x00007f9e731a3a36 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#8  0x00007f9e731a3a63 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007f9e731fa345 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:92
#10 0x00007f9e7386ee25 in start_thread (arg=0x7f9e6ff94700) at pthread_create.c:308
#11 0x00007f9e7296234d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Has anyone else seen this? Any suggestions on how to proceed? I do intend to upgrade to Mimic, but I'd prefer to do that once the cluster is stable.

Thanks for your help,
Josh