Hi cephers,

I have a cluster of 7 storage nodes with 12 drives each, and the OSD processes are regularly crashing. All 84 have crashed at least once in the past two days.

The cluster is Luminous 12.2.2 on CentOS 7.4.1708, kernel 3.10.0-693.el7.x86_64. I rebooted one of the OSD nodes to see if that cleared up the issue, but it did not. This problem has been going on for about a month now, though it was much less frequent initially - I'd see a crash once every few days or so. I looked through the mailing list and the bug tracker but wasn't able to find anything resembling this problem. I also run a second cluster - likewise 12.2.2, CentOS 7.4.1708, kernel 3.10.0-693.el7.x86_64 - and I do not see the issue there.

The log messages always look similar to the following, and I've pulled a backtrace out of a core dump as well. The aborting thread always appears to be a msgr-worker.

Sep 25 00:31:09 sn02 ceph-osd[26077]: terminate called after throwing an instance of 'ceph::buffer::malformed_input'
Sep 25 00:31:09 sn02 ceph-osd[26077]:   what():  buffer::malformed_input: entity_addr_t marker != 1
Sep 25 00:31:09 sn02 ceph-osd[26077]: *** Caught signal (Aborted) **
Sep 25 00:31:09 sn02 ceph-osd[26077]: in thread 7fd7d5200700 thread_name:msgr-worker-2
Sep 25 00:31:09 sn02 ceph-osd[26077]: ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
Sep 25 00:31:09 sn02 ceph-osd[26077]: 1: (()+0xa339e1) [0x56310fbe39e1]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 2: (()+0xf5e0) [0x7fd7d8ae25e0]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 3: (gsignal()+0x37) [0x7fd7d7b0b1f7]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 4: (abort()+0x148) [0x7fd7d7b0c8e8]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fd7d8411ac5]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 6: (()+0x5ea36) [0x7fd7d840fa36]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 7: (()+0x5ea63) [0x7fd7d840fa63]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 8: (()+0xb5345) [0x7fd7d8466345]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 9: (()+0x7e25) [0x7fd7d8adae25]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 10: (clone()+0x6d) [0x7fd7d7bce34d]
Sep 25 00:31:09 sn02 ceph-osd[26077]: 2018-09-25 00:31:09.849285 7fd7d5200700 -1 *** Caught signal (Aborted) **
 in thread 7fd7d5200700 thread_name:msgr-worker-2

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0xa339e1) [0x56310fbe39e1]
 2: (()+0xf5e0) [0x7fd7d8ae25e0]
 3: (gsignal()+0x37) [0x7fd7d7b0b1f7]
 4: (abort()+0x148) [0x7fd7d7b0c8e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fd7d8411ac5]
 6: (()+0x5ea36) [0x7fd7d840fa36]
 7: (()+0x5ea63) [0x7fd7d840fa63]
 8: (()+0xb5345) [0x7fd7d8466345]
 9: (()+0x7e25) [0x7fd7d8adae25]
 10: (clone()+0x6d) [0x7fd7d7bce34d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
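In case it's useful, this is roughly how I pulled the backtrace below out of the core (the core file path is just an example - adjust for wherever your core_pattern/abrt setup drops cores; I have the ceph debuginfo packages installed, which is why the frames resolve to source lines):

    # core file name below is an example, not the actual path on my node
    $ gdb /usr/bin/ceph-osd /var/lib/ceph/core.ceph-osd.26077
    (gdb) set pagination off
    (gdb) thread apply all bt    # dump every thread; the aborting one is the msgr-worker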
#0  0x00007f9e738764ab in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x000055925e1edab6 in reraise_fatal (signum=6) at /usr/src/debug/ceph-12.2.2/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.2.2/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007f9e7289f1f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5  0x00007f9e728a08e8 in __GI_abort () at abort.c:90
#6  0x00007f9e731a5ac5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#7  0x00007f9e731a3a36 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#8  0x00007f9e731a3a63 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007f9e731fa345 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:92
#10 0x00007f9e7386ee25 in start_thread (arg=0x7f9e6ff94700) at pthread_create.c:308
#11 0x00007f9e7296234d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Has anyone else seen this? Any suggestions on how to proceed? I do intend to upgrade to Mimic, but I'd prefer to do that once the cluster is stable.

Thanks for your help,
Josh