Strange OSD crash starts other OSDs flapping

Daznis <daznis@xxxxxxxxx> · Fri, 3 Aug 2018 12:16:43 +0300

Hello,

Yesterday I encountered a strange OSD crash which led to OSDs
flapping across the cluster. I had to set the nodown flag on the
cluster to stop the flapping. The first OSD crashed with:

2018-08-02 17:23:23.275417 7f87ec8d7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8803dfb700' had timed out after 15
2018-08-02 17:23:23.275425 7f87ec8d7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8805dff700' had timed out after 15
....
2018-08-02 17:25:38.902142 7f8829df0700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8803dfb700' had suicide timed out after 150
2018-08-02 17:25:38.907199 7f8829df0700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f8829df0700 time 2018-08-02 17:25:38.902354
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55872911fb65]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x55872905e8f1]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x55872905f14e]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x55872905f92c]
 5: (CephContextServiceThread::entry()+0x15b) [0x55872913790b]
 6: (()+0x7e25) [0x7f882dc71e25]
 7: (clone()+0x6d) [0x7f882c2f8bad]
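
The log above shows two thresholds for the osd_op_tp threads: a
warning after 15 seconds and the fatal "suicide" timeout after 150
seconds. To see which worker threads stalled and for how long in a
large dump, the heartbeat_map lines can be grouped per thread. Below
is a minimal Python sketch of that idea. It assumes the log lines have
been unwrapped to one entry per line; the function name
summarize_heartbeat and the dictionary keys are made up for
illustration and are not part of Ceph.

```python
import re
from datetime import datetime

# Matches one unwrapped heartbeat_map line in the format quoted above.
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+\s+-?\d+ "
    r"heartbeat_map is_healthy '\S+ thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<kind>suicide timed out|timed out) after \d+"
)


def summarize_heartbeat(lines):
    """Group heartbeat timeouts by worker thread (hypothetical helper)."""
    summary = {}
    for line in lines:
        m = HEARTBEAT_RE.match(line)
        if m is None:
            continue  # not a heartbeat_map line
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        entry = summary.setdefault(
            m.group("thread"),
            {"first_warning": None, "warnings": 0, "suicide_at": None},
        )
        if m.group("kind") == "suicide timed out":
            entry["suicide_at"] = ts
        else:
            entry["warnings"] += 1
            if entry["first_warning"] is None:
                entry["first_warning"] = ts
    return summary


# The three heartbeat lines quoted above, unwrapped.
SAMPLE_LINES = [
    "2018-08-02 17:23:23.275417 7f87ec8d7700  1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f8803dfb700' had timed out after 15",
    "2018-08-02 17:23:23.275425 7f87ec8d7700  1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f8805dff700' had timed out after 15",
    "2018-08-02 17:25:38.902142 7f8829df0700  1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f8803dfb700' had suicide timed out after 150",
]

if __name__ == "__main__":
    for thread, info in summarize_heartbeat(SAMPLE_LINES).items():
        print(thread, info)
```

Threads that only ever warn have recovered; the thread that reaches
the suicide timeout is the one whose last operation is worth tracing
in the event dump.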

Then other osds started restarting with messages like this:

2018-08-02 17:37:14.859272 7f4bd31fe700  0 osd.44 184343 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2018-08-02 17:37:14.870121 7f4bd31fe700  0 osd.44 184343 _committed_osd_maps shutdown OSD via async signal
2018-08-02 17:37:14.870159 7f4bb9618700 -1 osd.44 184343 *** Got signal Interrupt ***
2018-08-02 17:37:14.870167 7f4bb9618700  0 osd.44 184343 prepare_to_stop starting shutdown
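
Going by the message itself, osd.44 shut down because it had been
marked down 6 times within the last 600 seconds, which is above the
limit of 5. That reads like a sliding-window counter. The Python
sketch below only illustrates that reading of the log line; it is not
Ceph's code, and the class and method names (MarkdownTracker,
record_markdown) are made up for the example.

```python
from collections import deque


class MarkdownTracker:
    """Illustrative sliding-window counter, not Ceph's implementation."""

    def __init__(self, max_count=5, period=600.0):
        # Defaults taken from the values printed in the log message above.
        self.max_count = max_count
        self.period = period
        self._events = deque()

    def record_markdown(self, now):
        """Record a mark-down at time `now` (seconds).

        Returns True when the number of mark-downs inside the window
        exceeds max_count, i.e. when the OSD would give up and stop.
        """
        self._events.append(now)
        cutoff = now - self.period
        # Forget mark-downs that are older than the window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) > self.max_count


if __name__ == "__main__":
    tracker = MarkdownTracker()
    # Six mark-downs one minute apart: only the sixth crosses the limit.
    for t in (0, 60, 120, 180, 240, 300):
        print(t, tracker.record_markdown(t))
```

Under this reading, an OSD that keeps getting marked down by its peers
while the cluster is flapping ends up stopping itself, which matches
the restarts seen on the other OSDs.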

There is a 10k-line event dump from the first OSD crash. I have looked
through it and nothing strange stood out to me. Any suggestions on what
I should be looking for in it? I have checked the nodes' dmesg output
and the switch port logs. There is no sign of flapping ports or
interfaces and no disk errors at all.