Hello,

Yesterday I encountered a strange OSD crash which led to cluster flapping. I had to force the nodown flag on the cluster to stop the flapping.

The first OSD crashed with:

2018-08-02 17:23:23.275417 7f87ec8d7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8803dfb700' had timed out after 15
2018-08-02 17:23:23.275425 7f87ec8d7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8805dff700' had timed out after 15
....
2018-08-02 17:25:38.902142 7f8829df0700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8803dfb700' had suicide timed out after 150
2018-08-02 17:25:38.907199 7f8829df0700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f8829df0700 time 2018-08-02 17:25:38.902354
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55872911fb65]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x55872905e8f1]
3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x55872905f14e]
4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x55872905f92c]
5: (CephContextServiceThread::entry()+0x15b) [0x55872913790b]
6: (()+0x7e25) [0x7f882dc71e25]
7: (clone()+0x6d) [0x7f882c2f8bad]

Then other OSDs started restarting with messages like this:

2018-08-02 17:37:14.859272 7f4bd31fe700 0 osd.44 184343 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2018-08-02 17:37:14.870121 7f4bd31fe700 0 osd.44 184343 _committed_osd_maps shutdown OSD via async signal
2018-08-02 17:37:14.870159 7f4bb9618700 -1 osd.44 184343 *** Got signal Interrupt ***
2018-08-02 17:37:14.870167 7f4bb9618700 0 osd.44 184343 prepare_to_stop starting shutdown

There is a 10k line event dump with the first OSD crash.
I have looked through it and nothing stood out to me. Any suggestions on what I should be looking for in it?

I have checked the nodes' dmesg output and the switch port logs. There is no sign of flapping ports or interfaces, and no disk errors at all.
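
One way to narrow down a dump of that size might be to tally the heartbeat_map timeouts per thread, to see which op threads stalled and how often. The following is only a minimal sketch, assuming the dump uses the same line format as the excerpts above; the name summarize_heartbeats is hypothetical and not part of any Ceph tooling.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpts in this post:
#   ... heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8803dfb700' had timed out after 15
#   ... heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8803dfb700' had suicide timed out after 150
HEARTBEAT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<kind>suicide timed out|timed out) after (?P<secs>\d+)"
)


def summarize_heartbeats(lines):
    """Count heartbeat timeout messages per (thread pool, thread id, kind).

    Hypothetical helper for illustration; lines that do not match the
    assumed format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            key = (match.group("pool"), match.group("thread"), match.group("kind"))
            counts[key] += 1
    return counts


def print_summary(path):
    """Print the tally for a log file, most frequent entries first."""
    with open(path, errors="replace") as handle:
        counts = summarize_heartbeats(handle)
    for (pool, thread, kind), n in counts.most_common():
        print(f"{n:6d}  {pool}  {thread}  {kind}")
```

Calling print_summary with the path to the OSD log would list the threads that timed out most often, which could then be matched against the surrounding lines in the dump.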