Hi all, we are facing a very annoying and disruptive problem. It happens only on a single type of disk:

  Vendor:        TOSHIBA
  Product:       PX05SMB040Y
  Revision:      AS10
  Compliance:    SPC-4
  User Capacity: 400,088,457,216 bytes [400 GB]

  schedulers: mq-deadline kyber [bfq] none

The default for these disks is none. Could this be a problem?

On these disks we have 4 OSDs deployed (yes, the ones that ran out of space during conversion). These disks hold our ceph fs meta data. Currently there is no load; we unmounted all clients due to problems during the OSD conversions. The problem seems more likely under high load, but it also happens with very little load, like we have now.

We run the OSD daemons inside a CentOS 8 container built from quay.io/ceph/ceph:v15.2.17 on a CentOS 7 host with kernel version

  # uname -r
  5.14.13-1.el7.elrepo.x86_64

The lvm versions on the host and inside the container are almost identical:

  [host]# yum list installed | grep lvm
  lvm2.x86_64       7:2.02.187-6.el7_9.5   @updates
  lvm2-libs.x86_64  7:2.02.187-6.el7_9.5   @updates

  [con]# yum list installed | grep lvm
  lvm2.x86_64       8:2.03.14-5.el8        @baseos
  lvm2-libs.x86_64  8:2.03.14-5.el8        @baseos

We have >1000 OSDs and only the OSDs on these disks are causing trouble. The symptom is as if the disk suddenly gets stuck and does not accept IO any more. Trying to kill the hanging OSD daemons puts them in D-state.

The very odd thing is that ceph did not recognise all 4 down OSDs correctly. 1 out of 4 OSDs crashed (see log below) and the other 3 OSD daemons got stuck. These 3 stuck daemons were marked as down. However, the one that crashed was *not* marked as down even though it was dead for good (its process no longer showed up in ps; the other 3 did). This caused IO to hang, and I don't understand how it is possible that this OSD was not recognised as down. There must be plenty of reporters. I see a few messages like this (osd.975 is the one that crashed):

  Oct 8 16:08:54 ceph-13 ceph-osd: 2022-10-08T16:08:54.913+0200 7f942817b700 -1 osd.990 912445 heartbeat_check: no reply from 192.168.32.88:7079 osd.975 since back 2022-10-08T16:08:34.029625+0200 front 2022-10-08T16:08:34.029288+0200 (oldest deadline 2022-10-08T16:08:54.528209+0200)
  [...]
  Oct 8 16:08:56 ceph-08 journal: 2022-10-08T16:08:56.195+0200 7fb85ce4d700 -1 osd.352 912445 heartbeat_check: no reply from 192.168.32.88:7079 osd.975 since back 2022-10-08T16:08:31.763519+0200 front 2022-10-08T16:08:31.764077+0200 (oldest deadline 2022-10-08T16:08:55.861407+0200)

But nothing happened.
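For the record, this is roughly how I intend to check the down-reporting side the next time this happens. It is only a sketch, assuming default option names and the standard cluster log location; osd.975 stands for whichever OSD is the dead one, and the grep pattern is from memory:

  # how many distinct reporters the mons need before marking an OSD down,
  # and at which CRUSH subtree level reports are de-duplicated
  ceph config get mon mon_osd_min_down_reporters
  ceph config get mon mon_osd_reporter_subtree_level

  # whether the mons received failure reports for the dead OSD at all
  # (exact message wording may differ)
  grep 'osd.975 reported failed' /var/log/ceph/ceph.log

  # what the osdmap currently says about the OSD
  ceph osd dump | grep '^osd.975 '

  # last resort: mark the dead OSD down by hand so that IO can fail over
  ceph osd down 975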
Here is some OSD log info. This is where everything starts:

  Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
  ** File Read Latency Histogram By Level [default] **
  2022-10-08T16:08:34.439+0200 7fbdf567a700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had timed out after 15
  2022-10-08T16:08:34.440+0200 7fbdf4e79700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had timed out after 15
  [... loads and loads of these ...]
  2022-10-08T16:10:51.065+0200 7fbdf4678700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had suicide timed out after 150
  2022-10-08T16:10:52.072+0200 7fbdf4678700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::coarse_mono_clock::rep)' thread 7fbdf4678700 time 2022-10-08T16:10:52.065768+0200
  /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/HeartbeatMap.cc: 80: ceph_abort_msg("hit suicide timeout")

  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x556b9b10cb32]
  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x295) [0x556b9b82c795]
  3: (ceph::HeartbeatMap::is_healthy()+0x112) [0x556b9b82d292]
  4: (OSD::handle_osd_ping(MOSDPing*)+0xc2f) [0x556b9b1e253f]
  5: (OSD::heartbeat_dispatch(Message*)+0x1db) [0x556b9b1e44eb]
  6: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x155) [0x556b9bb83aa5]
  7: (ProtocolV2::handle_message()+0x142a) [0x556b9bbb941a]
  8: (ProtocolV2::handle_read_frame_dispatch()+0x258) [0x556b9bbcb418]
  9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x556b9bbcb515]
  10: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x92) [0x556b9bbcc912]
  11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x556b9bbb480c]
  12: (AsyncConnection::process()+0x8a9) [0x556b9bb8b6c9]
  13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x556b9b9e22c7]
  14: (()+0xde78ac) [0x556b9b9e78ac]
  15: (()+0xc2ba3) [0x7fbdf84c8ba3]
  16: (()+0x81ca) [0x7fbdf8e751ca]
  17: (clone()+0x43) [0x7fbdf7adfdd3]

  2022-10-08T16:10:52.078+0200 7fbdf4678700 -1 *** Caught signal (Aborted) ** in thread 7fbdf4678700 thread_name:msgr-worker-2

  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
  1: (()+0x12ce0) [0x7fbdf8e7fce0]
  2: (gsignal()+0x10f) [0x7fbdf7af4a9f]
  3: (abort()+0x127) [0x7fbdf7ac7e05]
  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x556b9b10cc03]
  5: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x295) [0x556b9b82c795]
  6: (ceph::HeartbeatMap::is_healthy()+0x112) [0x556b9b82d292]
  7: (OSD::handle_osd_ping(MOSDPing*)+0xc2f) [0x556b9b1e253f]
  8: (OSD::heartbeat_dispatch(Message*)+0x1db) [0x556b9b1e44eb]
  9: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x155) [0x556b9bb83aa5]
  10: (ProtocolV2::handle_message()+0x142a) [0x556b9bbb941a]
  11: (ProtocolV2::handle_read_frame_dispatch()+0x258) [0x556b9bbcb418]
  12: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x556b9bbcb515]
  13: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x92) [0x556b9bbcc912]
  14: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x556b9bbb480c]
  15: (AsyncConnection::process()+0x8a9) [0x556b9bb8b6c9]
  16: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x556b9b9e22c7]
  17: (()+0xde78ac) [0x556b9b9e78ac]
  18: (()+0xc2ba3) [0x7fbdf84c8ba3]
  19: (()+0x81ca) [0x7fbdf8e751ca]
  20: (clone()+0x43) [0x7fbdf7adfdd3]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
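If it happens again, this is roughly what I plan to capture on the host before rebooting, to see whether the hang is in the kernel/driver or in the device itself. Only a sketch; sdX and <pid> are placeholders for the affected device and the stuck ceph-osd processes:

  # which processes are stuck in uninterruptible sleep, and where in the kernel
  ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
  cat /proc/<pid>/stack        # kernel stack of each stuck ceph-osd

  # dump all blocked tasks to the kernel log (needs sysrq enabled)
  echo w > /proc/sysrq-trigger

  # look for resets/timeouts/aborts on the SCSI/driver side
  dmesg -T | grep -iE 'sdX|scsi|reset|timeout|abort'

  # health/error counters of the disk itself
  smartctl -a /dev/sdX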
What I'm most interested in right now is whether anyone has an idea what the underlying issue behind these disks freezing might be, and why the crashed OSD was not recognised as down. Any hints on what to check if it happens again are also welcome.

Many thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14