Re: LVM OSDs lose connection to disk

Hi Frank,

Can't advise much on the disk issue - just an obvious thought about upgrading the firmware and/or contacting the vendor. IIUC the disk is totally inaccessible at this point, i.e. you're unable to read from it bypassing LVM as well, right? If so, this definitely looks like a low-level problem.
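If it helps to confirm that, a quick way to take LVM out of the picture would be something like the following (sdX is just a placeholder for the affected Toshiba device):

# dd if=/dev/sdX of=/dev/null bs=1M count=16 iflag=direct     <- read straight from the raw device, bypassing LVM/device-mapper
# smartctl -x /dev/sdX                                        <- firmware revision, health and error counters
# dmesg -T | grep -i sdX                                      <- any resets/aborts the kernel logged for that device

If even the raw dd read hangs, LVM and the OSDs are out of the equation.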


As for the OSD down issue - may I have some clarification please: did this osd.975 never go down, or did it go down just a few minutes later? In the log snippet you shared I can see a 2 min gap between the operation timeout indications and the final OSD suicide. I presume it had been able to respond to heartbeats prior to that suicide and hence stayed online... But I'm mostly speculating so far...
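One thing that might be worth checking (just a guess - these are the standard mon settings, nothing specific to your setup) is how many failure reporters the mons require before marking an OSD down, and whether the failure reports for osd.975 actually reached the mons at all:

# ceph config get mon mon_osd_min_down_reporters
# ceph config get mon mon_osd_reporter_subtree_level
# grep 'reported failed' /var/log/ceph/ceph.log               <- cluster log on a mon host; the path may differ in your setup

If the reports are there but the OSD still stayed up, that would be rather interesting.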


Thanks,

Igor


On 10/8/2022 6:43 PM, Frank Schilder wrote:
Hi all,

we are facing a very annoying and disruptive problem. This happens only on a single type of disk:

Vendor:               TOSHIBA
Product:              PX05SMB040Y
Revision:             AS10
Compliance:           SPC-4
User Capacity:        400,088,457,216 bytes [400 GB]

schedulers: mq-deadline kyber [bfq] none

The default for these disks is none. Could this be a problem?
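For anyone who wants to reproduce the check or try a different scheduler, this is just the usual sysfs interface (sdX being one of the Toshiba devices):

# cat /sys/block/sdX/queue/scheduler                          <- lists available schedulers, the active one in brackets
# echo none > /sys/block/sdX/queue/scheduler                  <- switch the scheduler at runtime, e.g. back to none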

On these disks we have 4 OSDs deployed (yes, the ones that ran out of space during conversion). These disks hold our ceph fs meta data. Currently there is no load, we unmounted all clients due to problems during OSD conversions. The problem seems more likely under high load, but happens also with very little load, like we have now.
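In case it matters, the mapping of OSDs to LVs on such a disk can be listed like this (sdX again a placeholder for the device in question):

# ceph-volume lvm list /dev/sdX                               <- maps the LVs on the device to OSD ids
# lvs -o +devices | grep sdX                                  <- plain LVM view of the same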

We run the OSD daemons inside a CentOS 8 container built from quay.io/ceph/ceph:v15.2.17 on a CentOS 7 host with kernel version

# uname -r
5.14.13-1.el7.elrepo.x86_64

The lvm versions on the host and inside the container are almost identical:

[host]# yum list installed | grep lvm
lvm2.x86_64                      7:2.02.187-6.el7_9.5       @updates
lvm2-libs.x86_64                 7:2.02.187-6.el7_9.5       @updates

[con]# yum list installed | grep lvm
lvm2.x86_64                                   8:2.03.14-5.el8                      @baseos
lvm2-libs.x86_64                              8:2.03.14-5.el8                      @baseos

We have >1000 OSDs and only the OSDs on these disks are causing trouble. The symptom is as if the disk suddenly gets stuck and does not accept IO any more. Trying to kill the hanging OSD daemons puts them in D-state.
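If it happens again, it may be worth capturing where exactly those D-state processes are stuck in the kernel before rebooting, e.g. (<PID> being one of the hung ceph-osd processes):

# ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'              <- processes in uninterruptible sleep and what they wait on
# cat /proc/<PID>/stack                                       <- kernel stack of a hung ceph-osd process
# dmesg -T | grep -i 'blocked for more than'                  <- hung-task warnings from the kernel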

The very odd thing is that ceph did not recognise all 4 down OSDs correctly. 1 out of 4 OSDs crashed (see log below) and the 3 other OSD daemons got stuck. These 3 stuck daemons were marked as down. However, the one that crashed was *not* marked as down even though it was dead for good (its process was no longer shown by ps; the other 3 were). This caused IO to hang and I don't understand how it is possible that this OSD was not recognised as down. There must be plenty of reporters. I see a few messages like this (osd.975 is the crashed one):

Oct  8 16:08:54 ceph-13 ceph-osd: 2022-10-08T16:08:54.913+0200 7f942817b700 -1 osd.990 912445 heartbeat_check: no reply from 192.168.32.88:7079 osd.975 since back 2022-10-08T16:08:34.029625+0200 front 2022-10-08T16:08:34.029288+0200 (oldest deadline 2022-10-08T16:08:54.528209+0200)
[...]
Oct  8 16:08:56 ceph-08 journal: 2022-10-08T16:08:56.195+0200 7fb85ce4d700 -1 osd.352 912445 heartbeat_check: no reply from 192.168.32.88:7079 osd.975 since back 2022-10-08T16:08:31.763519+0200 front 2022-10-08T16:08:31.764077+0200 (oldest deadline 2022-10-08T16:08:55.861407+0200)

But nothing happened. Here is some OSD log info:

This is where everything starts:
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **

2022-10-08T16:08:34.439+0200 7fbdf567a700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had timed out after 15
2022-10-08T16:08:34.440+0200 7fbdf4e79700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had timed out after 15
[... loads and loads of these ...]
2022-10-08T16:10:51.065+0200 7fbdf4678700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fbdd1dc3700' had suicide timed out after 150
2022-10-08T16:10:52.072+0200 7fbdf4678700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::coarse_mono_clock::rep)' thread 7fbdf4678700 time 2022-10-08T16:10:52.065768+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/HeartbeatMap.cc: 80: ceph_abort_msg("hit suicide timeout")

  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x556b9b10cb32]
  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x295) [0x556b9b82c795]
  3: (ceph::HeartbeatMap::is_healthy()+0x112) [0x556b9b82d292]
  4: (OSD::handle_osd_ping(MOSDPing*)+0xc2f) [0x556b9b1e253f]
  5: (OSD::heartbeat_dispatch(Message*)+0x1db) [0x556b9b1e44eb]
  6: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x155) [0x556b9bb83aa5]
  7: (ProtocolV2::handle_message()+0x142a) [0x556b9bbb941a]
  8: (ProtocolV2::handle_read_frame_dispatch()+0x258) [0x556b9bbcb418]
  9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x556b9bbcb515]
  10: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x92) [0x556b9bbcc912]
  11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x556b9bbb480c]
  12: (AsyncConnection::process()+0x8a9) [0x556b9bb8b6c9]
  13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x556b9b9e22c7]
  14: (()+0xde78ac) [0x556b9b9e78ac]
  15: (()+0xc2ba3) [0x7fbdf84c8ba3]
  16: (()+0x81ca) [0x7fbdf8e751ca]
  17: (clone()+0x43) [0x7fbdf7adfdd3]

2022-10-08T16:10:52.078+0200 7fbdf4678700 -1 *** Caught signal (Aborted) **
  in thread 7fbdf4678700 thread_name:msgr-worker-2

  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
  1: (()+0x12ce0) [0x7fbdf8e7fce0]
  2: (gsignal()+0x10f) [0x7fbdf7af4a9f]
  3: (abort()+0x127) [0x7fbdf7ac7e05]
  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x556b9b10cc03]
  5: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x295) [0x556b9b82c795]
  6: (ceph::HeartbeatMap::is_healthy()+0x112) [0x556b9b82d292]
  7: (OSD::handle_osd_ping(MOSDPing*)+0xc2f) [0x556b9b1e253f]
  8: (OSD::heartbeat_dispatch(Message*)+0x1db) [0x556b9b1e44eb]
  9: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x155) [0x556b9bb83aa5]
  10: (ProtocolV2::handle_message()+0x142a) [0x556b9bbb941a]
  11: (ProtocolV2::handle_read_frame_dispatch()+0x258) [0x556b9bbcb418]
  12: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x556b9bbcb515]
  13: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x92) [0x556b9bbcc912]
  14: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x556b9bbb480c]
  15: (AsyncConnection::process()+0x8a9) [0x556b9bb8b6c9]
  16: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x556b9b9e22c7]
  17: (()+0xde78ac) [0x556b9b9e78ac]
  18: (()+0xc2ba3) [0x7fbdf84c8ba3]
  19: (()+0x81ca) [0x7fbdf8e751ca]
  20: (clone()+0x43) [0x7fbdf7adfdd3]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
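For reference, the 15 s and 150 s thresholds in the log above are the stock osd_op_thread_timeout and osd_op_thread_suicide_timeout values; to double-check what the cluster is actually running with (assuming the options live in the mon config database and not only in a local ceph.conf):

# ceph config get osd osd_op_thread_timeout
# ceph config get osd osd_op_thread_suicide_timeout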

What I'm most interested in right now is whether anyone has an idea what the underlying issue behind these disks freezing might be, and why the crashed OSD was not recognised as down. Any hints on what to check if it happens again are also welcome.

Many thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



