OSDs crash and restart in Octopus

Hi everyone,

We have a problem with Octopus 15.2.12: OSDs randomly crash and restart
with the following traceback in the log.

    -8> 2021-08-20T15:01:03.165+0430 7f2d10fd7700 10 monclient:
handle_auth_request added challenge on 0x55a3fc654400
    -7> 2021-08-20T15:01:03.201+0430 7f2d02960700  2 osd.202 1145364
ms_handle_reset con 0x55a548087000 session 0x55a4be8a4940
    -6> 2021-08-20T15:01:03.209+0430 7f2d02960700  2 osd.202 1145364
ms_handle_reset con 0x55a52aab2800 session 0x55a4497dd0c0
    -5> 2021-08-20T15:01:03.213+0430 7f2d02960700  2 osd.202 1145364
ms_handle_reset con 0x55a548084800 session 0x55a3fca0f860
    -4> 2021-08-20T15:01:03.217+0430 7f2d02960700  2 osd.202 1145364
ms_handle_reset con 0x55a3c5e50800 session 0x55a51c1b7680
    -3> 2021-08-20T15:01:03.217+0430 7f2d02960700  2 osd.202 1145364
ms_handle_reset con 0x55a3c5e52000 session 0x55a4055932a0
    -2> 2021-08-20T15:01:03.225+0430 7f2d02960700  2 osd.202 1145364
ms_handle_reset con 0x55a4b835f800 session 0x55a51c1b90c0
    -1> 2021-08-20T15:01:03.225+0430 7f2d107d6700 10 monclient:
handle_auth_request added challenge on 0x55a3c5e52000
     0> 2021-08-20T15:01:03.233+0430 7f2d0ffd5700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f2d0ffd5700 thread_name:msgr-worker-2

 ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus
(stable)
 1: (()+0x12980) [0x7f2d144b0980]
 2: (AsyncConnection::_stop()+0x9c) [0x55a37bf56cdc]
 3: (ProtocolV2::stop()+0x8b) [0x55a37bf8016b]
 4: (ProtocolV2::_fault()+0x6b) [0x55a37bf8030b]
 5:
(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node,
ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x1d1) [0x55a37bf97d51]
 6: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x55a37bf80e64]
 7: (AsyncConnection::process()+0x5fc) [0x55a37bf59e0c]
 8: (EventCenter::process_events(unsigned int,
std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x7dd)
[0x55a37bda9a2d]
 9: (()+0x11d45a8) [0x55a37bdaf5a8]
 10: (()+0xbd6df) [0x7f2d13b886df]
 11: (()+0x76db) [0x7f2d144a56db]
 12: (clone()+0x3f) [0x7f2d1324571f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
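
In case it helps, here is a rough sketch of what we can run to pull more
detail on a crash like this (assuming the mgr crash module is enabled; the
crash ID and the ceph-osd binary path below are placeholders and may differ
per distro/package):

    # list crashes recorded by the mgr crash module, then dump the full
    # metadata and backtrace for one of them (crash ID is a placeholder)
    ceph crash ls
    ceph crash info <crash-id>

    # per the NOTE above, disassemble the OSD binary to interpret the frame
    # offsets (binary path assumed; debug symbols must be installed)
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis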

Our cluster has 220 HDDs and 200 SSDs. The HDD OSDs keep their DB on
separate NVMe devices, and the bucket indexes are also on separate SSDs.
Does anybody have any idea what the problem could be?