MDS crashing on startup

Hi Dan, hi all,

this is related to the thread "Help needed, ceph fs down due to large stray dir". We deployed a bare-metal host for debugging Ceph daemon issues, in this case to run "perf top" and find out where our MDS becomes unresponsive. Unfortunately, we ran into a strange issue:

The bare-metal MDS crashes very quickly during the initial reconnect phase:

   -61> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
   -60> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425250501:594886 lookup #0x3001059de1e/02JanParetoResults_n5_a7_m3 2025-01-13T15:25:32.427929-0500 RETRY=6 caller_uid=315104, caller_gid=315104{}) v2
   -59> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
   -58> 2025-01-14T08:59:47.202-0500 7f6771d20700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread 7f6771d20700 time 2025-01-14T08:59:47.200795-0500
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")

 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f6776f6e904]
 2: (Processor::accept()+0x862) [0x7f6777261502]
 3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
 4: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
 5: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
 6: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
 7: clone()

   -57> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425250501 v1:192.168.57.60:0/1003106369 after 1.01409
...
    -3> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425612912:251430 lookup #0x10000f5568d/util-linux 2025-01-11T17:43:34.212128-0500 RETRY=8 caller_uid=298337, caller_gid=298337{}) v2
    -2> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
    -1> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425612912 v1:192.168.58.11:0/4294630612 after 1.01409
     0> 2025-01-14T08:59:47.203-0500 7f6771d20700 -1 *** Caught signal (Aborted) **
 in thread 7f6771d20700 thread_name:msgr-worker-0

 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12d10) [0x7f6775f55d10]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f6776f6e9d5]
 5: (Processor::accept()+0x862) [0x7f6777261502]
 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
 8: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
 9: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

All other log messages look normal. The strange thing is that the bare-metal and the containerized MDSes are the same version, yet the containerized daemon does *not* crash. Versions are:

bare-metal# ceph-mds --version
ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)

container# ceph-mds --version
ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)

Both binaries have the same md5 sum as well. The only possibly relevant difference might be the kernel version:

bare-metal: 4.18.0-553.34.1.el8_10.x86_64
containerized: 5.14.13-1.el7.elrepo.x86_64
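
The checksum comparison mentioned above can be repeated for more than the daemon binary. Below is a minimal Python sketch, assuming it is run once on the bare-metal host and once inside the container so the two outputs can be compared by eye. The helper names `md5sum` and `checksum_report` are made up for illustration, and the path `/usr/bin/ceph-mds` is an assumption about where the binary is installed; the library path is taken from the backtrace.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_report(paths):
    """Map each path to its MD5 digest, or to None if the file cannot be read."""
    report = {}
    for path in paths:
        try:
            report[path] = md5sum(path)
        except OSError:
            report[path] = None
    return report


if __name__ == "__main__":
    # Assumed install locations; adjust to the actual host/container layout.
    candidates = [
        "/usr/bin/ceph-mds",
        "/usr/lib64/ceph/libceph-common.so.2",
    ]
    for path, digest in checksum_report(candidates).items():
        print(f"{digest or 'unreadable'}  {path}")
```

Including the shared library matters here only because the aborting frames sit in libceph-common.so.2 rather than in the ceph-mds binary itself.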

I also installed all sorts of debuginfo packages. Still, this symbol is not resolved:

 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]

Which package is it in? I did install ceph-common-debuginfo-16.2.15-0.el8.x86_64.rpm.
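
For reference, here is a minimal Python sketch of how the unresolved frame could be looked up. It splits the frame into library path and offset and prints two commands: an `rpm -qf` query for the package that owns the file, and an `addr2line` call for that offset. The helper names `parse_frame` and `lookup_commands` are made up for illustration. Whether `addr2line` picks up the separately installed debuginfo on this host is an assumption, not something verified here.

```python
import re
import shlex

# Matches backtrace frames of the form:
#   " 7: /path/to/lib.so(+0xOFFSET) [0xADDRESS]"
FRAME_RE = re.compile(
    r"^\s*\d+:\s+(?P<path>/\S+)\(\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]\s*$"
)


def parse_frame(line):
    """Return (library_path, offset) for an unresolved frame, else None."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("offset"), 16)


def lookup_commands(line):
    """Build shell commands to find the owning RPM and resolve the offset."""
    parsed = parse_frame(line)
    if parsed is None:
        return []
    path, offset = parsed
    quoted = shlex.quote(path)
    return [
        f"rpm -qf {quoted}",                         # which package owns the file
        f"addr2line -f -C -e {quoted} {offset:#x}",  # function name, demangled
    ]


if __name__ == "__main__":
    frame = " 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]"
    for command in lookup_commands(frame):
        print(command)
```

The script only prints the commands and does not run them, so it can be tried on any machine; the matching debuginfo package would then be the one named after whatever `rpm -qf` reports.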

To install the MDS, we followed the instructions for a manual start here: https://docs.ceph.com/en/pacific/install/manual-deployment/#adding-mds

Thanks for any pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14