Hi Dan, hi all,

this is related to the thread "Help needed, ceph fs down due to large stray dir". We deployed a bare-metal host for debugging ceph daemon issues, in this case to run "perf top" to find out where our MDS becomes unresponsive. Unfortunately, we encountered a strange issue: the bare-metal MDS crashes very quickly during the initial reconnect phase:

   -61> 2025-01-14T08:59:47.202-0500 7f676e519700 3 mds.2.server not active yet, waiting
   -60> 2025-01-14T08:59:47.202-0500 7f676e519700 5 mds.2.server dispatch request in up:reconnect: client_request(client.425250501:594886 lookup #0x3001059de1e/02JanParetoResults_n5_a7_m3 2025-01-13T15:25:32.427929-0500 RETRY=6 caller_uid=315104, caller_gid=315104{}) v2
   -59> 2025-01-14T08:59:47.202-0500 7f676e519700 3 mds.2.server not active yet, waiting
   -58> 2025-01-14T08:59:47.202-0500 7f6771d20700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread 7f6771d20700 time 2025-01-14T08:59:47.200795-0500
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")

 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f6776f6e904]
 2: (Processor::accept()+0x862) [0x7f6777261502]
 3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
 4: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
 5: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
 6: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
 7: clone()

   -57> 2025-01-14T08:59:47.202-0500 7f676e519700 0 log_channel(cluster) log [DBG] : reconnect by client.425250501 v1:192.168.57.60:0/1003106369 after 1.01409
...
    -3> 2025-01-14T08:59:47.202-0500 7f676e519700 5 mds.2.server dispatch request in up:reconnect: client_request(client.425612912:251430 lookup #0x10000f5568d/util-linux 2025-01-11T17:43:34.212128-0500 RETRY=8 caller_uid=298337, caller_gid=298337{}) v2
    -2> 2025-01-14T08:59:47.202-0500 7f676e519700 3 mds.2.server not active yet, waiting
    -1> 2025-01-14T08:59:47.202-0500 7f676e519700 0 log_channel(cluster) log [DBG] : reconnect by client.425612912 v1:192.168.58.11:0/4294630612 after 1.01409
     0> 2025-01-14T08:59:47.203-0500 7f6771d20700 -1 *** Caught signal (Aborted) **
 in thread 7f6771d20700 thread_name:msgr-worker-0

 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12d10) [0x7f6775f55d10]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f6776f6e9d5]
 5: (Processor::accept()+0x862) [0x7f6777261502]
 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
 8: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
 9: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

All other log messages look normal. The strange thing is that the bare-metal and the containerized MDSes are the same version, yet the containerized daemon does *not* crash.
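
As an aside on the unresolved frame in the backtrace above, /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc): the value in parentheses is an offset into the shared library, so in principle it could be handed to addr2line once matching debug symbols are available. The following is only a sketch, assuming binutils' addr2line is installed; the helper name frame_to_addr2line is hypothetical, and it merely prints the command instead of running it.

```shell
# Hypothetical helper (not part of ceph): turn a backtrace frame such as
#   /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
# into an addr2line command line. It only prints the command.
frame_to_addr2line() {
    frame=$1
    lib=${frame%%(*}     # path of the library: everything before the first "("
    off=${frame#*(+}     # drop everything up to and including "(+"
    off=${off%%)*}       # keep the offset: everything before the first ")"
    # -C demangles C++ names, -f prints function names,
    # -i shows inlined frames, -e selects the object file.
    printf 'addr2line -C -f -i -e %s %s\n' "$lib" "$off"
}

frame_to_addr2line '/usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]'
```

Whether the printed command resolves anything depends on the right debuginfo package being installed; `rpm -qf /usr/lib64/ceph/libceph-common.so.2` shows which package owns the library, which may help to pick the matching one.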

Versions are:

  bare-metal# ceph-mds --version
  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)

  container# ceph-mds --version
  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)

Both binaries have the same md5 sum as well. The only possibly relevant difference might be the kernel version:

  bare-metal: 4.18.0-553.34.1.el8_10.x86_64
  containerized: 5.14.13-1.el7.elrepo.x86_64

I also installed all sorts of debuginfo packages. Still, this symbol is not resolved:

  7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]

Which package is it in? I did install ceph-common-debuginfo-16.2.15-0.el8.x86_64.rpm.

For installing the MDS we followed the instructions for manual start here:
https://docs.ceph.com/en/pacific/install/manual-deployment/#adding-mds

Thanks for any pointers!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14