Hi Frank,

That abort looks like this:

    } else if (r == -EMFILE || r == -ENFILE) {
      lderr(msgr->cct) << __func__ << " open file descriptions limit reached sd = "
                       << listen_socket.fd() << " errno " << r << " "
                       << cpp_strerror(r) << dendl;
      if (++accept_error_num > msgr->cct->_conf->ms_max_accept_failures) {
        lderr(msgr->cct) << "Proccessor accept has encountered enough error numbers, just do ceph_abort()." << dendl;
        ceph_abort();
      }

I think it's related to ulimits on your bare-metal MDS.

Once you get this bare-metal MDS running well, I think the next step to understand this issue is a debug_mds log, at least level 10, from when the issue starts.

Cheers, Dan

On Tue, Jan 14, 2025 at 6:31 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan, hi all,
>
> this is related to the thread "Help needed, ceph fs down due to large stray dir". We deployed a bare-metal host for debugging ceph daemon issues, here, to run "perf top" to find out where our MDS becomes unresponsive. Unfortunately, we encounter a strange issue:
>
> The bare-metal MDS crashes very quickly during the initial reconnect phase:
>
>    -61> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>    -60> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425250501:594886 lookup #0x3001059de1e/02JanParetoResults_n5_a7_m3 2025-01-13T15:25:32.427929-0500 RETRY=6 caller_uid=315104, caller_gid=315104{}) v2
>    -59> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>    -58> 2025-01-14T08:59:47.202-0500 7f6771d20700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread 7f6771d20700 time 2025-01-14T08:59:47.200795-0500
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")
>
>  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f6776f6e904]
>  2: (Processor::accept()+0x862) [0x7f6777261502]
>  3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
>  4: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>  5: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
>  6: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
>  7: clone()
>
>    -57> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425250501 v1:192.168.57.60:0/1003106369 after 1.01409
> ...
>     -3> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425612912:251430 lookup #0x10000f5568d/util-linux 2025-01-11T17:43:34.212128-0500 RETRY=8 caller_uid=298337, caller_gid=298337{}) v2
>     -2> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>     -1> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425612912 v1:192.168.58.11:0/4294630612 after 1.01409
>      0> 2025-01-14T08:59:47.203-0500 7f6771d20700 -1 *** Caught signal (Aborted) **
>  in thread 7f6771d20700 thread_name:msgr-worker-0
>
>  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>  1: /lib64/libpthread.so.0(+0x12d10) [0x7f6775f55d10]
>  2: gsignal()
>  3: abort()
>  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f6776f6e9d5]
>  5: (Processor::accept()+0x862) [0x7f6777261502]
>  6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
>  7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>  8: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
>  9: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
>  10: clone()
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> All other log messages look normal. The strange thing is that the bare-metal and the containerized MDSes are the same version, yet the containerized daemon does *not* crash. Versions are:
>
> bare-metal# ceph-mds --version
> ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>
> container# ceph-mds --version
> ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>
> Both binaries have the same md5 sum as well. The only possibly relevant difference might be the kernel version:
>
> bare-metal: 4.18.0-553.34.1.el8_10.x86_64
> containerized: 5.14.13-1.el7.elrepo.x86_64
>
> I also installed all sorts of debuginfo packages. Still, this symbol is not resolved:
>
> 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>
> Which package is it in? I did install ceph-common-debuginfo-16.2.15-0.el8.x86_64.rpm.
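[A raw offset like the `+0x5c90bc` in the unresolved frame above can usually be resolved by hand once matching debuginfo is installed. A minimal sketch; the library path is taken from the backtrace, and it assumes the installed debuginfo actually matches this build:]

```shell
# Resolve the unsymbolized frame from the backtrace:
#   7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
# addr2line (binutils): -f prints the function name, -C demangles it
addr2line -f -C -e /usr/lib64/ceph/libceph-common.so.2 0x5c90bc

# eu-addr2line (elfutils) searches separate .debug files automatically,
# which often works better with RPM-style debuginfo packages
eu-addr2line -f -e /usr/lib64/ceph/libceph-common.so.2 0x5c90bc
```

If both still print `??`, the debuginfo on disk likely does not match the running binary's build ID.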
>
> For installing the MDS we followed the instructions for manual start here: https://docs.ceph.com/en/pacific/install/manual-deployment/#adding-mds .
>
> Thanks for any pointers!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

--
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
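[Dan's two suggestions at the top of the thread, checking the ulimits the bare-metal MDS actually runs with and capturing a debug_mds level 10 log, can be sketched as shell commands. This is a sketch, not from the thread: the systemd unit name and the LimitNOFILE value are assumptions about the deployment.]

```shell
# Check the open-files limit a process actually runs with via /proc.
# Demonstrated here on the current shell; for the MDS, substitute its
# PID, e.g. pid=$(pgrep -o ceph-mds).
grep 'Max open files' /proc/self/limits
ulimit -n   # soft limit inherited by children of this shell

# To raise the limit persistently for a systemd-managed bare-metal MDS
# (unit name is an assumption; adjust to the deployment):
#   systemctl edit ceph-mds@<id>
#     [Service]
#     LimitNOFILE=1048576

# To capture the debug log from when the issue starts:
#   ceph config set mds debug_mds 10
# or only on the affected daemon:
#   ceph tell mds.<id> config set debug_mds 10
```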