Hi Frank,

That abort looks like this:

    } else if (r == -EMFILE || r == -ENFILE) {
      lderr(msgr->cct) << __func__ << " open file descriptions limit reached sd = "
                       << listen_socket.fd() << " errno " << r << " "
                       << cpp_strerror(r) << dendl;
      if (++accept_error_num > msgr->cct->_conf->ms_max_accept_failures) {
        lderr(msgr->cct) << "Proccessor accept has encountered enough error numbers, just do ceph_abort()." << dendl;
        ceph_abort();
      }

I think it's related to ulimits on your bare-metal MDS.

Once you get this bare-metal MDS running well, I think the next step to understand this issue is a debug_mds log, at least level 10, from when the issue starts.

Cheers, Dan

On Tue, Jan 14, 2025 at 6:31 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan, hi all,
>
> this is related to the thread "Help needed, ceph fs down due to large stray dir". We deployed a bare-metal host for debugging ceph daemon issues, here, to run "perf top" to find out where our MDS becomes unresponsive. Unfortunately, we encounter a strange issue:
>
> The bare-metal MDS crashes very quickly during the initial reconnect phase:
>
>    -61> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>    -60> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425250501:594886 lookup #0x3001059de1e/02JanParetoResults_n5_a7_m3 2025-01-13T15:25:32.427929-0500 RETRY=6 caller_uid=315104, caller_gid=315104{}) v2
>    -59> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>    -58> 2025-01-14T08:59:47.202-0500 7f6771d20700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread 7f6771d20700 time 2025-01-14T08:59:47.200795-0500
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")
>
>  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f6776f6e904]
>  2: (Processor::accept()+0x862) [0x7f6777261502]
>  3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
>  4: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>  5: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
>  6: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
>  7: clone()
>
>    -57> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425250501 v1:192.168.57.60:0/1003106369 after 1.01409
> ...
>     -3> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425612912:251430 lookup #0x10000f5568d/util-linux 2025-01-11T17:43:34.212128-0500 RETRY=8 caller_uid=298337, caller_gid=298337{}) v2
>     -2> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>     -1> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425612912 v1:192.168.58.11:0/4294630612 after 1.01409
>      0> 2025-01-14T08:59:47.203-0500 7f6771d20700 -1 *** Caught signal (Aborted) **
>  in thread 7f6771d20700 thread_name:msgr-worker-0
>
>  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>  1: /lib64/libpthread.so.0(+0x12d10) [0x7f6775f55d10]
>  2: gsignal()
>  3: abort()
>  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f6776f6e9d5]
>  5: (Processor::accept()+0x862) [0x7f6777261502]
>  6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
>  7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>  8: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
>  9: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
>  10: clone()
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> All other log messages look normal. The strange thing is that the bare-metal and the containerized MDSes are the same version, yet the containerized daemon does *not* crash. Versions are:
>
> bare-metal# ceph-mds --version
> ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>
> container# ceph-mds --version
> ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>
> Both binaries have the same md5 sum as well. The only possibly relevant difference might be the kernel version:
>
> bare-metal: 4.18.0-553.34.1.el8_10.x86_64
> containerized: 5.14.13-1.el7.elrepo.x86_64
>
> I also installed all sorts of debuginfo packages. Still, this symbol is not resolved:
>
> 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>
> Which package is it in? I did install ceph-common-debuginfo-16.2.15-0.el8.x86_64.rpm.
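[A raw offset like the `+0x5c90bc` in the unresolved frame above can usually be resolved by hand once matching debuginfo is installed. A minimal sketch; the library path is taken from the backtrace, and it assumes the installed debuginfo actually matches this build:]

```shell
# Resolve the unsymbolized frame from the backtrace:
#   7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
# addr2line (binutils): -f prints the function name, -C demangles it
addr2line -f -C -e /usr/lib64/ceph/libceph-common.so.2 0x5c90bc

# eu-addr2line (elfutils) searches separate .debug files automatically,
# which often works better with RPM-style debuginfo packages
eu-addr2line -f -e /usr/lib64/ceph/libceph-common.so.2 0x5c90bc
```

If both still print `??`, the debuginfo on disk likely does not match the running binary's build ID.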
>
> For installing the MDS we followed the instructions for manual start here: https://docs.ceph.com/en/pacific/install/manual-deployment/#adding-mds .
>
> Thanks for any pointers!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

--
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
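[Dan's two suggestions at the top of the thread, checking the ulimits the bare-metal MDS actually runs with and capturing a debug_mds level 10 log, can be sketched as shell commands. This is a sketch, not from the thread: the systemd unit name and the LimitNOFILE value are assumptions about the deployment.]

```shell
# Check the open-files limit a process actually runs with via /proc.
# Demonstrated here on the current shell; for the MDS, substitute its
# PID, e.g. pid=$(pgrep -o ceph-mds).
grep 'Max open files' /proc/self/limits
ulimit -n   # soft limit inherited by children of this shell

# To raise the limit persistently for a systemd-managed bare-metal MDS
# (unit name is an assumption; adjust to the deployment):
#   systemctl edit ceph-mds@<id>
#     [Service]
#     LimitNOFILE=1048576

# To capture the debug log from when the issue starts:
#   ceph config set mds debug_mds 10
# or only on the affected daemon:
#   ceph tell mds.<id> config set debug_mds 10
```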