Hi Dan,

thanks a ton! Now I feel really stupid. I'm "a bit" under stress, so I forgot our ceph tuned profile. Thanks for reminding me, and even more for providing these pointers even though I should know better on my own.

Why is the message about "open file descriptions limit reached sd = " not in the log (it should probably read "descriptors")? I don't see the usual clog(level) construct here. That message would have spared you writing two e-mails.

I have set up perf and unwindpmp and am ready to collect info about the MDS getting stuck. And yes, I misunderstood your message: you want to know what the MDS is doing right when it hangs. It does have CPU load at that time and beyond, so it should be visible in the backtraces.

Thanks again, and hopefully something more interesting tomorrow!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Tuesday, January 14, 2025 6:41 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: MDS crashing on startup

Hi Frank,

That abort looks like this:

    } else if (r == -EMFILE || r == -ENFILE) {
      lderr(msgr->cct) << __func__ << " open file descriptions limit reached sd = " << listen_socket.fd()
                       << " errno " << r << " " << cpp_strerror(r) << dendl;
      if (++accept_error_num > msgr->cct->_conf->ms_max_accept_failures) {
        lderr(msgr->cct) << "Proccessor accept has encountered enough error numbers, just do ceph_abort()." << dendl;
        ceph_abort();
      }

I think it's related to ulimits on your bare metal mds.

Once you get this bare metal MDS running well, I think the next step to understand this issue is debug_mds, at least level 10, from when the issue starts.

Cheers,
Dan
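For reference, a rough sketch of the shell side of both points above: checking which descriptor limit the bare-metal MDS actually runs with, and capturing call graphs with perf while the daemon is busy but unresponsive. The limit value and sampling window are only illustrative, and the commands assume a single ceph-mds process started from an interactive shell rather than systemd.

  # Confirm the limit the running daemon inherited (the field is "Max open files"):
  cat /proc/$(pidof ceph-mds)/limits | grep -i 'open files'

  # For a manual start from a shell, raise the soft limit before launching the daemon;
  # 65536 is an example value, and the ceph-mds arguments depend on how it is actually started:
  ulimit -n 65536
  ceph-mds -f --cluster ceph -i <mds-id> --setuser ceph --setgroup ceph
  # (If the daemon is managed by systemd instead, LimitNOFILE= in a drop-in override serves the same purpose.)

  # While the MDS is hung but consuming CPU, sample stacks for 60 seconds and dump a report:
  perf record -g -p $(pidof ceph-mds) -- sleep 60
  perf report --stdio > mds-hang.txt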
On Tue, Jan 14, 2025 at 6:31 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan, hi all,
>
> this is related to the thread "Help needed, ceph fs down due to large stray dir". We deployed a bare metal host for debugging ceph daemon issues, here to run "perf top" and find out where our MDS becomes unresponsive. Unfortunately, we encounter a strange issue:
>
> The bare-metal MDS crashes very quickly during the initial reconnect phase:
>
>   -61> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>   -60> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425250501:594886 lookup #0x3001059de1e/02JanParetoResults_n5_a7_m3 2025-01-13T15:25:32.427929-0500 RETRY=6 caller_uid=315104, caller_gid=315104{}) v2
>   -59> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>   -58> 2025-01-14T08:59:47.202-0500 7f6771d20700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread 7f6771d20700 time 2025-01-14T08:59:47.200795-0500
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")
>
>  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f6776f6e904]
>  2: (Processor::accept()+0x862) [0x7f6777261502]
>  3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
>  4: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>  5: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
>  6: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
>  7: clone()
>
>   -57> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425250501 v1:192.168.57.60:0/1003106369 after 1.01409
> ...
>    -3> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425612912:251430 lookup #0x10000f5568d/util-linux 2025-01-11T17:43:34.212128-0500 RETRY=8 caller_uid=298337, caller_gid=298337{}) v2
>    -2> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
>    -1> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425612912 v1:192.168.58.11:0/4294630612 after 1.01409
>     0> 2025-01-14T08:59:47.203-0500 7f6771d20700 -1 *** Caught signal (Aborted) **
>  in thread 7f6771d20700 thread_name:msgr-worker-0
>
>  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>  1: /lib64/libpthread.so.0(+0x12d10) [0x7f6775f55d10]
>  2: gsignal()
>  3: abort()
>  4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f6776f6e9d5]
>  5: (Processor::accept()+0x862) [0x7f6777261502]
>  6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
>  7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>  8: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
>  9: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
>  10: clone()
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> All other log messages look normal.
> The strange thing is that the bare metal and the containerized MDSes are the same version, yet the containerized daemon does *not* crash. Versions are:
>
> bare-metal# ceph-mds --version
> ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>
> container# ceph-mds --version
> ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
>
> Both binaries have the same md5 sum as well. The only possibly relevant difference might be the kernel version:
>
> bare-metal: 4.18.0-553.34.1.el8_10.x86_64
> containerized: 5.14.13-1.el7.elrepo.x86_64
>
> I also installed all sorts of debuginfo packages. Still, this symbol is not resolved:
>
>  7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
>
> Which package is it in? I did install ceph-common-debuginfo-16.2.15-0.el8.x86_64.rpm.
>
> For installing the MDS we followed the instructions for manual start here: https://docs.ceph.com/en/pacific/install/manual-deployment/#adding-mds .
>
> Thanks for any pointers!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

--
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx
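On the unresolved frame 7 in the quoted backtraces: once the matching debuginfo is installed, the module+offset form can usually be translated straight from the shared object. A sketch, using the offset printed in the backtrace (the output naturally depends on the exact build, and eu-addr2line from elfutils behaves similarly if the binutils tool does not find the separate debug file):

  # Map the module+offset frame to a source location; -f prints the function name,
  # -C demangles C++ symbols, -i expands inlined frames:
  addr2line -e /usr/lib64/ceph/libceph-common.so.2 -f -C -i 0x5c90bc

  # Alternative: let gdb report the nearest symbol for the same offset:
  gdb -batch -ex 'info symbol 0x5c90bc' /usr/lib64/ceph/libceph-common.so.2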