Hi,

we ran into a bigger problem today with our Ceph cluster (Quincy, Alma 8.9). We have 4 filesystems and a total of 6 MDSs, the largest fs having two ranks assigned (i.e. one MDS left as standby). Since the MDSs often lag behind, we restart them occasionally; usually that helps, with the standby taking over. Today, however, the restart didn't work and the rank 1 MDS started to crash for unclear reasons. Rank 0 seemed ok.

At some point we decided to go back to one rank by setting max_mds to 1. Due to the permanent crashing, rank 1 didn't stop, however, and eventually we marked it failed and set the fs to not joinable. At that point it looked like this:

fs_cluster - 716 clients
==========
RANK  STATE    MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active   cephmd6a  Reqs: 0 /s  13.1M  13.1M  1419k  79.2k
 1    failed
      POOL         TYPE      USED   AVAIL
fs_cluster_meta   metadata  1791G  54.2T
fs_cluster_data   data       421T  54.2T

with rank 1 still being listed. The next attempt was to remove that failed rank:

ceph mds rmfailed fs_cluster:1 --yes-i-really-mean-it

which, after a short while, brought down three out of our five MONs. They keep crashing shortly after restart with stack traces like this:

ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7ff8813adcf0]
 2: gsignal()
 3: abort()
 4: /lib64/libstdc++.so.6(+0x9009b) [0x7ff8809bf09b]
 5: /lib64/libstdc++.so.6(+0x9654c) [0x7ff8809c554c]
 6: /lib64/libstdc++.so.6(+0x965a7) [0x7ff8809c55a7]
 7: /lib64/libstdc++.so.6(+0x96808) [0x7ff8809c5808]
 8: /lib64/libstdc++.so.6(+0x92045) [0x7ff8809c1045]
 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xa9e) [0x55f05d9a5e8e]
 10: (MDSMonitor::tick()+0x18a) [0x55f05d9b18da]
 11: (MDSMonitor::on_active()+0x2c) [0x55f05d99a17c]
 12: (Context::complete(int)+0xd) [0x55f05d76c56d]
 13: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0x9d) [0x55f05d799d7d]
 14: (Paxos::finish_round()+0x74) [0x55f05d8c5c24]
 15: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x41b) [0x55f05d8c7e5b]
 16: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x123e) [0x55f05d76a2ae]
 17: (Monitor::_ms_dispatch(Message*)+0x406) [0x55f05d76a976]
 18: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5d) [0x55f05d79b3ed]
 19: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7ff88367fed8]
 20: (DispatchQueue::entry()+0x50f) [0x7ff88367d31f]
 21: (DispatchQueue::DispatchThread::entry()+0x11) [0x7ff883747381]
 22: /lib64/libpthread.so.0(+0x81ca) [0x7ff8813a31ca]
 23: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

MDSMonitor::maybe_resize_cluster in the trace suggests a connection to the mds rmfailed operation above. Does anyone have an idea how to get this cluster back together again, e.g. by manually fixing the MDS ranks?
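In case it helps with the diagnosis: the sequence we ran before the rmfailed was roughly the following (written down from memory, so the exact invocations may differ slightly):

ceph fs set fs_cluster max_mds 1        # go back to a single active rank
ceph mds fail fs_cluster:1              # rank 1 kept crashing, so we marked it failed
ceph fs set fs_cluster joinable false   # prevent MDS daemons from joining the fs

followed by the 'ceph mds rmfailed' shown above, which is when the MONs started going down.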
'fs get' can still be called in the short moments when enough MONs are reachable:

fs_name fs_cluster
epoch   3065486
flags   13 allow_snaps allow_multimds_snaps
created 2022-08-26T15:55:07.186477+0200
modified        2024-05-28T12:21:59.294364+0200
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   4398046511104
required_client_features       {}
last_failure    0
last_failure_osd_epoch  1777109
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0,1
up      {0=911794623}
failed
damaged
stopped 2,3
data_pools      [32]
metadata_pool   33
inline_data     disabled
balancer
standby_count_wanted    1
[mds.cephmd6a{0:911794623} state up:active seq 22777 addr [v2:10.13.5.6:6800/189084355,v1:10.13.5.6:6801/189084355] compat {c=[1],r=[1],i=[7ff]}]

Regards,
Noe