Re: MDS 17.2.7 crashes at rejoin

This is a known issue; please see https://tracker.ceph.com/issues/60986.

If you can reproduce it, please enable the MDS debug logs, which should help us debug it quickly:

debug_mds = 25

debug_ms = 1
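
One way to apply these two settings at runtime is through the centralized config store with `ceph config set`. The following is only a sketch: the helper names `set_mds_debug` and `reset_mds_debug` and the `CEPH` override variable are hypothetical conveniences, not part of Ceph, and the `mds` target is assumed so that the values apply to all MDS daemons.

```shell
# Hypothetical helpers; only "ceph config set" / "ceph config rm" are real CLI calls.
# Set CEPH=echo to print the commands instead of running them (dry run).

set_mds_debug() {
    # $1 = debug_mds level, $2 = debug_ms level; applies to all MDS daemons.
    "${CEPH:-ceph}" config set mds debug_mds "$1"
    "${CEPH:-ceph}" config set mds debug_ms "$2"
}

reset_mds_debug() {
    # Remove the overrides again so the daemons fall back to their defaults.
    "${CEPH:-ceph}" config rm mds debug_mds
    "${CEPH:-ceph}" config rm mds debug_ms
}
```

Running `set_mds_debug 25 1` before reproducing the crash and `reset_mds_debug` afterwards keeps the very verbose logging limited to the time it is needed.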

Thanks

- Xiubo



On 5/7/24 00:26, Robert Sander wrote:
Hi,

A 17.2.7 cluster with two filesystems suddenly has non-working MDSs:

# ceph -s
  cluster:
    id:     f54eea86-265a-11eb-a5d0-457857ba5742
    health: HEALTH_ERR
            22 failed cephadm daemon(s)
            2 filesystems are degraded
            1 mds daemon damaged
            insufficient standby MDS daemons available

  services:
    mon: 5 daemons, quorum ceph00,ceph03,ceph04,ceph01,ceph02 (age 4h)
    mgr: ceph03.odfupq(active, since 4h), standbys: ppc721.vsincn, ceph00.lvbddp, ceph02.zhyxjg, ceph06.eifppc
    mds: 4/5 daemons up
    osd: 145 osds: 145 up (since 20h), 145 in (since 2d)
    rgw: 12 daemons active (4 hosts, 1 zones)

  data:
    volumes: 0/2 healthy, 2 recovering; 1 damaged
    pools:   15 pools, 4897 pgs
    objects: 195.64M objects, 195 TiB
    usage:   617 TiB used, 527 TiB / 1.1 PiB avail
    pgs:     4892 active+clean
             5    active+clean+scrubbing+deep

  io:
    client:   2.5 MiB/s rd, 20 MiB/s wr, 665 op/s rd, 938 op/s wr

# ceph fs status
ABC - 4 clients
===========
RANK   STATE          MDS          ACTIVITY   DNS   INOS   DIRS   CAPS
 0     failed
 1    resolve  ABC.ceph04.lzlkdu                0      3      1      0
 2    resolve  ABC.ppc721.rzfmyi                0      3      1      0
 3    resolve  ABC.ceph04.jiepaw              249    252     13      0
          POOL             TYPE     USED  AVAIL
cephfs.ABC.meta  metadata  33.0G   104T
cephfs.ABC.data    data     390T   104T
DEF - 154 clients
===========
RANK      STATE              MDS          ACTIVITY   DNS    INOS   DIRS   CAPS
 0    rejoin(laggy)  DEF.ceph06.etthum             30.9k  30.8k   5084      0
          POOL             TYPE     USED  AVAIL
cephfs.DEF.meta  metadata   190G   104T
cephfs.DEF.data    data     118T   104T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)


The first filesystem will not get an MDS in rank 0;
we already tried setting max_mds to 1, but to no avail.

The second filesystem's MDS shows "replay" for a while and then
it crashes in the rejoin phase with:

  -92> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 handle_mds_map i am now mds.0.501522
  -91> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 handle_mds_map state change up:reconnect --> up:rejoin
  -90> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 rejoin_start
  -89> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 rejoin_joint_start
  -88> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfece err -22/0
  -87> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
  -86> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
  -85> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -84> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
  -83> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
  -82> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -81> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ebd err -22/-22
  -80> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ecd err -22/-22
  -79> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9ea err -22/-22
  -78> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
  -77> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9c3 err -22/-22
  -76> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc978 err -22/-22
  -75> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
  -74> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
  -73> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
  -72> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
  -71> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7e err -22/-22
  -70> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
  -69> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
  -68> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
  -67> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -66> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ebd err -22/-22
  -65> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ecd err -22/-22
  -64> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9ea err -22/-22
  -63> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc978 err -22/-22
  -62> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x3000069373a err -22/-22
  -61> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012dc5d8 err -22/-22
  -60> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a32e8 err -22/-22
  -59> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000696952 err -22/-22
  -58> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -57> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
  -56> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
  -55> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
  -54> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7e err -22/-22
  -53> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
  -52> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012c3a0e err -22/-22
  -51> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -50> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
  -49> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7f err -22/-22
  -48> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
  -47> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
  -46> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
  -45> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
  -44> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac2 err -22/-22
  -43> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -42> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
  -41> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
  -40> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
  -39> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012c3a0e err -22/-22
  -38> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
  -37> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x40000d63bff err -22/-22
  -36> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -35> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
  -34> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
  -33> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac2 err -22/-22
  -32> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -31> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac4 err -22/-22
  -30> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
  -29> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
  -28> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
  -27> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
  -26> 2024-05-06T16:07:15.530+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -25> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
  -24> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
  -23> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
  -22> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
  -21> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a5fc4 err -22/-22
  -20> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a634d err -22/-22
  -19> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a63bf err -22/-22
  -18> 2024-05-06T16:07:15.542+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
  -17> 2024-05-06T16:07:15.542+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
  -16> 2024-05-06T16:07:15.546+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
  -15> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
  -14> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
  -13> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a5fc4 err -22/-22
  -12> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a634d err -22/-22
  -11> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a63bf err -22/-22
  -10> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bfd9c err -22/-22
   -9> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bfb78 err -22/-22
   -8> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
   -7> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
   -6> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
   -5> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
   -4> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
   -3> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
   -2> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x40000d5a226 err -22/-22
   -1> 2024-05-06T16:07:15.634+0000 7f1921e91700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()' thread 7f1921e91700 time 2024-05-06T16:07:15.635683+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)
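
Many of the "failed to open ino" messages above repeat the same inode numbers, so a per-inode count can make the log easier to read. A small sketch, assuming the MDS log has one entry per line as the daemon writes it; the function name `count_failed_inos` is a hypothetical helper, not a Ceph tool.

```shell
# Hypothetical helper: count how often each inode appears in
# "failed to open ino" messages. Reads an MDS log on stdin and prints
# "<count> <ino>" lines, most frequent first (ties ordered by inode).
count_failed_inos() {
    grep -o 'failed to open ino 0x[0-9a-f]*' \
        | awk '{ print $5 }' \
        | sort | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '{ print $1, $2 }'
}
```

It can be fed from a saved log, e.g. `count_failed_inos < mds.log`, where `mds.log` stands for wherever the MDS log was saved.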

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f1930ad94a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f1930ad9669]
 3: (MDCache::rejoin_send_rejoins()+0x216b) [0x5614ac8747eb]
 4: (MDCache::process_imported_caps()+0x1993) [0x5614ac872353]
 5: (Context::complete(int)+0xd) [0x5614ac6e182d]
 6: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 7: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
 8: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x156) [0x5614aca765a6]
 9: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 10: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
 11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0x138) [0x5614ac867168]
 12: (MDCache::_open_ino_backtrace_fetched(inodeno_t, ceph::buffer::v15_2_0::list&, int)+0x290) [0x5614ac87ff90]
 13: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 14: (MDSIOContextBase::complete(int)+0x534) [0x5614aca426e4]
 15: (Finisher::finisher_thread_entry()+0x18d) [0x7f1930b7884d]
 16: /lib64/libpthread.so.0(+0x81ca) [0x7f192fac81ca]
 17: clone()

How do we solve this issue?

Regards



