MDS 17.2.7 crashes at rejoin


Hi,

a 17.2.7 cluster with two filesystems suddenly has non-working MDSs:

# ceph -s
  cluster:
    id:     f54eea86-265a-11eb-a5d0-457857ba5742
    health: HEALTH_ERR
            22 failed cephadm daemon(s)
            2 filesystems are degraded
            1 mds daemon damaged
            insufficient standby MDS daemons available
  services:
    mon: 5 daemons, quorum ceph00,ceph03,ceph04,ceph01,ceph02 (age 4h)
    mgr: ceph03.odfupq(active, since 4h), standbys: ppc721.vsincn, ceph00.lvbddp, ceph02.zhyxjg, ceph06.eifppc
    mds: 4/5 daemons up
    osd: 145 osds: 145 up (since 20h), 145 in (since 2d)
    rgw: 12 daemons active (4 hosts, 1 zones)
  data:
    volumes: 0/2 healthy, 2 recovering; 1 damaged
    pools:   15 pools, 4897 pgs
    objects: 195.64M objects, 195 TiB
    usage:   617 TiB used, 527 TiB / 1.1 PiB avail
    pgs:     4892 active+clean
             5    active+clean+scrubbing+deep
  io:
    client:   2.5 MiB/s rd, 20 MiB/s wr, 665 op/s rd, 938 op/s wr

# ceph fs status
ABC - 4 clients
===========
RANK   STATE              MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0     failed
 1    resolve  ABC.ceph04.lzlkdu               0      3      1      0
 2    resolve  ABC.ppc721.rzfmyi               0      3      1      0
 3    resolve  ABC.ceph04.jiepaw             249    252     13      0
          POOL             TYPE     USED  AVAIL
cephfs.ABC.meta  metadata  33.0G   104T
cephfs.ABC.data    data     390T   104T
DEF - 154 clients
===========
RANK      STATE                 MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0    rejoin(laggy)  DEF.ceph06.etthum            30.9k  30.8k  5084      0
          POOL             TYPE     USED  AVAIL
cephfs.DEF.meta  metadata   190G   104T
cephfs.DEF.data    data     118T   104T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)


The first filesystem does not get an MDS assigned to rank 0;
we already tried setting max_mds to 1, but to no avail.
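
For reference, the max_mds change we attempted looked like this (a minimal sketch using the standard ceph CLI, with the filesystem name ABC as shown above):

```shell
# Attempted mitigation: shrink the filesystem back to a single active MDS
# so that only rank 0 needs to come up. This did not help in our case.
ceph fs set ABC max_mds 1

# Check afterwards whether a standby gets assigned to rank 0.
ceph fs status ABC
```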

The second filesystem's MDS shows "replay" for a while and then
crashes in the rejoin phase with:

  -92> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 handle_mds_map i am now mds.0.501522
   -91> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 handle_mds_map state change up:reconnect --> up:rejoin
   -90> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 rejoin_start
   -89> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 rejoin_joint_start
   -88> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfece err -22/0
   -87> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
   -86> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
   -85> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -84> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
   -83> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
   -82> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -81> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ebd err -22/-22
   -80> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ecd err -22/-22
   -79> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9ea err -22/-22
   -78> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
   -77> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9c3 err -22/-22
   -76> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc978 err -22/-22
   -75> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
   -74> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
   -73> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -72> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
   -71> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7e err -22/-22
   -70> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
   -69> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
   -68> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
   -67> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -66> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ebd err -22/-22
   -65> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ecd err -22/-22
   -64> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9ea err -22/-22
   -63> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc978 err -22/-22
   -62> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x3000069373a err -22/-22
   -61> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012dc5d8 err -22/-22
   -60> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a32e8 err -22/-22
   -59> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000696952 err -22/-22
   -58> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -57> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
   -56> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
   -55> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -54> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7e err -22/-22
   -53> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -52> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012c3a0e err -22/-22
   -51> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -50> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
   -49> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7f err -22/-22
   -48> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
   -47> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
   -46> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
   -45> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
   -44> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac2 err -22/-22
   -43> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -42> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -41> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
   -40> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -39> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012c3a0e err -22/-22
   -38> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
   -37> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x40000d63bff err -22/-22
   -36> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -35> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
   -34> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
   -33> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac2 err -22/-22
   -32> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -31> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac4 err -22/-22
   -30> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -29> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -28> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
   -27> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
   -26> 2024-05-06T16:07:15.530+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -25> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -24> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -23> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
   -22> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
   -21> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a5fc4 err -22/-22
   -20> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a634d err -22/-22
   -19> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a63bf err -22/-22
   -18> 2024-05-06T16:07:15.542+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -17> 2024-05-06T16:07:15.542+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -16> 2024-05-06T16:07:15.546+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -15> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
   -14> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
   -13> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a5fc4 err -22/-22
   -12> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a634d err -22/-22
   -11> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a63bf err -22/-22
   -10> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bfd9c err -22/-22
    -9> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bfb78 err -22/-22
    -8> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
    -7> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
    -6> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
    -5> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
    -4> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
    -3> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
    -2> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x40000d5a226 err -22/-22
    -1> 2024-05-06T16:07:15.634+0000 7f1921e91700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()' thread 7f1921e91700 time 2024-05-06T16:07:15.635683+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f1930ad94a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f1930ad9669]
 3: (MDCache::rejoin_send_rejoins()+0x216b) [0x5614ac8747eb]
 4: (MDCache::process_imported_caps()+0x1993) [0x5614ac872353]
 5: (Context::complete(int)+0xd) [0x5614ac6e182d]
 6: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 7: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
 8: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x156) [0x5614aca765a6]
 9: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 10: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
 11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0x138) [0x5614ac867168]
 12: (MDCache::_open_ino_backtrace_fetched(inodeno_t, ceph::buffer::v15_2_0::list&, int)+0x290) [0x5614ac87ff90]
 13: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 14: (MDSIOContextBase::complete(int)+0x534) [0x5614aca426e4]
 15: (Finisher::finisher_thread_entry()+0x18d) [0x7f1930b7884d]
 16: /lib64/libpthread.so.0(+0x81ca) [0x7f192fac81ca]
 17: clone()
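
The excerpt above is from the MDS daemon log; on a cephadm-deployed cluster like ours it can be retrieved with commands along these lines (the daemon name matches the MDS shown in ceph fs status):

```shell
# Pull the daemon log for the crashing MDS (name as listed by 'ceph orch ps').
cephadm logs --name mds.DEF.ceph06.etthum

# The assert is also recorded by the crash module; list and inspect it with:
ceph crash ls
```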

How do we solve this issue?

Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



