CephFS MDS crashing during replay with standby MDSes crashing afterwards

Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> · Mon, 24 Jun 2024 15:07:51 +0100

Hello,

We are experiencing a serious issue with our CephFS backup cluster,
which runs Quincy (version 17.2.7) on a RHEL 8 derivative (AlmaLinux
8.9, kernel 4.18.0-513.9.1): the MDSes for our filesystem are constantly
in a "replay" or "replay(laggy)" state and keep crashing.

We have a single-MDS filesystem called "ceph_backup" with 2 standby
MDSes, along with a second, unused filesystem "ceph_archive" (this holds
little to no data). We use "ceph_backup" to back up our data, and this
is the filesystem that is currently broken. The current Ceph health
outputs are:

   root@pebbles-s1 14:05 [~]: ceph -s
      cluster:
        id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
        health: HEALTH_WARN
                1 filesystem is degraded
                insufficient standby MDS daemons available
                1319 pgs not deep-scrubbed in time
                1054 pgs not scrubbed in time

      services:
        mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
        mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
        mds: 2/2 daemons up
        osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs

      data:
        volumes: 1/2 healthy, 1 recovering
        pools:   7 pools, 2177 pgs
        objects: 3.55G objects, 7.0 PiB
        usage:   8.9 PiB used, 14 PiB / 23 PiB avail
        pgs:     83133528/30006841533 objects misplaced (0.277%)
                 2090 active+clean
                 47   active+clean+scrubbing+deep
                 29   active+remapped+backfilling
                 8    active+remapped+backfill_wait
                 2    active+clean+scrubbing
                 1    active+clean+snaptrim

      io:
        recovery: 1.9 GiB/s, 719 objects/s

   root@pebbles-s1 14:09 [~]: ceph fs status
   ceph_backup - 0 clients
   ===========
   RANK      STATE         MDS      ACTIVITY   DNS    INOS   DIRS   CAPS
     0    replay(laggy)  pebbles-s3               0      0      0      0
            POOL            TYPE     USED  AVAIL
       mds_backup_fs      metadata  1255G  2780G
   ec82_primary_fs_data    data       0   2780G
          ec82pool          data    8442T  3044T
   ceph_archive - 2 clients
   ============
   RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
     0    active  pebbles-s2  Reqs:    0 /s  13.4k  7105    118      2
            POOL            TYPE     USED  AVAIL
       mds_archive_fs     metadata  5184M  2780G
   ec83_primary_fs_data    data       0   2780G
          ec83pool          data     138T  2767T
   MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

   root@pebbles-s1 14:09 [~]: ceph health detail | head
   HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not scrubbed in time
   [WRN] FS_DEGRADED: 1 filesystem is degraded
        fs ceph_backup is degraded
   [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
        have 0; want 1 more

When the cluster first came up after a reboot, Ceph went through the 2
standby MDSes, crashing each in turn, until it reached the final MDS,
which is now stuck in this "replay(laggy)" state. With MDS debug logging
enabled, we can see that the MDS crashes while replaying the journal for
a particular inode (every MDS crashes at the same point, on the same
object):

   ...
   2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
   EMetaBlob.replay for [521,head] had [inode 0x1005ba89481
   [...539,head]
   /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
   auth fragtree_t(*^2 00*^3 00000*^4
   00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4
   01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4
   01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3
   10110*^4 10111*^4 11*^6) v10880645 f(v0
   m2024-06-22T05:41:10.213700+0100 1281276=1281276+0) n(v12
   rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1)
   old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
   2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
   EMetaBlob.replay dir 0x1005ba89481.011011000*
   2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
   EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000*
   /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
   [2,head] auth v=436385 cv=0/0 state=1073741824
   f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12
   rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0)
   hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
   2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
   EMetaBlob.replay added (full) [dentry
   #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
   [539,head] auth NULL (dversion lock) v=436384 ino=(nil)
   state=1610612800|bottomlru | dirty=1 0x557710444500]
   2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
   EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head]
   /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
   auth v436384 s=0 n(v0 1=1+0) (iversion lock)
   cr={99995144=0-4194304@538} 0x557710438680]
   2024-06-24T13:44:55.563+0100 7f8811c40700 10
   mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
   2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
   EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head]
   /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
   auth v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
   cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680]
   2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
   EMetaBlob.replay inotable tablev 3112837 <= table 3112837
   2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
   EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc
   [] used 0x1005cd4fe35
   2024-06-24T13:44:55.563+0100 7f8811c40700 -1
   /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
   In function 'void interval_set<T, C>::erase(T, T,
   std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
   thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
   /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)

     ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2)
   quincy (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
   const*)+0x135) [0x7f8821e814a3]
     2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
     3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t,
   std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
     4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
   MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
     5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
     6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
     7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
     8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
     9: clone()
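
The failed check at the end of the trace above, ceph_assert(p->first <=
start) in interval_set::erase, reads (going by the assert text alone) as
a test that the range being erased starts inside an interval the set
already holds. The Python below is only a toy model of that idea, not
Ceph's interval_set code: the class name, its methods and the sample
numbers are all hypothetical and chosen for illustration.

```python
import bisect


class IntervalSet:
    """Toy set of disjoint [start, start + length) intervals.

    Illustration only; this is NOT Ceph's interval_set implementation.
    """

    def __init__(self):
        self.starts = []  # sorted interval start values
        self.lens = {}    # start -> length

    def insert(self, start, length):
        # Assumes the caller never inserts overlapping intervals.
        bisect.insort(self.starts, start)
        self.lens[start] = length

    def _find(self, start):
        # Index of the interval containing `start`, or, if there is none,
        # of the next interval above it (may be len(self.starts)).
        i = bisect.bisect_right(self.starts, start)
        if i > 0:
            prev = self.starts[i - 1]
            if prev + self.lens[prev] > start:
                return i - 1
        return i

    def erase(self, start, length):
        i = self._find(start)
        assert i < len(self.starts), "no interval at or above start"
        first = self.starts[i]
        # Analogue of the check named in the trace: the interval we found
        # must begin at or before the range we were asked to erase.
        assert first <= start, "p->first <= start"
        before = start - first
        old_len = self.lens[first]
        assert old_len >= before + length, "range runs past the interval"
        after = old_len - before - length
        del self.starts[i]
        del self.lens[first]
        if before:
            self.insert(first, before)
        if after:
            self.insert(start + length, after)


free = IntervalSet()
free.insert(100, 10)   # covers 100..109
free.erase(103, 2)     # fine: 103..104 lie inside 100..109
try:
    free.erase(50, 1)  # 50 is not inside any interval held by the set
except AssertionError as err:
    print("erase failed:", err)
```

In this model the second erase fails because the value being removed is
not present in the set, which would be consistent with replay trying to
remove an inode number from a set that does not contain it. Whether that
is what actually happens inside the MDS here is an assumption.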

I've only included a short section of the log (this is the first trace
related to the crash, captured with debug_mds set to 10/20). We tried
deleting the 0x1005cd4fe35 object from the object store using the
"rados" command, but this did not allow our MDS to replay successfully.

As far as I can tell the journal itself is okay: we did not, for
example, run out of space on our metadata pool, and "cephfs-journal-tool
journal inspect" does not report any damage:

   root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0
   journal inspect
   Overall journal integrity: OK
   root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0
   event get --inode 1101069090357 summary
   Events by type:
      OPEN: 1
      UPDATE: 3
   Errors: 0
   root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0
   event get --inode 1101069090357 list
   2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE:  (openc)
   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
   2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE:
   (check_inode_max_size)
   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
   2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN:  ()
   FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
   2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE:  (rename)
   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
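
The decimal value passed to --inode above is the same inode that the MDS
debug log prints in hex (0x1005cd4fe35). A minimal sketch of the
conversion follows; the helper and constant names are only illustrative.

```python
# Inode number as printed in the MDS debug log (hexadecimal).
LOG_INODE_HEX = "0x1005cd4fe35"
# Inode number as passed to cephfs-journal-tool --inode (decimal).
JOURNAL_TOOL_INODE = 1101069090357


def hex_ino_to_decimal(hex_ino: str) -> int:
    """Convert a hex inode string (with or without 0x) to an integer."""
    return int(hex_ino, 16)


def decimal_ino_to_hex(ino: int) -> str:
    """Convert an integer inode number to the 0x-prefixed hex form."""
    return hex(ino)


print(hex_ino_to_decimal(LOG_INODE_HEX))       # 1101069090357
print(decimal_ino_to_hex(JOURNAL_TOOL_INODE))  # 0x1005cd4fe35
```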

Does anyone have any advice on how we should proceed? We were thinking
of manually applying these events (via "event apply") and, failing that,
erasing this problematic event with "cephfs-journal-tool
--rank=ceph_backup:0 event splice --inode 1101069090357". Is this a good
idea? We would rather not rebuild the entire metadata pool if we can
avoid it (once was enough for us), as this cluster has ~9 PB of data on
it.

Kindest regards,

Ivan Clayson

--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx