MDS stuck in replay and continually crashing during replay

Hello everyone,

We're having an issue with our backup filesystem running Reef (18.2.4) on an AlmaLinux 9.4 (5.14 kernel) cluster, where our filesystem "ceph_spare" is in a constant degraded state:

   root@pebbles-n4 11:49 [~]: ceph fs status
   ceph_spare - 360 clients
   ==========
   RANK  STATE      MDS      ACTIVITY   DNS    INOS   DIRS   CAPS
     0    replay  pebbles-s2            14.1M  6858k   240k     0
               POOL               TYPE     USED  AVAIL
           mds_spare_fs         metadata  1451G  3216G
   ec82_spare_primary_fs_data    data       0   3216G
          ec82pool_spare          data    8820T  4626T
   STANDBY MDS
     pebbles-s1
     pebbles-s2
   MDS version: ceph version 18.2.4
   (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)
   root@pebbles-n4 11:49 [~]: ceph health detail
   HEALTH_WARN 1 filesystem is degraded; 1 large omap objects; 1 MDSs
   report oversized cache; 1 MDSs behind on trimming
   [WRN] FS_DEGRADED: 1 filesystem is degraded
        fs ceph_spare is degraded
   [WRN] LARGE_OMAP_OBJECTS: 1 large omap objects
        1 large objects found in pool 'mds_spare_fs'
        Search the cluster log for 'Large omap object found' for more
   details.
   [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.pebbles-s2(mds.0): MDS cache is too large (24GB/4GB); 0
   inodes in use by clients, 0 stray files
   [WRN] MDS_TRIM: 1 MDSs behind on trimming
        mds.pebbles-s2(mds.0): Behind on trimming (69293/128)
   max_segments: 128, num_segments: 69293

Our MDS is stuck in replay and keeps crashing before it finishes, despite a restart, a reboot, and wiping the client sessions ("ceph config set mds mds_wipe_sessions true"). When following the replay, we've noticed that the MDS always crashes just before the "journal_read_pos" reaches the "journal_write_pos", at which point the MDS (running with debug_mds at 10/20) logs the following:

   ...
       -14> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
   EMetaBlob.replay dir 0x1002a7c9f3a.010001*
       -13> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
   EMetaBlob.replay updated dir [dir 0x1002a7c9f3a.010001*
   /cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_Sma
   llHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/
   [2,head] auth v=283167 cv=0/0 state=1073741824 f(v0
   m2024-09-30T02:20:11.052706+0100 4262=4262+0) n(v37
   rc2024-09-30T02:20:11.052706+0100
     b4168080204 4262=4262+0) hs=877+853,ss=3+0 dirty=1733 | child=1
   0x5642e8d5d680]
       -12> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 20
   mds.0.cache.dir(0x1002a7c9f3a.010001*) lookup_exact_snap (head,
   'FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks
   _plot_0012.ef980630.partial')
       -11> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 20
   mds.0.cache.dir(0x1002a7c9f3a.010001*) lookup_exact_snap (head,
   'FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks
   _plot_0012.ef980630.partial')
       -10> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 12
   mds.0.cache.dir(0x1002a7c9f3a.010001*) add_null_dentry [dentry
   #0x1/cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallH
   ole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial
   [126,head] auth NULL (dversion lo
   ck) pv=0 v=283167 ino=(nil) state=1073741824 0x564f219fca00]
        -9> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
   EMetaBlob.replay added (full) [dentry
   #0x1/cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_P
   u_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial
   [126,head] auth NULL (dversion lock) v=28316
   6 ino=(nil) state=1610612736 | dirty=1 0x564f219fca00]
        -8> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 12
   mds.0.cache.dir(0x1002a7c9f3a.010001*) link_primary_inode [dentry
   #0x1/cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_Sma
   llHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_46164
   89_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial
   [126,head] auth NULL (dversion lock) v=283166 ino=(nil)
   state=1610612736 | dirty=1 0x564f219fca00] [inode 0x1002c5f1d10
   [126,head] #1002c5f1d10 auth v283166 s=0 n(v0 1=1+0) (iversion lock)
   cr={159821641=0-4194304@125} 0x564f21a00000]
        -7> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
   EMetaBlob.replay added [inode 0x1002c5f1d10 [126,head]
   /cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial
   auth v283166 s=0 n(v0 1=1+0) (iversion lock)
   cr={159821641=0-4194304@125} 0x564f21a00000]
        -6> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10
   mds.0.cache.ino(0x1002c5f1d10) mark_dirty_parent
        -5> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
   EMetaBlob.replay noting opened inode [inode 0x1002c5f1d10 [126,head]
   /cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial
   auth v283166 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
   cr={159821641=0-4194304@125} | dirtyparent=1 dirty=1 0x564f21a00000]
        -4> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
   EMetaBlob.replay inotable tablev 1481899 <= table 1481899
        -3> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
   EMetaBlob.replay sessionmap v 746010300, table 746010299 prealloc []
   used 0x1002c5f1d10
        -2> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 20 mds.0.journal 
   (session prealloc
   [0x1002bdf9f74~0x22,0x1002bdfb51d~0x7c,0x1002c59def8~0xfb,0x1002c5eec0c~0x86,0x1002c5eee87~0x1f5])
        -1> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 -1
   /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.4/rpm/el9/BUILD/ceph-18.2.4/src/include/interval_set.h:
   In function 'void interval_set<T, C>::erase(T, T,
   std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
   thread 7fa8b6d95640 time 2024-10-02T15:20:52.495403+0100
   /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.4/rpm/el9/BUILD/ceph-18.2.4/src/include/interval_set.h:
   568: FAILED ceph_assert(p->first <= start)


     ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d)
   reef (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
   const*)+0x12e) [0x7fa8c416b04d]
     2: /usr/lib64/ceph/libceph-common.so.2(+0x16b20b) [0x7fa8c416b20b]
     3: /usr/bin/ceph-mds(+0x1ef0fe) [0x5642daf490fe]
     4: /usr/bin/ceph-mds(+0x1ef145) [0x5642daf49145]
     5: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
   MDPeerUpdate*)+0x4bad) [0x5642db17b67d]
     6: (EUpdate::replay(MDSRank*)+0x5d) [0x5642db18477d]
     7: (MDLog::_replay_thread()+0x75e) [0x5642db12d00e]
     8: /usr/bin/ceph-mds(+0x13c561) [0x5642dae96561]
     9: /lib64/libc.so.6(+0x89c02) [0x7fa8c3889c02]
     10: /lib64/libc.so.6(+0x10ec40) [0x7fa8c390ec40]

         0> 2024-10-02T15:20:52.495+0100 7fa8b6d95640 -1 *** Caught
   signal (Aborted) **
     in thread 7fa8b6d95640 thread_name:md_log_replay

     ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d)
   reef (stable)
     1: /lib64/libc.so.6(+0x3e6f0) [0x7fa8c383e6f0]
     2: /lib64/libc.so.6(+0x8b94c) [0x7fa8c388b94c]
     3: raise()
     4: abort()
     5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
   const*)+0x188) [0x7fa8c416b0a7]
     6: /usr/lib64/ceph/libceph-common.so.2(+0x16b20b) [0x7fa8c416b20b]
     7: /usr/bin/ceph-mds(+0x1ef0fe) [0x5642daf490fe]
     8: /usr/bin/ceph-mds(+0x1ef145) [0x5642daf49145]
     9: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
   MDPeerUpdate*)+0x4bad) [0x5642db17b67d]
     10: (EUpdate::replay(MDSRank*)+0x5d) [0x5642db18477d]
     11: (MDLog::_replay_thread()+0x75e) [0x5642db12d00e]
     12: /usr/bin/ceph-mds(+0x13c561) [0x5642dae96561]
     13: /lib64/libc.so.6(+0x89c02) [0x7fa8c3889c02]
     14: /lib64/libc.so.6(+0x10ec40) [0x7fa8c390ec40]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is
   needed to interpret this.

   --- logging levels ---
       0/ 5 none
       0/ 1 lockdep
       0/ 1 context
       1/ 1 crush
      10/20 mds
       1/ 5 mds_balancer
       1/ 5 mds_locker
       1/ 5 mds_log
       1/ 5 mds_log_expire
       1/ 5 mds_migrator
       0/ 1 buffer
       0/ 1 timer
       0/ 1 filer
       0/ 1 striper
       0/ 1 objecter
       0/ 5 rados
       0/ 5 rbd
       0/ 5 rbd_mirror
       0/ 5 rbd_replay
       0/ 5 rbd_pwl
       0/ 5 journaler
       0/ 5 objectcacher
       0/ 5 immutable_obj_cache
       0/ 5 client
       1/ 5 osd
       0/ 5 optracker
       0/ 5 objclass
       1/ 3 filestore
       1/ 3 journal
       0/ 0 ms
       1/ 5 mon
       0/10 monc
       1/ 5 paxos
       0/ 5 tp
       1/ 5 auth
       1/ 5 crypto
       1/ 1 finisher
       1/ 1 reserver
       1/ 5 heartbeatmap
       1/ 5 perfcounter
       1/ 5 rgw
       1/ 5 rgw_sync
       1/ 5 rgw_datacache
       1/ 5 rgw_access
       1/ 5 rgw_dbstore
       1/ 5 rgw_flight
       1/ 5 javaclient
       1/ 5 asok
       1/ 1 throttle
       0/ 0 refs
       1/ 5 compressor
       1/ 5 bluestore
       1/ 5 bluefs
       1/ 3 bdev
       1/ 5 kstore
       4/ 5 rocksdb
       4/ 5 leveldb
       1/ 5 fuse
       2/ 5 mgr
       1/ 5 mgrc
       1/ 5 dpdk
       1/ 5 eventtrace
       1/ 5 prioritycache
       0/ 5 test
       0/ 5 cephfs_mirror
       0/ 5 cephsqlite
       0/ 5 seastore
       0/ 5 seastore_onode
       0/ 5 seastore_odata
       0/ 5 seastore_omap
       0/ 5 seastore_tm
       0/ 5 seastore_t
       0/ 5 seastore_cleaner
       0/ 5 seastore_epm
       0/ 5 seastore_lba
       0/ 5 seastore_fixedkv_tree
       0/ 5 seastore_cache
       0/ 5 seastore_journal
       0/ 5 seastore_device
       0/ 5 seastore_backref
       0/ 5 alienstore
       1/ 5 mclock
       0/ 5 cyanstore
       1/ 5 ceph_exporter
       1/ 5 memstore
      -2/-2 (syslog threshold)
      -1/-1 (stderr threshold)
   --- pthread ID / name mapping for recent threads ---
      7fa8b6d95640 / md_log_replay
      max_recent     10000
      max_new         1000
      log_file /var/log/ceph/ceph-mds.pebbles-s3.log
   --- end dump of recent events ---

Our MDS then starts the replay again from the beginning and continually re-replays the journal until it crashes at the same point.
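
For context, we were following the replay position by polling the MDS status on the MDS host; it was roughly the following (the daemon name is ours, and we're relying on the "replay_status" section of the status output reporting the journal read/write positions while the rank is in up:replay):

   # run on the host carrying the active rank; polls the replay position every 10s
   watch -n 10 "ceph daemon mds.pebbles-s2 status"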

From what I understand, our journal has become corrupted at this file. Worryingly, the journal is also exceptionally large: we've had to use a machine with 2 TiB of storage just to try and export it. What is causing this issue? Can we make small modifications to the journal to rectify it, or move the faulty journal object out of the bulk object store so that the transaction fails (and is thus skipped)? We really do not want to go through disaster recovery again (https://docs.ceph.com/en/reef/cephfs/disaster-recovery-experts/#disaster-recovery-experts), as this is the second time this has happened to this cluster in the last 4 months, and it took over a month to recover the data last time.
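
To be explicit about the kind of journal surgery we had in mind (and have not yet run), it would be something along the lines of the cephfs-journal-tool steps below, based on the disaster-recovery docs; the event range is just a placeholder for whatever is sitting at the crash point, and we're not sure splicing a single event out is actually a sane thing to do here:

   # back up the journal first (this is the ~2 TiB export mentioned above)
   cephfs-journal-tool --rank=ceph_spare:0 journal export /mnt/big-volume/ceph_spare-journal.bin
   # check the journal header and overall integrity
   cephfs-journal-tool --rank=ceph_spare:0 journal inspect
   # summarise the events around the crash point to identify the offending entry
   cephfs-journal-tool --rank=ceph_spare:0 event get --range=<start>..<end> summary
   # if it is safe, cut just that entry out rather than resetting the whole journal
   cephfs-journal-tool --rank=ceph_spare:0 event splice --range=<start>..<end> summary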

Kindest regards,

Ivan

--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



