Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Hi Ivan,

This looks similar to the issue [0] that we're already addressing at [1].
Essentially, some out-of-sync event led the client to use inodes that the
MDS wasn't aware of / isn't tracking, hence the crash. It'd be really
helpful if you could provide us with more logs.
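
To illustrate the failure mode, here is a simplified sketch (this is not
Ceph's actual code, and the inode numbers below are made up). As far as I
can tell from your log ("... prealloc [] used 0x1005cd4fe35"), replay erases
each "used" inode from an interval_set<inodeno_t> of inode numbers the MDS
believes it handed out. If the replayed event references an inode that set
never contained, the erase trips the ceph_assert(p->first <= start) check in
interval_set.h, which is the assertion in your backtrace:

    // Simplified sketch of Ceph's interval_set<inodeno_t>: runs of inode
    // numbers stored as {start -> length} in a std::map, with erase()
    // asserting that the value being erased lies inside a tracked run.
    #include <cassert>
    #include <cstdint>
    #include <map>

    using inodeno_t = uint64_t;

    struct SimpleIntervalSet {
      std::map<inodeno_t, inodeno_t> m;  // run start -> run length

      void insert(inodeno_t start, inodeno_t len) { m[start] = len; }

      // Roughly mirrors interval_set::erase(ino, 1): find the run that
      // should contain 'ino', then assert that it actually does.
      void erase(inodeno_t ino) {
        auto p = m.upper_bound(ino);   // first run starting after 'ino'
        if (p != m.begin())
          --p;                         // only candidate that could hold 'ino'

        assert(p != m.end());
        // Analogue of "FAILED ceph_assert(p->first <= start)": trips when
        // 'ino' was never part of any run in this set.
        assert(p->first <= ino);
        assert(ino < p->first + p->second);  // and within the run's extent

        m.erase(p);  // a real implementation would split/shrink the run
      }
    };

    int main() {
      SimpleIntervalSet prealloc;
      // Pretend the MDS handed out these two runs of inode numbers.
      prealloc.insert(0x1005cd4f000ULL, 0x100);
      prealloc.insert(0x1005dd00000ULL, 0x100);

      prealloc.erase(0x1005cd4f010ULL);  // fine: inside the first run

      // Replaying "used 0x1005cd4fe35" when that inode falls in a gap the
      // MDS never preallocated aborts on assert(p->first <= ino), much like
      // the MDS does during EMetaBlob::replay.
      prealloc.erase(0x1005cd4fe35ULL);  // aborts here
      return 0;
    }

In other words, the journal itself can be structurally intact (as your
"journal inspect" output shows) while still referencing an inode the MDS
isn't tracking.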

CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar <vshankar@xxxxxxxxxx>
@Patrick Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li <xiubli@xxxxxxxxxx>

[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251
--
*Dhairya Parmar*

Associate Software Engineer, CephFS

<https://www.redhat.com/>IBM, Inc.

On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

> Hello,
>
> We have been experiencing a serious issue with our CephFS backup cluster
> running Quincy (version 17.2.7) on a RHEL8-derivative Linux kernel
> (Alma 8.9, kernel 4.18.0-513.9.1), where the MDSes for our filesystem are
> constantly in a "replay" or "replay(laggy)" state and keep crashing.
>
> We have a single-MDS filesystem called "ceph_backup" with 2 standby
> MDSes, along with a second, mostly unused filesystem "ceph_archive"
> (which holds little to no data). The "ceph_backup" filesystem is the one
> we use to back up our data, and it is the one that is currently broken.
> The Ceph health outputs currently are:
>
>     root@pebbles-s1 14:05 [~]: ceph -s
>        cluster:
>          id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>          health: HEALTH_WARN
>                  1 filesystem is degraded
>                  insufficient standby MDS daemons available
>                  1319 pgs not deep-scrubbed in time
>                  1054 pgs not scrubbed in time
>
>        services:
>          mon: 4 daemons, quorum
>     pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
>          mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4,
>     pebbles-s3, pebbles-s1
>          mds: 2/2 daemons up
>          osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37
>     remapped pgs
>
>        data:
>          volumes: 1/2 healthy, 1 recovering
>          pools:   7 pools, 2177 pgs
>          objects: 3.55G objects, 7.0 PiB
>          usage:   8.9 PiB used, 14 PiB / 23 PiB avail
>          pgs:     83133528/30006841533 objects misplaced (0.277%)
>                   2090 active+clean
>                   47   active+clean+scrubbing+deep
>                   29   active+remapped+backfilling
>                   8    active+remapped+backfill_wait
>                   2    active+clean+scrubbing
>                   1    active+clean+snaptrim
>
>        io:
>          recovery: 1.9 GiB/s, 719 objects/s
>
>     root@pebbles-s1 14:09 [~]: ceph fs status
>     ceph_backup - 0 clients
>     ===========
>     RANK      STATE         MDS      ACTIVITY   DNS    INOS   DIRS   CAPS
>       0    replay(laggy)  pebbles-s3               0      0      0      0
>              POOL            TYPE     USED  AVAIL
>         mds_backup_fs      metadata  1255G  2780G
>     ec82_primary_fs_data    data       0   2780G
>            ec82pool          data    8442T  3044T
>     ceph_archive - 2 clients
>     ============
>     RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
>       0    active  pebbles-s2  Reqs:    0 /s  13.4k  7105    118      2
>              POOL            TYPE     USED  AVAIL
>         mds_archive_fs     metadata  5184M  2780G
>     ec83_primary_fs_data    data       0   2780G
>            ec83pool          data     138T  2767T
>     MDS version: ceph version 17.2.7
>     (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>     root@pebbles-s1 14:09 [~]: ceph health detail | head
>     HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS
>     daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not
>     scrubbed in time
>     [WRN] FS_DEGRADED: 1 filesystem is degraded
>          fs ceph_backup is degraded
>     [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons
>     available
>          have 0; want 1 more
>
> When our cluster first came up after a reboot, Ceph cycled through the 2
> standby MDSes, crashing each of them, until it reached the final MDS,
> which is now stuck in this "replay(laggy)" state. After putting our MDSes
> into debug mode, we can see that the MDS crashed while replaying the
> journal for a particular inode (this is the same for all the MDSes; they
> all crash on the same object):
>
>     ...
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>     EMetaBlob.replay for [521,head] had [inode 0x1005ba89481
>     [...539,head]
>
> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
>     auth fragtree_t(*^2 00*^3 00000*^4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4
>     01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4
>     01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3
>     10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22T05:41:10.213700+0100 1281276=1281276+0) n(v12
>     rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1)
>     old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>     EMetaBlob.replay dir 0x1005ba89481.011011000*
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>     EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000*
>
> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
>     [2,head] auth v=436385 cv=0/0 state=107374182
>     4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12
>     rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0)
>     hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>     EMetaBlob.replay added (full) [dentry
>
> #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion
>     lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1
>     0x557710444500]
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>     EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head]
>
> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0
>     1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680]
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10
>     mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>     EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head]
>
> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth
>     v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
>     cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680]
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>     EMetaBlob.replay inotable tablev 3112837 <= table 3112837
>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>     EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc
>     [] used 0x1005cd4fe35
>     2024-06-24T13:44:55.563+0100 7f8811c40700 -1
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>     In function 'void interval_set<T, C>::erase(T, T,
>     std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
>     thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>     568: FAILED ceph_assert(p->first <= start)
>
>       ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2)
>     quincy (stable)
>       1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>     const*)+0x135) [0x7f8821e814a3]
>       2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
>       3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t,
>     std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
>       4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
>     MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
>       5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
>       6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
>       7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
>       8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
>       9: clone()
>
> I've only included a short section of the crash log (this is the first
> backtrace for the crash, captured with the debug_mds option set to 10/20).
> We tried deleting the 0x1005cd4fe35 object from the object store using
> the "rados" command, but this did not allow our MDS to replay successfully.
>
> From my understanding the journal seems okay, as we didn't, for example,
> run out of space on our metadata pool, and "cephfs-journal-tool journal
> inspect" doesn't report any damage:
>
>     root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0
>     journal inspect
>     Overall journal integrity: OK
>     root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0
>     event get --inode 1101069090357 summary
>     Events by type:
>        OPEN: 1
>        UPDATE: 3
>     Errors: 0
>     root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0
>     event get --inode 1101069090357 list
>     2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE:  (openc)
>
> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>     2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE:
>     (check_inode_max_size)
>
> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>
> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>     2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN:  ()
>
> FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>     2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE:  (rename)
>
> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>
> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
>
> I was wondering whether anyone had any advice on how we should proceed?
> We were thinking about manually applying these events (via "event apply")
> and, failing that, erasing the problematic event with
> "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode
> 1101069090357". Is this a good idea? We would rather not rebuild the
> entire metadata pool if we can avoid it (once was enough for us), as this
> cluster has ~9 PB of data on it.
>
> Kindest regards,
>
> Ivan Clayson
>
> --
> Ivan Clayson
> -----------------
> Scientific Computing Officer
> Room 2N249
> Structural Studies
> MRC Laboratory of Molecular Biology
> Francis Crick Ave, Cambridge
> CB2 0QH
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



