Hi Dhairya,
Thank you for your rapid reply. I tried recovering the dentries for the
file involved in the crash I mentioned earlier and then splicing its
transactions out of the journal (roughly as sketched below). That seemed
to resolve the issue for that inode, but the MDS then crashed on the next
inode in the journal during replay. Splicing out all the transactions
involving the directory housing the files that appear to trigger these
crashes only left the MDS unable to even start replay.
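For reference, the attempt looked roughly like this (a sketch from memory of
the cephfs-journal-tool invocations; the INODE variable is just a placeholder
for the inode whose events we recovered and spliced):

# Placeholder: decimal inode number whose events we recovered and spliced
# (substitute the inode in question).
INODE=1101069090357

# Write the dentries/inodes referenced by that inode's journal events back
# into the metadata store, then cut those events out of the journal.
cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries --inode "$INODE" summary
cephfs-journal-tool --rank=ceph_backup:0 event splice --inode "$INODE" summary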
I've rolled our journal back to the original version from when the crash
first happened, and the full MDS log for the crash can be found here:
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
Please let us know if you would like any other log files, as we can
easily reproduce this crash.
Kindest regards,
Ivan
On 25/06/2024 09:58, Dhairya Parmar wrote:
Hi Ivan,
This looks similar to the issue [0] that we're already addressing at [1].
Essentially, some out-of-sync event led the client to use inodes that the
MDS wasn't aware of/isn't tracking, hence the crash. It would be really
helpful if you could provide us with more logs.
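If it helps, capturing the next replay attempt with MDS debug logging turned
up would be ideal; a rough sketch (the exact levels are only a suggestion):

# Raise MDS debug verbosity (very chatty; revert once the crash is captured).
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# Let the MDS attempt replay and crash again, then grab the MDS log
# (typically /var/log/ceph/ceph-mds.<daemon-name>.log) and revert:
ceph config set mds debug_mds 1/5
ceph config set mds debug_ms 0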
CC @Rishabh Dave <mailto:ridave@xxxxxxxxxx> @Venky Shankar
<mailto:vshankar@xxxxxxxxxx> @Patrick Donnelly
<mailto:pdonnell@xxxxxxxxxx> @Xiubo Li <mailto:xiubli@xxxxxxxxxx>
[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251
--
Dhairya Parmar
Associate Software Engineer, CephFS
IBM, Inc. <https://www.redhat.com/>
On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
Hello,
We have been experiencing a serious issue with our CephFS backup cluster
running Quincy (version 17.2.7) on a RHEL8-derivative Linux kernel
(Alma 8.9, kernel 4.18.0-513.9.1): the MDSes for our filesystem are
constantly in a "replay" or "replay(laggy)" state and keep crashing.
We have a single-MDS filesystem called "ceph_backup" with 2 standby
MDSes, along with a second, mostly unused filesystem "ceph_archive" that
holds little to no data. The "ceph_backup" filesystem is the one we use
to back up our data, and it is the one that is currently broken. The
current Ceph health outputs are:
root@pebbles-s1 14:05 [~]: ceph -s
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1319 pgs not deep-scrubbed in time
            1054 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
    mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
    mds: 2/2 daemons up
    osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs

  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   7 pools, 2177 pgs
    objects: 3.55G objects, 7.0 PiB
    usage:   8.9 PiB used, 14 PiB / 23 PiB avail
    pgs:     83133528/30006841533 objects misplaced (0.277%)
             2090 active+clean
             47   active+clean+scrubbing+deep
             29   active+remapped+backfilling
             8    active+remapped+backfill_wait
             2    active+clean+scrubbing
             1    active+clean+snaptrim

  io:
    recovery: 1.9 GiB/s, 719 objects/s
root@pebbles-s1 14:09 [~]: ceph fs status
ceph_backup - 0 clients
===========
RANK      STATE          MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0    replay(laggy)   pebbles-s3                  0      0      0      0
         POOL             TYPE      USED   AVAIL
    mds_backup_fs        metadata  1255G   2780G
 ec82_primary_fs_data      data        0   2780G
       ec82pool            data    8442T   3044T
ceph_archive - 2 clients
============
RANK   STATE       MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0     active   pebbles-s2   Reqs: 0 /s  13.4k   7105    118      2
         POOL             TYPE      USED   AVAIL
    mds_archive_fs       metadata  5184M   2780G
 ec83_primary_fs_data      data        0   2780G
       ec83pool            data     138T   2767T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
root@pebbles-s1 14:09 [~]: ceph health detail | head
HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not scrubbed in time
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ceph_backup is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
When the cluster first came up after a reboot, Ceph cycled through the 2
standby MDSes, crashing each of them, until it reached the final MDS,
which is now stuck in this "replay(laggy)" state. With the MDSes in debug
mode, we can see that the MDS crashes while replaying the journal for a
particular inode (this is the same for all of the MDSes; they all crash
on the same object):
...
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay for [521,head] had [inode 0x1005ba89481 [...539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ auth fragtree_t(*^2 00*^3 00000*^4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4 01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22T05:41:10.213700+0100 1281276=1281276+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1) old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay dir 0x1005ba89481.011011000*
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ [2,head] auth v=436385 cv=0/0 state=1073741824 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 0x557710444500]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay inotable tablev 3112837 <= table 3112837
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc [] used 0x1005cd4fe35
2024-06-24T13:44:55.563+0100 7f8811c40700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)
 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f8821e814a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
 5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
 6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
 7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
 9: clone()
I've only included a short section of the crash above (this is the first
trace relating to the crash in a log taken with debug_mds set to 10/20).
We tried deleting the 0x1005cd4fe35 object from the object store using
the "rados" command, but this did not allow our MDS to replay
successfully. As far as I understand, the journal itself seems okay: we
did not, for example, run out of space on our metadata pool, and
"cephfs-journal-tool journal inspect" does not report any damage:
root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0 journal inspect
Overall journal integrity: OK
root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 summary
Events by type:
  OPEN: 1
  UPDATE: 3
Errors: 0
root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 list
2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: (openc)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE: (check_inode_max_size)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN: ()
  FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: (rename)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
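(As a side note for cross-referencing: the inode appears in hex in the MDS log
and in decimal on the cephfs-journal-tool command line; the two forms refer to
the same inode, e.g.:)

# 0x1005cd4fe35 in the MDS log and 1101069090357 above are the same inode.
printf '%d\n' 0x1005cd4fe35    # -> 1101069090357
printf '0x%x\n' 1101069090357  # -> 0x1005cd4fe35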
I was wondering whether anyone had any advice on how we should proceed?
We were thinking about manually applying these events (via "event apply")
and, failing that, erasing the problematic events with
"cephfs-journal-tool --rank=ceph_backup:0 event splice --inode
1101069090357" (a rough sketch of what we had in mind is below). Is this
a good idea? We would rather not rebuild the entire metadata pool if we
can avoid it (once was enough for us), as this cluster holds ~9 PB of
data.
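Roughly, the plan we had in mind (the backup path below is only an example;
the inode filter is the decimal form of 0x1005cd4fe35):

# Take a raw backup of rank 0's journal first, so it can be re-imported
# if applying/splicing makes matters worse (path is just an example).
cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup.rank0.journal.bin

# First attempt: apply the events for the problematic inode to the metadata store.
cephfs-journal-tool --rank=ceph_backup:0 event apply --inode 1101069090357 summary

# Fallback: cut those events out of the journal entirely.
cephfs-journal-tool --rank=ceph_backup:0 event splice --inode 1101069090357 summary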
Kindest regards,
Ivan Clayson
--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx