Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Hi Dhairya,

We can induce the crash simply by restarting the MDS, and it seems to happen when the MDS transitions from up:standby to up:replay. The MDS works through a few files in the log before eventually crashing; I've included the logs for this here (this is after I imported the backed-up journal, which I hope was successful, but please let me know if you suspect it wasn't!): https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
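
For reference, the journal backup and re-import were done along these lines (the backup path is just an example):

    # take/keep a backup of the journal before any edits
    cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup.journal.bin
    # re-import the backed-up journal to roll back to it
    cephfs-journal-tool --rank=ceph_backup:0 journal import /root/ceph_backup.journal.bin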

With respect to the client logs, are you referring to the clients that write to the filesystem? We don't normally run them in any debug mode, and we have quite a few machines running our backup system, but we can look at the hour or so before the first MDS crash (though I don't know whether that is when the de-sync occurred). Here are some MDS logs from the initial crash on Saturday morning, though, which may be helpful:

       -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
       -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient:
   _check_auth_rotating have uptodate secrets (they expire after
   2024-06-22T05:41:13.091556+0100)
       -57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2
   Updating MDS map to version 2529650 from mon.3
       -56> 2024-06-22T05:41:43.208+0100 7f184de84700  4
   mds.0.purge_queue operator():  data pool 6 not found in OSDMap
       -55> 2024-06-22T05:41:43.208+0100 7f184de84700  4
   mds.0.purge_queue operator():  data pool 3 not found in OSDMap
       -54> 2024-06-22T05:41:43.209+0100 7f184de84700  5
   asok(0x5592e7968000) register_command objecter_requests hook
   0x5592e78f8800
       -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
   _renew_subs
       -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
   _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
       -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10
   log_channel(cluster) update_config to_monitors: true to_syslog:
   false syslog_facility:  prio: info to_graylog: false graylog_host:
   127.0.0.1 graylog_port: 12201)
       -50> 2024-06-22T05:41:43.209+0100 7f184de84700  4
   mds.0.purge_queue operator():  data pool 6 not found in OSDMap
       -49> 2024-06-22T05:41:43.209+0100 7f184de84700  4
   mds.0.purge_queue operator():  data pool 3 not found in OSDMap
       -48> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.0
   apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
       -47> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
   handle_mds_map i am now mds.0.2529650
       -46> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
   handle_mds_map state change up:standby --> up:replay
       -45> 2024-06-22T05:41:43.209+0100 7f184de84700  5
   mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay
       -44> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
   replay_start
       -43> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 
   waiting for osdmap 473739 (which blocklists prior instance)
       -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
   _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
       -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700  2 mds.0.cache
   Memory usage:  total 299012, rss 37624, heap 182556, baseline
   182556, 0 / 0 inodes have caps, 0 caps, 0 caps per inode
       -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
   _renew_subs
       -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
   _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
       -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
   handle_get_version_reply finishing 1 version 473739
       -37> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening inotable
       -36> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening sessionmap
       -35> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening mds log
       -34> 2024-06-22T05:41:43.224+0100 7f1847e78700  5 mds.0.log open
   discovering log bounds
       -33> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening purge queue (async)
       -32> 2024-06-22T05:41:43.224+0100 7f1847e78700  4
   mds.0.purge_queue open: opening
       -31> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
   mds.0.journaler.pq(ro) recover start
       -30> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
   mds.0.journaler.pq(ro) read_head
       -29> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: loading open file table (async)
       -28> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening snap table
       -27> 2024-06-22T05:41:43.224+0100 7f1847677700  4
   mds.0.journalpointer Reading journal pointer '400.00000000'
       -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient:
   get_auth_request con 0x5592e8987000 auth_method 0
       -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10 monclient:
   get_auth_request con 0x5592e8987c00 auth_method 0
       -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1
   mds.0.journaler.pq(ro) _finish_read_head loghead(trim 231160676352,
   expire 231163662875, write 231163662875, stream_format 1).  probing
   for end of log (from 231163662875)...
       -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1
   mds.0.journaler.pq(ro) probing for end of the log
       -22> 2024-06-22T05:41:43.252+0100 7f1847677700  1
   mds.0.journaler.mdlog(ro) recover start
       -21> 2024-06-22T05:41:43.252+0100 7f1847677700  1
   mds.0.journaler.mdlog(ro) read_head
       -20> 2024-06-22T05:41:43.252+0100 7f1847677700  4 mds.0.log
   Waiting for journal 0x200 to recover...
       -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10 monclient:
   get_auth_request con 0x5592e8bc6000 auth_method 0
       -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10 monclient:
   get_auth_request con 0x5592e8bc6800 auth_method 0
       -17> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
   mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim
   90131453181952, expire 90131465778558, write 90132009715463,
   stream_format 1).  probing for end of log (from 90132009715463)...
       -16> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
   mds.0.journaler.mdlog(ro) probing for end of the log
       -15> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
   mds.0.journaler.mdlog(ro) _finish_probe_end write_pos =
   90132019384791 (header had 90132009715463). recovered.
       -14> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log
   Journal 0x200 recovered.
       -13> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log
   Recovered journal 0x200 in format 1
       -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1
   mds.0.journaler.pq(ro) _finish_probe_end write_pos = 231163662875
   (header had 231163662875). recovered.
       -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700  4
   mds.0.purge_queue operator(): open complete
       -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1
   mds.0.journaler.pq(ro) set_writeable
        -9> 2024-06-22T05:41:43.441+0100 7f1847e78700  2 mds.0.2529650
   Booting: 1: loading/discovering base inodes
        -8> 2024-06-22T05:41:43.441+0100 7f1847e78700  0 mds.0.cache
   creating system inode with ino:0x100
        -7> 2024-06-22T05:41:43.442+0100 7f1847e78700  0 mds.0.cache
   creating system inode with ino:0x1
        -6> 2024-06-22T05:41:43.442+0100 7f1847e78700  2 mds.0.2529650
   Booting: 2: replaying mds log
        -5> 2024-06-22T05:41:43.442+0100 7f1847e78700  2 mds.0.2529650
   Booting: 2: waiting for purge queue recovered
        -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: tick
        -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient:
   _check_auth_rotating have uptodate secrets (they expire after
   2024-06-22T05:41:14.091638+0100)
        -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700  2 mds.0.cache
   Memory usage:  total 588368, rss 308304, heap 207132, baseline
   182556, 0 / 15149 inodes have caps, 0 caps, 0 caps per inode
        -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1
   /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
   In function 'void interval_set<T, C>::erase(T, T,
   std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
   thread 7f1846675700 time 2024-06-22T05:41:44.643146+0100

     ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2)
   quincy (stable)
     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
   const*)+0x135) [0x7f18568b64a3]
     2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669]
     3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t,
   std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885]
     4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
   MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
     5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
     6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
     7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
     8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
     9: clone()

         0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 *** Caught
   signal (Aborted) **
     in thread 7f1846675700 thread_name:md_log_replay

     ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2)
   quincy (stable)
     1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0]
     2: gsignal()
     3: abort()
     4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
   const*)+0x18f) [0x7f18568b64fd]
     5: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669]
     6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t,
   std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885]
     7: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
   MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
     8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
     9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
     10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
     11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
     12: clone()

We normally run with a relatively low debug setting, so unfortunately I don't think many details of the initial crash were captured; the MDS log entries before the above (i.e. "-60" and older) are just beacon messages and _check_auth_rotating checks.
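
If it would help for the next reproduction, we can raise the MDS debug levels beforehand, something along the lines of:

    ceph config set mds debug_mds 20
    ceph config set mds debug_journaler 20
    # revert once the crash has been captured
    ceph config rm mds debug_mds
    ceph config rm mds debug_journaler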

Do you have any recommendations for actions we could take to bring our filesystem back into a working state, short of rebuilding the entire metadata pool? We are keen to bring our backup system back into service urgently, as we currently have no accessible backups for our Ceph clusters.

Kindest regards,

Ivan

On 25/06/2024 19:18, Dhairya Parmar wrote:

On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

    Hi Dhairya,

    Thank you for your rapid reply. I tried recovering the dentries
    for the file just before the crash I mentioned and then splicing
    those transactions out of the journal. That seemed to clear the
    issue for that inode, but it resulted in the MDS crashing on the
    next inode in the journal during replay.
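
    Roughly, the commands used were along these lines (using the inode
    from the crash, 0x1005cd4fe35 = 1101069090357, as an example):

        # write the dentries for the affected inode back into the metadata store,
        # then splice its events out of the journal
        cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries --inode 1101069090357 summary
        cephfs-journal-tool --rank=ceph_backup:0 event splice --inode 1101069090357 summary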

The MDS delegates ranges of preallocated inodes (in the form of a set, interval_set<inodeno_t> preallocated_inos) to the clients, so the untracked portion could be a single inode, some of the inodes in a range, or in the worst case all of them, and this is something even `cephfs-journal-tool` cannot tell you (we're talking about MDS internals that aren't exposed to such tools). That is why you see the MDS crashing on the next inode in the journal during replay.
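
For what it's worth, the on-disk inotable and session table can be dumped with something like the commands below, but the per-client delegated ranges live in the MDS's in-memory state, so the mismatch won't show up there:

    cephfs-table-tool ceph_backup:0 show inode
    cephfs-table-tool ceph_backup:0 show session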

An option could be to expose the inode set through some tool or asok command to identify such inode ranges, but that needs to be discussed. For now we're trying to address this in [0]; you can follow the discussion there.

[0] https://tracker.ceph.com/issues/66251

    Removing from the journal all the transactions involving the
    directory that houses the files that seemed to cause these crashes
    only resulted in the MDS failing to even start replay.

    I've rolled back our journal to the original version from when the
    crash first happened, and the entire MDS log for the crash can be
    found here:
    https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24

Awesome, this would help us a ton. Apart from this, would it be possible to send us client logs?
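
If the clients are userspace (ceph-fuse/libcephfs), raising their debug level before reproducing would be ideal, for example something along the lines of:

    ceph config set client debug_client 20
    ceph config set client debug_ms 1
    # kernel clients would need kernel dynamic debug instead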

    Please let us know if you would like any other log files, as we
    can easily induce this crash.

Since you can easily induce the crash, could you please share the reproducer, i.e. the exact actions you take in order to hit it?

    Kindest regards,

    Ivan

    On 25/06/2024 09:58, Dhairya Parmar wrote:

    Hi Ivan,

    This looks similar to the issue [0] that we're already addressing
    at [1]. Basically, some out-of-sync event led the client to make
    use of inodes that the MDS wasn't aware of / isn't tracking, hence
    the crash. It would be really helpful if you could provide us with
    more logs.

    CC @Rishabh Dave, @Venky Shankar, @Patrick Donnelly, @Xiubo Li

    [0] https://tracker.ceph.com/issues/61009
    [1] https://tracker.ceph.com/issues/66251
    --
    Dhairya Parmar
    Associate Software Engineer, CephFS
    IBM, Inc. <https://www.redhat.com/>


    On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson
    <ivan@xxxxxxxxxxxxxxxxx> wrote:

        Hello,

        We have been experiencing a serious issue with our CephFS
        backup cluster running quincy (version 17.2.7) on a
        RHEL8-derivative Linux kernel (Alma8.9, 4.18.0-513.9.1) where
        the MDSes for our filesystem are constantly in a "replay" or
        "replay(laggy)" state and keep crashing.

        We have a single-MDS filesystem called "ceph_backup" with 2
        standby MDSes, along with a second, largely unused filesystem
        "ceph_archive" (which holds little to no data). We use the
        "ceph_backup" filesystem to back up our data, and it is the one
        that is currently broken. The current Ceph health outputs are:

            root@pebbles-s1 14:05 [~]: ceph -s
               cluster:
                 id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
                 health: HEALTH_WARN
                         1 filesystem is degraded
                         insufficient standby MDS daemons available
                         1319 pgs not deep-scrubbed in time
                         1054 pgs not scrubbed in time

               services:
                 mon: 4 daemons, quorum
            pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
                 mgr: pebbles-s2(active, since 36m), standbys:
        pebbles-s4,
            pebbles-s3, pebbles-s1
                 mds: 2/2 daemons up
                 osd: 1380 osds: 1380 up (since 29m), 1379 in (since
        3d); 37
            remapped pgs

               data:
                 volumes: 1/2 healthy, 1 recovering
                 pools:   7 pools, 2177 pgs
                 objects: 3.55G objects, 7.0 PiB
                 usage:   8.9 PiB used, 14 PiB / 23 PiB avail
                 pgs:     83133528/30006841533 objects misplaced (0.277%)
                          2090 active+clean
                          47   active+clean+scrubbing+deep
                          29   active+remapped+backfilling
                          8    active+remapped+backfill_wait
                          2    active+clean+scrubbing
                          1    active+clean+snaptrim

               io:
                 recovery: 1.9 GiB/s, 719 objects/s

            root@pebbles-s1 14:09 [~]: ceph fs status
            ceph_backup - 0 clients
            ===========
            RANK      STATE         MDS       ACTIVITY   DNS   INOS   DIRS   CAPS
             0    replay(laggy)  pebbles-s3                0      0      0      0
                     POOL            TYPE     USED  AVAIL
                mds_backup_fs      metadata  1255G  2780G
            ec82_primary_fs_data    data       0   2780G
                   ec82pool          data    8442T  3044T
            ceph_archive - 2 clients
            ============
            RANK  STATE     MDS        ACTIVITY      DNS    INOS   DIRS   CAPS
             0    active  pebbles-s2  Reqs:    0 /s  13.4k   7105    118      2
                     POOL            TYPE     USED  AVAIL
                mds_archive_fs     metadata  5184M  2780G
            ec83_primary_fs_data    data       0   2780G
                   ec83pool          data     138T  2767T
            MDS version: ceph version 17.2.7
            (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
            root@pebbles-s1 14:09 [~]: ceph health detail | head
            HEALTH_WARN 1 filesystem is degraded; insufficient
        standby MDS
            daemons available; 1319 pgs not deep-scrubbed in time;
        1054 pgs not
            scrubbed in time
            [WRN] FS_DEGRADED: 1 filesystem is degraded
                 fs ceph_backup is degraded
            [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS
        daemons
            available
                 have 0; want 1 more

        When the cluster first came back up after a reboot, Ceph cycled
        through the 2 standby MDSes, crashing each of them, until it
        reached the final MDS, which is now stuck in this
        "replay(laggy)" state. With the MDSes in debug mode, we can see
        that the MDS crashed while replaying the journal for a
        particular inode (this is the same for all the MDSes and they
        all crash on the same object):

            ...
            2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
            EMetaBlob.replay for [521,head] had [inode 0x1005ba89481
            [...539,head]
        /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
            auth fragtree_t(*^2 00*^3 00000*^
            4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4
        00111*^4
            01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4
        01101*^4 01110*^4
            01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4
        10100*^4 10101*^3
            10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22
            T05:41:10.213700+0100 1281276=1281276+0) n(v12
            rc2024-06-22T05:41:10.213700+0100 b1348251683896
        1281277=1281276+1)
            old_inodes=8 (iversion lock) | dirfrag=416 dirty=1
        0x55770a2bdb80]
            2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
            EMetaBlob.replay dir 0x1005ba89481.011011000*
            2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
            EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000*
        /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
            [2,head] auth v=436385 cv=0/0 state=107374182
            4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12
            rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0)
            hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
            2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
            EMetaBlob.replay added (full) [dentry
        #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti
            ons_ave_Z124.mrc.teberet7.partial [539,head] auth NULL
        (dversion
            lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1
            0x557710444500]
            2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
            EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head]
        /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_
            005006_fractions_ave_Z124.mrc.teberet7.partial auth
        v436384 s=0 n(v0
            1=1+0) (iversion lock) cr={99995144=0-4194304@538}
        0x557710438680]
            2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
            2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
            EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35
        [539,head]
        /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762
        6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth
            v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
            cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1
        0x557710438680]
            2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
            EMetaBlob.replay inotable tablev 3112837 <= table 3112837
            2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
            EMetaBlob.replay sessionmap v 1560540883, table
        1560540882 prealloc
            [] used 0x1005cd4fe35
            2024-06-24T13:44:55.563+0100 7f8811c40700 -1
        /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
            In function 'void interval_set<T, C>::erase(T, T,
            std::function<bool(T, T)>) [with T = inodeno_t; C =
        std::map]'
            thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
        /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
            568: FAILED ceph_assert(p->first <= start)

              ceph version 17.2.7
        (b12291d110049b2f35e32e0de30d70e9a4c060d2)
            quincy (stable)
              1: (ceph::__ceph_assert_fail(char const*, char const*,
        int, char
            const*)+0x135) [0x7f8821e814a3]
              2: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
        [0x7f8821e81669]
              3: (interval_set<inodeno_t, std::map>::erase(inodeno_t,
        inodeno_t,
            std::function<bool (inodeno_t, inodeno_t)>)+0x2e5)
        [0x5576f9bb2885]
              4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
            MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
              5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
              6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
              7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
              8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
              9: clone()

        I've only included a short section of the crash (this is the
        first trace relating to the crash in the log, captured with a
        10/20 debug_mds setting). We tried deleting the 0x1005cd4fe35
        object from the object store using the "rados" command, but
        this did not allow our MDS to replay successfully.
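
        (The first data object for an inode is named after the hex
        inode number plus a ".00000000" suffix, so the removal was
        along the lines of the following; the pool here is illustrative
        and depends on the file's layout:)

            rados -p ec82pool rm 1005cd4fe35.00000000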

        From my understanding, the journal itself seems okay: we didn't
        run out of space on the metadata pool, for example, and
        "cephfs-journal-tool journal inspect" doesn't report any
        damage:

            root@pebbles-s1 13:58 [~]: cephfs-journal-tool
        --rank=ceph_backup:0
            journal inspect
            Overall journal integrity: OK
            root@pebbles-s1 14:04 [~]: cephfs-journal-tool
        --rank=ceph_backup:0
            event get --inode 1101069090357 summary
            Events by type:
               OPEN: 1
               UPDATE: 3
            Errors: 0
            root@pebbles-s1 14:05 [~]: cephfs-journal-tool
        --rank=ceph_backup:0
            event get --inode 1101069090357 list
            2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: 
        (openc)
        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
            2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE:
            (check_inode_max_size)
        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
            2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN:  ()
        FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
            2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: 
        (rename)
        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc

        Does anyone have any advice on how we should proceed? We were
        thinking of manually applying these events (via "event apply")
        and, failing that, erasing the problematic event with
        "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode
        1101069090357". Is this a good idea? We would rather not
        rebuild the entire metadata pool if we can avoid it (once was
        enough for us), as this cluster has ~9 PB of data on it.
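
        If we do try either of those, we would take a fresh backup of
        the journal first and then apply/splice just the events for
        that inode, along the lines of:

            cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup.journal.bin
            cephfs-journal-tool --rank=ceph_backup:0 event apply --inode 1101069090357 summary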

        Kindest regards,

        Ivan Clayson

        --
        Ivan Clayson
        -----------------
        Scientific Computing Officer
        Room 2N249
        Structural Studies
        MRC Laboratory of Molecular Biology
        Francis Crick Ave, Cambridge
        CB2 0QH
        _______________________________________________
        ceph-users mailing list -- ceph-users@xxxxxxx
        To unsubscribe send an email to ceph-users-leave@xxxxxxx

    --
    Ivan Clayson
    -----------------
    Scientific Computing Officer
    Room 2N249
    Structural Studies
    MRC Laboratory of Molecular Biology
    Francis Crick Ave, Cambridge
    CB2 0QH

--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



