Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Hi Dhairya,

I would be more than happy to share our corrupted journal. Has the host key changed for drop.ceph.com? The fingerprint I'm being sent is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is different to the one in our /usr/share/ceph/known_hosts_drop.ceph.com.

Thank you for your advice as well. We've reset our MDS's journal and are currently running a full filesystem scrub, which is understandably taking quite a bit of time but seems to be progressing through the objects fine.
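
(In case it's useful to anyone following the thread: scrub progress can be checked with the scrub status command, something along the lines of

    $ ceph tell mds.ceph_backup:0 scrub status

assuming rank 0 of our ceph_backup filesystem.)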

Thank you ever so much for all your help and please do feel free to follow up with us if you would like any further details about our crash!

Kindest regards,

Ivan

On 27/06/2024 12:39, Dhairya Parmar wrote:

Hi Ivan,

The solution (which has been successful for us in the past) is to reset the journal. This would bring the fs back online and return the MDSes to a stable state, but some data would be lost—the data in the journal that hasn't been flushed to the backing store would be gone. Therefore, you should try to flush out as much journal data as possible before resetting the journal.

Here are the steps for this entire process:

1) Bring the FS offline
$ ceph fs fail <fs_name>

2) Recover dentries from the journal (run this for every MDS rank)
$ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries summary

3) Reset the journal (again, for every MDS rank)
$ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset

4) Bring the FS online
$ ceph fs set <fs_name> joinable true

5) Restart the MDSes

6) Perform scrub to ensure consistency of fs
$ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
# e.g. you could try a recursive scrub: `ceph tell mds.<fs_name>:0 scrub start / recursive`

Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the most time-consuming; the rest should be quick).
* cephfs-journal-tool and the metadata OSDs will use a significant amount of CPU. This is because cephfs-journal-tool has to read through the journal data and flush it out to the backing store, which in turn drives a burst of metadata operations, so the OSDs end up consuming a large share of CPU as well.
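
To make this concrete: for a filesystem named ceph_backup with a single active MDS (so rank 0 only), the whole sequence would look roughly like the following (a sketch only; adjust the filesystem name and ranks to your setup):

$ ceph fs fail ceph_backup
$ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
$ cephfs-journal-tool --rank=ceph_backup:0 journal reset
$ ceph fs set ceph_backup joinable true
# restart the MDS daemons, then start a recursive scrub:
$ ceph tell mds.ceph_backup:0 scrub start / recursive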

Do let me know how this goes.

On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

    Hi Dhairya,

    We can induce the crash by simply restarting the MDS, and it seems
    to happen when an MDS goes from up:standby to up:replay. The MDS
    works through a few files in the log before eventually crashing;
    I've included the logs for this here (this is after I imported the
    backed-up journal, which I hope was successful, but please let me
    know if you suspect it wasn't!):
    https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

    With respect to the client logs, are you referring to the clients
    that are writing to the filesystem? We don't typically run them in
    any sort of debug mode, and we have quite a few machines running
    our backup system, but we can look at the hour or so before the
    first MDS crash (though I don't know if this is when the de-sync
    occurred). Here are some MDS logs from around the initial crash on
    Saturday morning, which may be helpful:

           -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
        monclient: tick
           -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
        monclient: _check_auth_rotating have uptodate secrets (they
        expire after 2024-06-22T05:41:13.091556+0100)
           -57> 2024-06-22T05:41:43.208+0100 7f184de84700  1
        mds.pebbles-s2 Updating MDS map to version 2529650 from mon.3
           -56> 2024-06-22T05:41:43.208+0100 7f184de84700  4
        mds.0.purge_queue operator():  data pool 6 not found in OSDMap
           -55> 2024-06-22T05:41:43.208+0100 7f184de84700  4
        mds.0.purge_queue operator():  data pool 3 not found in OSDMap
           -54> 2024-06-22T05:41:43.209+0100 7f184de84700  5
        asok(0x5592e7968000) register_command objecter_requests hook
        0x5592e78f8800
           -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10
        monclient: _renew_subs
           -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10
        monclient: _send_mon_message to mon.pebbles-s4 at
        v2:10.1.5.134:3300/0
           -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10
        log_channel(cluster) update_config to_monitors: true
        to_syslog: false syslog_facility:  prio: info to_graylog:
        false graylog_host: 127.0.0.1 graylog_port: 12201)
           -50> 2024-06-22T05:41:43.209+0100 7f184de84700  4
        mds.0.purge_queue operator():  data pool 6 not found in OSDMap
           -49> 2024-06-22T05:41:43.209+0100 7f184de84700  4
        mds.0.purge_queue operator():  data pool 3 not found in OSDMap
           -48> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.0
        apply_blocklist: killed 0, blocklisted sessions (0 blocklist
        entries, 0)
           -47> 2024-06-22T05:41:43.209+0100 7f184de84700  1
        mds.0.2529650 handle_mds_map i am now mds.0.2529650
           -46> 2024-06-22T05:41:43.209+0100 7f184de84700  1
        mds.0.2529650 handle_mds_map state change up:standby --> up:replay
           -45> 2024-06-22T05:41:43.209+0100 7f184de84700  5
        mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay
           -44> 2024-06-22T05:41:43.209+0100 7f184de84700  1
        mds.0.2529650 replay_start
           -43> 2024-06-22T05:41:43.209+0100 7f184de84700  1
        mds.0.2529650  waiting for osdmap 473739 (which blocklists
        prior instance)
           -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10
        monclient: _send_mon_message to mon.pebbles-s4 at
        v2:10.1.5.134:3300/0
           -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700  2
        mds.0.cache Memory usage:  total 299012, rss 37624, heap
        182556, baseline 182556, 0 / 0 inodes have caps, 0 caps, 0
        caps per inode
           -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10
        monclient: _renew_subs
           -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10
        monclient: _send_mon_message to mon.pebbles-s4 at
        v2:10.1.5.134:3300/0
           -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10
        monclient: handle_get_version_reply finishing 1 version 473739
           -37> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
        mds.0.2529650 Booting: 0: opening inotable
           -36> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
        mds.0.2529650 Booting: 0: opening sessionmap
           -35> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
        mds.0.2529650 Booting: 0: opening mds log
           -34> 2024-06-22T05:41:43.224+0100 7f1847e78700  5 mds.0.log
        open discovering log bounds
           -33> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
        mds.0.2529650 Booting: 0: opening purge queue (async)
           -32> 2024-06-22T05:41:43.224+0100 7f1847e78700  4
        mds.0.purge_queue open: opening
           -31> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
        mds.0.journaler.pq(ro) recover start
           -30> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
        mds.0.journaler.pq(ro) read_head
           -29> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
        mds.0.2529650 Booting: 0: loading open file table (async)
           -28> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
        mds.0.2529650 Booting: 0: opening snap table
           -27> 2024-06-22T05:41:43.224+0100 7f1847677700  4
        mds.0.journalpointer Reading journal pointer '400.00000000'
           -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10
        monclient: get_auth_request con 0x5592e8987000 auth_method 0
           -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10
        monclient: get_auth_request con 0x5592e8987c00 auth_method 0
           -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1
        mds.0.journaler.pq(ro) _finish_read_head loghead(trim
        231160676352, expire 231163662875, write 231163662875,
        stream_format 1).  probing for end of log (from 231163662875)...
           -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1
        mds.0.journaler.pq(ro) probing for end of the log
           -22> 2024-06-22T05:41:43.252+0100 7f1847677700  1
        mds.0.journaler.mdlog(ro) recover start
           -21> 2024-06-22T05:41:43.252+0100 7f1847677700  1
        mds.0.journaler.mdlog(ro) read_head
           -20> 2024-06-22T05:41:43.252+0100 7f1847677700  4 mds.0.log
        Waiting for journal 0x200 to recover...
           -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10
        monclient: get_auth_request con 0x5592e8bc6000 auth_method 0
           -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10
        monclient: get_auth_request con 0x5592e8bc6800 auth_method 0
           -17> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
        mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim
        90131453181952, expire 90131465778558, write 90132009715463,
        stream_format 1).  probing for end of log (from 90132009715463)...
           -16> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
        mds.0.journaler.mdlog(ro) probing for end of the log
           -15> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
        mds.0.journaler.mdlog(ro) _finish_probe_end write_pos =
        90132019384791 (header had 90132009715463). recovered.
           -14> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log
        Journal 0x200 recovered.
           -13> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log
        Recovered journal 0x200 in format 1
           -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1
        mds.0.journaler.pq(ro) _finish_probe_end write_pos =
        231163662875 (header had 231163662875). recovered.
           -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700  4
        mds.0.purge_queue operator(): open complete
           -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1
        mds.0.journaler.pq(ro) set_writeable
            -9> 2024-06-22T05:41:43.441+0100 7f1847e78700  2
        mds.0.2529650 Booting: 1: loading/discovering base inodes
            -8> 2024-06-22T05:41:43.441+0100 7f1847e78700  0
        mds.0.cache creating system inode with ino:0x100
            -7> 2024-06-22T05:41:43.442+0100 7f1847e78700  0
        mds.0.cache creating system inode with ino:0x1
            -6> 2024-06-22T05:41:43.442+0100 7f1847e78700  2
        mds.0.2529650 Booting: 2: replaying mds log
            -5> 2024-06-22T05:41:43.442+0100 7f1847e78700  2
        mds.0.2529650 Booting: 2: waiting for purge queue recovered
            -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10
        monclient: tick
            -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10
        monclient: _check_auth_rotating have uptodate secrets (they
        expire after 2024-06-22T05:41:14.091638+0100)
            -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700  2
        mds.0.cache Memory usage:  total 588368, rss 308304, heap
        207132, baseline 182556, 0 / 15149 inodes have caps, 0 caps, 0
        caps per inode
            -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1
        /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
        In function 'void interval_set<T, C>::erase(T, T,
        std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
        thread 7f1846675700 time 2024-06-22T05:41:44.643146+0100

         ceph version 17.2.7
        (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
         1: (ceph::__ceph_assert_fail(char const*, char const*, int,
        char const*)+0x135) [0x7f18568b64a3]
         2: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
        [0x7f18568b6669]
         3: (interval_set<inodeno_t, std::map>::erase(inodeno_t,
        inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5)
        [0x5592e5027885]
         4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
        MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
         5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
         6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
         7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
         8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
         9: clone()

             0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 ***
        Caught signal (Aborted) **
         in thread 7f1846675700 thread_name:md_log_replay

         ceph version 17.2.7
        (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
         1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0]
         2: gsignal()
         3: abort()
         4: (ceph::__ceph_assert_fail(char const*, char const*, int,
        char const*)+0x18f) [0x7f18568b64fd]
         5: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
        [0x7f18568b6669]
         6: (interval_set<inodeno_t, std::map>::erase(inodeno_t,
        inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5)
        [0x5592e5027885]
         7: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
        MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
         8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
         9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
         10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
         11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
         12: clone()

    We normally run with a relatively low debug setting, so
    unfortunately I don't think many details of the initial crash were
    captured; the MDS log entries before the above (i.e. "-60" and
    older) are just beacon messages and _check_auth_rotating checks.

    I was wondering whether you have any recommendations on what
    actions we could take to bring our filesystem back into a working
    state, short of rebuilding the entire metadata pool? We are quite
    keen to bring our backup back into service urgently, as we
    currently do not have any accessible backups for our Ceph clusters.

    Kindest regards,

    Ivan

    On 25/06/2024 19:18, Dhairya Parmar wrote:



    On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson
    <ivan@xxxxxxxxxxxxxxxxx> wrote:

        Hi Dhairya,

        Thank you for your rapid reply. I tried recovering the
        dentries for the file just before the crash I mentioned
        earlier and then splicing those transactions from the journal,
        which seemed to resolve the issue for that inode but resulted
        in the MDS crashing on the next inode in the journal when
        performing replay.

    The MDS delegates a range of preallocated inodes (in the form of a
    set, interval_set<inodeno_t> preallocated_inos) to the clients, so
    the untracked entry could be a single inode, several inodes from
    the range, or in the worst case all of them, and this is something
    that even `cephfs-journal-tool` cannot tell you (since we're
    talking about MDS internals that aren't exposed to such tools).
    That is why you see the "MDS crashing on the next inode in the
    journal when performing replay".

    An option could be to expose the inode set through some tool or
    asok command so that such inode ranges can be identified, but that
    needs to be discussed. For now, we're trying to address this in
    [0]; you can follow the discussion there.

    [0] https://tracker.ceph.com/issues/66251

        Removing from the journal all the transactions involving the
        directory housing the files that seemed to cause these crashes
        only caused the MDS to fail to even start replay.

        I've rolled our journal back to the original version from when
        the crash first happened, and the entire MDS log for the crash
        can be found here:
        https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24

    Awesome, this would help us a ton. Apart from this, would it be
    possible to send us client logs?

        Please let us know if you would like any other log files, as
        we can easily induce this crash.

    Since you can easily induce the crash, could you please share the
    reproducer, i.e. the exact actions you take in order to hit this?

        Kindest regards,

        Ivan

        On 25/06/2024 09:58, Dhairya Parmar wrote:

        Hi Ivan,

        This looks to be similar to the issue [0] that we're already
        addressing at [1]. Basically, some out-of-sync event led the
        client to make use of inodes that the MDS wasn't aware of /
        isn't tracking, hence the crash. It'd be really helpful if you
        could provide us with more logs.

        CC @Rishabh Dave @Venky Shankar @Patrick Donnelly @Xiubo Li

        [0] https://tracker.ceph.com/issues/61009
        [1] https://tracker.ceph.com/issues/66251
        --
        Dhairya Parmar

        Associate Software Engineer, CephFS

        IBM, Inc.


        On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson
        <ivan@xxxxxxxxxxxxxxxxx> wrote:

            Hello,

            We have been experiencing a serious issue with our CephFS
            backup cluster running quincy (version 17.2.7) on a
            RHEL8-derivative distribution (Alma 8.9, 4.18.0-513.9.1
            kernel), where the MDSes for our filesystem are constantly
            in a "replay" or "replay(laggy)" state and keep crashing.

            We have a single-MDS filesystem called "ceph_backup" with
            2 standby MDSes, along with a 2nd, mostly unused
            filesystem "ceph_archive" (this holds little to no data).
            The "ceph_backup" filesystem is the one we use to back up
            our data, and it is the one that is currently broken. The
            current Ceph health outputs are:

                root@pebbles-s1 14:05 [~]: ceph -s
                  cluster:
                    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
                    health: HEALTH_WARN
                            1 filesystem is degraded
                            insufficient standby MDS daemons available
                            1319 pgs not deep-scrubbed in time
                            1054 pgs not scrubbed in time

                  services:
                    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
                    mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
                    mds: 2/2 daemons up
                    osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs

                  data:
                    volumes: 1/2 healthy, 1 recovering
                    pools:   7 pools, 2177 pgs
                    objects: 3.55G objects, 7.0 PiB
                    usage:   8.9 PiB used, 14 PiB / 23 PiB avail
                    pgs:     83133528/30006841533 objects misplaced (0.277%)
                             2090 active+clean
                             47   active+clean+scrubbing+deep
                             29   active+remapped+backfilling
                             8    active+remapped+backfill_wait
                             2    active+clean+scrubbing
                             1    active+clean+snaptrim

                  io:
                    recovery: 1.9 GiB/s, 719 objects/s

                root@pebbles-s1 14:09 [~]: ceph fs status
                ceph_backup - 0 clients
                ===========
                RANK      STATE         MDS       ACTIVITY   DNS    INOS   DIRS   CAPS
                 0    replay(laggy)  pebbles-s3               0      0      0      0
                         POOL            TYPE     USED  AVAIL
                    mds_backup_fs      metadata  1255G  2780G
                ec82_primary_fs_data     data       0   2780G
                       ec82pool          data    8442T  3044T
                ceph_archive - 2 clients
                ============
                RANK  STATE     MDS        ACTIVITY     DNS    INOS   DIRS   CAPS
                 0    active  pebbles-s2  Reqs:    0 /s  13.4k   7105    118     2
                         POOL            TYPE     USED  AVAIL
                    mds_archive_fs     metadata  5184M  2780G
                ec83_primary_fs_data     data       0   2780G
                       ec83pool          data     138T  2767T
                MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
                root@pebbles-s1 14:09 [~]: ceph health detail | head
                HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not scrubbed in time
                [WRN] FS_DEGRADED: 1 filesystem is degraded
                    fs ceph_backup is degraded
                [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
                    have 0; want 1 more

            When our cluster first came back up after a reboot, Ceph
            ran through the 2 standby MDSes, crashing them all, until
            it reached the final MDS, which is now stuck in this
            "replay(laggy)" state. Putting our MDSes into debugging
            mode, we can see that this MDS crashed when replaying the
            journal for a particular inode (this is the same for all
            the MDSes and they all crash on the same object):

                ...
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.journal
                EMetaBlob.replay for [521,head] had [inode 0x1005ba89481
                [...539,head]
            /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
                auth fragtree_t(*^2 00*^3 00000*^
                4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4
            00110*^4 00111*^4
                01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4
            01101*^4 01110*^4
                01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4
            10100*^4 10101*^3
                10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22
                T05:41:10.213700+0100 1281276=1281276+0) n(v12
                rc2024-06-22T05:41:10.213700+0100 b1348251683896
            1281277=1281276+1)
                old_inodes=8 (iversion lock) | dirfrag=416 dirty=1
            0x55770a2bdb80]
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.journal
                EMetaBlob.replay dir 0x1005ba89481.011011000*
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.journal
                EMetaBlob.replay updated dir [dir
            0x1005ba89481.011011000*
            /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
                [2,head] auth v=436385 cv=0/0 state=107374182
                4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0)
            n(v12
                rc2024-06-22T05:41:10.213700+0100 b2120744220
            2502=2502+0)
                hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.journal
                EMetaBlob.replay added (full) [dentry
            #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti
                ons_ave_Z124.mrc.teberet7.partial [539,head] auth
            NULL (dversion
                lock) v=436384 ino=(nil) state=1610612800|bottomlru
            | dirty=1
                0x557710444500]
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.journal
                EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head]
            /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_
            005006_fractions_ave_Z124.mrc.teberet7.partial auth
            v436384 s=0 n(v0
                1=1+0) (iversion lock) cr={99995144=0-4194304@538}
            0x557710438680]
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
                mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.journal
                EMetaBlob.replay noting opened inode [inode
            0x1005cd4fe35 [539,head]
            /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762
            6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
            auth
                v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
                cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1
            0x557710438680]
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.journal
                EMetaBlob.replay inotable tablev 3112837 <= table
            3112837
                2024-06-24T13:44:55.563+0100 7f8811c40700 10
            mds.0.journal
                EMetaBlob.replay sessionmap v 1560540883, table
            1560540882 prealloc
                [] used 0x1005cd4fe35
                2024-06-24T13:44:55.563+0100 7f8811c40700 -1
            /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                In function 'void interval_set<T, C>::erase(T, T,
                std::function<bool(T, T)>) [with T = inodeno_t; C =
            std::map]'
                thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
            /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                568: FAILED ceph_assert(p->first <= start)

                  ceph version 17.2.7
            (b12291d110049b2f35e32e0de30d70e9a4c060d2)
                quincy (stable)
                  1: (ceph::__ceph_assert_fail(char const*, char
            const*, int, char
                const*)+0x135) [0x7f8821e814a3]
                  2: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
            [0x7f8821e81669]
                  3: (interval_set<inodeno_t,
            std::map>::erase(inodeno_t, inodeno_t,
                std::function<bool (inodeno_t, inodeno_t)>)+0x2e5)
            [0x5576f9bb2885]
                  4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
                MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
                  5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
                  6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
                  7: (MDLog::ReplayThread::entry()+0x11)
            [0x5576f9af5041]
                  8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
                  9: clone()

            I've only included a short section of the crash (this is
            the first trace in the log relating to the crash, captured
            with a 10/20 debug_mds option). We tried deleting the
            0x1005cd4fe35 object from the object store using the
            "rados" command, but this did not allow our MDS to
            successfully replay.

            From my understanding, the journal seems okay, as we
            didn't, for example, run out of space on our metadata
            pool, and "cephfs-journal-tool journal inspect" doesn't
            seem to think there is any damage:

                root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0 journal inspect
                Overall journal integrity: OK
                root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 summary
                Events by type:
                  OPEN: 1
                  UPDATE: 3
                Errors: 0
                root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 list
                2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE:  (openc)
                  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE:  (check_inode_max_size)
                  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN:  ()
                  FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE:  (rename)
                  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc

            I was wondering whether anyone had any advice on how we
            should proceed? We were thinking about manually applying
            these events (via "event apply") and, failing that,
            erasing this problematic event with "cephfs-journal-tool
            --rank=ceph_backup:0 event splice --inode 1101069090357".
            Is this a good idea? We would rather not rebuild the
            entire metadata pool if we can avoid it (once was enough
            for us), as this cluster has ~9 PB of data on it.
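
            (If we do go down the splice route, we would of course
            take a copy of the journal first so that we can roll back,
            presumably with something along the lines of:

                $ cephfs-journal-tool --rank=ceph_backup:0 journal export <backup-file>

            where <backup-file> is a placeholder for wherever we would
            keep such a backup.)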

            Kindest regards,

            Ivan Clayson




--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



