Hi Dhairya,
Sorry to resurrect this thread again, but unfortunately we still have an
issue with our filesystem after we attempted to write new backups to it.
We finished the scrub of the filesystem on Friday and ran a repair scrub
on the one directory which had metadata damage. After doing so and
rebooting, the cluster reported no issues and data was accessible again.
We re-started the backups to run over the weekend and unfortunately the
filesystem crashed again; the log of the failure is here:
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
We ran the backups on kernel mounts of the filesystem without the
nowsync option this time to avoid the out-of-sync write problems.
I've tried resetting the journal again after recovering the dentries,
but unfortunately the filesystem is still in a failed state despite
setting joinable to true. The log of this crash is here:
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.
I'm not sure how to proceed as I can't seem to get any MDS to take over
rank 0. I would like to run a scrub of the filesystem and preferably
overwrite the troublesome files with the originals from the live
filesystem. Do you have any advice on how to get the filesystem out of
its failed state? I have a backup of the journal from before I reset it,
so I can roll back if necessary.
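For reference, backing up and (if needed) rolling back the journal would be
along these lines with cephfs-journal-tool (rank 0; the file path here is
only a placeholder):
$ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup.rank0.journal.bak
$ cephfs-journal-tool --rank=ceph_backup:0 journal import /root/ceph_backup.rank0.journal.bak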
Here are some details about the filesystem at present:
root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
            1 filesystem is degraded
            1 large omap objects
            1 filesystem is offline
            1 mds daemon damaged
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
            1750 pgs not deep-scrubbed in time
            1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped pgs
         flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs:     11785954/27384310061 objects misplaced (0.043%)
             2167 active+clean
             6    active+remapped+backfilling
             4    active+remapped+backfill_wait
ceph_backup - 0 clients
===========
RANK  STATE   MDS   ACTIVITY   DNS   INOS   DIRS   CAPS
 0    failed
        POOL            TYPE     USED   AVAIL
   mds_backup_fs       metadata  1174G  3071G
ec82_primary_fs_data     data       0   3071G
      ec82pool           data    8085T  4738T
ceph_archive - 2 clients
============
RANK  STATE     MDS       ACTIVITY      DNS    INOS   DIRS   CAPS
 0    active  pebbles-s4  Reqs:  0 /s  13.4k   7105    118      2
        POOL            TYPE     USED   AVAIL
  mds_archive_fs        metadata  5184M  3071G
ec83_primary_fs_data     data       0   3071G
      ec83pool           data     138T  4307T
STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
root@pebbles-s2 11:55 [~]: ceph fs dump
e2643889
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base
v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir
inode in separate object,5=mds uses versioned encoding,6=dirfrag is
stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
Filesystem 'ceph_backup' (1)
fs_name ceph_backup
epoch 2643888
flags 12 joinable allow_snaps allow_multimds_snaps
created 2023-05-19T12:52:36.302135+0100
modified 2024-07-08T11:17:55.437861+0100
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 109934182400000
required_client_features {}
last_failure 0
last_failure_osd_epoch 494515
compat compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in
separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,7=mds uses inline data,8=no anchor table,9=file layout
v2,10=snaprealm v2}
max_mds 1
in 0
up {}
failed
damaged 0
stopped
data_pools [6,3]
metadata_pool 2
inline_data disabled
balancer
standby_count_wanted 1
Kindest regards,
Ivan
On 28/06/2024 15:17, Dhairya Parmar wrote:
On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
wrote:
Hi Dhairya,
I would be more than happy to share our corrupted journal. Has the
host key changed for drop.ceph.com? The fingerprint I'm being sent is
7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4, which is different from
the one in our /usr/share/ceph/known_hosts_drop.ceph.com.
Ah, strange. Let me get in touch with folks who might know about this;
I'll get back to you ASAP.
Thank you for your advice as well. We've reset our MDS's journal
and are currently in the process of a full filesystem scrub, which
understandably is taking quite a bit of time but seems to be
progressing through the objects fine.
YAY!
Thank you ever so much for all your help and please do feel free
to follow up with us if you would like any further details about
our crash!
Glad to hear it went well. This bug is being worked on with high
priority, and once the patch is ready it will be backported.
The root cause of this issue is the `nowsync` (async dirops) option being
enabled by default with kclient [0]. This feature allows asynchronous
creation and deletion of files, optimizing performance by avoiding
round-trip latency for these system calls. However, in very rare cases
(like yours :D) it can affect the system's consistency and stability.
Hence, if this kind of optimization is not a priority for your
workload, I recommend turning it off by switching the mount points to
`wsync` and also setting the MDS config `mds_client_delegate_inos_pct`
to `0`, so that you don't end up in this situation again (until the bug
fix arrives :)).
[0]
https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
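For example, something along these lines (the mount source, mount point
and client name below are only placeholders; adjust them to your setup):
$ mount -t ceph <mon-addr>:/ /mnt/ceph_backup -o name=backup,wsync
$ ceph config set mds mds_client_delegate_inos_pct 0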
Kindest regards,
Ivan
On 27/06/2024 12:39, Dhairya Parmar wrote:
Hi Ivan,
The solution (which has been successful for us in the past) is to
reset the journal. This would bring the fs back online and return
the MDSes to a stable state, but some data would be lost—the data
in the journal that hasn't been flushed to the backing store
would be gone. Therefore, you should try to flush out as much
journal data as possible before resetting the journal.
Here are the steps for this entire process:
1) Bring the FS offline
$ ceph fs fail <fs_name>
2) Recover dentries from journal (run it with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event
recover_dentries summary
3) Reset the journal (again with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
4) Bring the FS online
$ ceph fs set <fs_name> joinable true
5) Restart the MDSes
6) Perform scrub to ensure consistency of fs
$ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
# you could try a recursive scrub maybe `ceph tell
mds.<fs_name>:0 scrub start / recursive`
Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the
most time-consuming, but the rest should be quick).
* cephfs-journal-tool and the metadata OSDs are bound to use a
significant amount of CPU. This is because cephfs-journal-tool
has to read through the journal data and flush it out to the backing
store, which also generates a burst of metadata operations,
resulting in the OSDs taking a significant share of the CPU.
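As a concrete illustration for your setup (assuming the fs name
ceph_backup and a single active rank 0, as in your status output), the
whole sequence would look roughly like:
$ ceph fs fail ceph_backup
$ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
$ cephfs-journal-tool --rank=ceph_backup:0 journal reset
$ ceph fs set ceph_backup joinable true
$ ceph tell mds.ceph_backup:0 scrub start / recursive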
Do let me know how this goes.
On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx> wrote:
Hi Dhairya,
We can induce the crash by simply restarting the MDS, and the
crash seems to happen when an MDS goes from up:standby to
up:replay. The MDS works through a few files in the log
before eventually crashing; I've included the logs for
this here (this is after I imported the backed-up journal,
which I hope was successful, but please let me know if you
suspect it wasn't!):
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
With respect to the client logs, are you referring to the
clients that are writing to the filesystem? We don't typically
run them in any sort of debug mode, and we have quite a few
machines running our backup system, but we can look an hour or
so before the first MDS crash (though I don't know if this is
when the de-sync occurred). Here are some MDS logs regarding
the initial crash on Saturday morning, though, which
may be helpful:
-59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
monclient: tick
-58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
monclient: _check_auth_rotating have uptodate secrets
(they expire after 2024-06-22T05:41:13.091556+0100)
-57> 2024-06-22T05:41:43.208+0100 7f184de84700 1
mds.pebbles-s2 Updating MDS map to version 2529650 from mon.3
-56> 2024-06-22T05:41:43.208+0100 7f184de84700 4
mds.0.purge_queue operator(): data pool 6 not found in OSDMap
-55> 2024-06-22T05:41:43.208+0100 7f184de84700 4
mds.0.purge_queue operator(): data pool 3 not found in OSDMap
-54> 2024-06-22T05:41:43.209+0100 7f184de84700 5
asok(0x5592e7968000) register_command objecter_requests
hook 0x5592e78f8800
-53> 2024-06-22T05:41:43.209+0100 7f184de84700 10
monclient: _renew_subs
-52> 2024-06-22T05:41:43.209+0100 7f184de84700 10
monclient: _send_mon_message to mon.pebbles-s4 at
v2:10.1.5.134:3300/0 <http://10.1.5.134:3300/0>
-51> 2024-06-22T05:41:43.209+0100 7f184de84700 10
log_channel(cluster) update_config to_monitors: true
to_syslog: false syslog_facility: prio: info to_graylog:
false graylog_host: 127.0.0.1 graylog_port: 12201)
-50> 2024-06-22T05:41:43.209+0100 7f184de84700 4
mds.0.purge_queue operator(): data pool 6 not found in OSDMap
-49> 2024-06-22T05:41:43.209+0100 7f184de84700 4
mds.0.purge_queue operator(): data pool 3 not found in OSDMap
-48> 2024-06-22T05:41:43.209+0100 7f184de84700 4
mds.0.0 apply_blocklist: killed 0, blocklisted sessions
(0 blocklist entries, 0)
-47> 2024-06-22T05:41:43.209+0100 7f184de84700 1
mds.0.2529650 handle_mds_map i am now mds.0.2529650
-46> 2024-06-22T05:41:43.209+0100 7f184de84700 1
mds.0.2529650 handle_mds_map state change up:standby -->
up:replay
-45> 2024-06-22T05:41:43.209+0100 7f184de84700 5
mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay
-44> 2024-06-22T05:41:43.209+0100 7f184de84700 1
mds.0.2529650 replay_start
-43> 2024-06-22T05:41:43.209+0100 7f184de84700 1
mds.0.2529650 waiting for osdmap 473739 (which
blocklists prior instance)
-42> 2024-06-22T05:41:43.209+0100 7f184de84700 10
monclient: _send_mon_message to mon.pebbles-s4 at
v2:10.1.5.134:3300/0 <http://10.1.5.134:3300/0>
-41> 2024-06-22T05:41:43.209+0100 7f1849e7c700 2
mds.0.cache Memory usage: total 299012, rss 37624, heap
182556, baseline 182556, 0 / 0 inodes have caps, 0 caps,
0 caps per inode
-40> 2024-06-22T05:41:43.224+0100 7f184de84700 10
monclient: _renew_subs
-39> 2024-06-22T05:41:43.224+0100 7f184de84700 10
monclient: _send_mon_message to mon.pebbles-s4 at
v2:10.1.5.134:3300/0 <http://10.1.5.134:3300/0>
-38> 2024-06-22T05:41:43.224+0100 7f184de84700 10
monclient: handle_get_version_reply finishing 1 version
473739
-37> 2024-06-22T05:41:43.224+0100 7f1847e78700 2
mds.0.2529650 Booting: 0: opening inotable
-36> 2024-06-22T05:41:43.224+0100 7f1847e78700 2
mds.0.2529650 Booting: 0: opening sessionmap
-35> 2024-06-22T05:41:43.224+0100 7f1847e78700 2
mds.0.2529650 Booting: 0: opening mds log
-34> 2024-06-22T05:41:43.224+0100 7f1847e78700 5
mds.0.log open discovering log bounds
-33> 2024-06-22T05:41:43.224+0100 7f1847e78700 2
mds.0.2529650 Booting: 0: opening purge queue (async)
-32> 2024-06-22T05:41:43.224+0100 7f1847e78700 4
mds.0.purge_queue open: opening
-31> 2024-06-22T05:41:43.224+0100 7f1847e78700 1
mds.0.journaler.pq(ro) recover start
-30> 2024-06-22T05:41:43.224+0100 7f1847e78700 1
mds.0.journaler.pq(ro) read_head
-29> 2024-06-22T05:41:43.224+0100 7f1847e78700 2
mds.0.2529650 Booting: 0: loading open file table (async)
-28> 2024-06-22T05:41:43.224+0100 7f1847e78700 2
mds.0.2529650 Booting: 0: opening snap table
-27> 2024-06-22T05:41:43.224+0100 7f1847677700 4
mds.0.journalpointer Reading journal pointer '400.00000000'
-26> 2024-06-22T05:41:43.224+0100 7f1850689700 10
monclient: get_auth_request con 0x5592e8987000 auth_method 0
-25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10
monclient: get_auth_request con 0x5592e8987c00 auth_method 0
-24> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1
mds.0.journaler.pq(ro) _finish_read_head loghead(trim
231160676352, expire 231163662875, write 231163662875,
stream_format 1). probing for end of log (from
231163662875)...
-23> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1
mds.0.journaler.pq(ro) probing for end of the log
-22> 2024-06-22T05:41:43.252+0100 7f1847677700 1
mds.0.journaler.mdlog(ro) recover start
-21> 2024-06-22T05:41:43.252+0100 7f1847677700 1
mds.0.journaler.mdlog(ro) read_head
-20> 2024-06-22T05:41:43.252+0100 7f1847677700 4
mds.0.log Waiting for journal 0x200 to recover...
-19> 2024-06-22T05:41:43.252+0100 7f1850689700 10
monclient: get_auth_request con 0x5592e8bc6000 auth_method 0
-18> 2024-06-22T05:41:43.253+0100 7f185168b700 10
monclient: get_auth_request con 0x5592e8bc6800 auth_method 0
-17> 2024-06-22T05:41:43.257+0100 7f1847e78700 1
mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim
90131453181952, expire 90131465778558, write
90132009715463, stream_format 1). probing for end of log
(from 90132009715463)...
-16> 2024-06-22T05:41:43.257+0100 7f1847e78700 1
mds.0.journaler.mdlog(ro) probing for end of the log
-15> 2024-06-22T05:41:43.257+0100 7f1847e78700 1
mds.0.journaler.mdlog(ro) _finish_probe_end write_pos =
90132019384791 (header had 90132009715463). recovered.
-14> 2024-06-22T05:41:43.257+0100 7f1847677700 4
mds.0.log Journal 0x200 recovered.
-13> 2024-06-22T05:41:43.257+0100 7f1847677700 4
mds.0.log Recovered journal 0x200 in format 1
-12> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1
mds.0.journaler.pq(ro) _finish_probe_end write_pos =
231163662875 (header had 231163662875). recovered.
-11> 2024-06-22T05:41:43.273+0100 7f1848e7a700 4
mds.0.purge_queue operator(): open complete
-10> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1
mds.0.journaler.pq(ro) set_writeable
-9> 2024-06-22T05:41:43.441+0100 7f1847e78700 2
mds.0.2529650 Booting: 1: loading/discovering base inodes
-8> 2024-06-22T05:41:43.441+0100 7f1847e78700 0
mds.0.cache creating system inode with ino:0x100
-7> 2024-06-22T05:41:43.442+0100 7f1847e78700 0
mds.0.cache creating system inode with ino:0x1
-6> 2024-06-22T05:41:43.442+0100 7f1847e78700 2
mds.0.2529650 Booting: 2: replaying mds log
-5> 2024-06-22T05:41:43.442+0100 7f1847e78700 2
mds.0.2529650 Booting: 2: waiting for purge queue recovered
-4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10
monclient: tick
-3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10
monclient: _check_auth_rotating have uptodate secrets
(they expire after 2024-06-22T05:41:14.091638+0100)
-2> 2024-06-22T05:41:44.210+0100 7f1849e7c700 2
mds.0.cache Memory usage: total 588368, rss 308304, heap
207132, baseline 182556, 0 / 15149 inodes have caps, 0
caps, 0 caps per inode
-1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
In function 'void interval_set<T, C>::erase(T, T,
std::function<bool(T, T)>) [with T = inodeno_t; C =
std::map]' thread 7f1846675700 time
2024-06-22T05:41:44.643146+0100
ceph version 17.2.7
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*,
int, char const*)+0x135) [0x7f18568b64a3]
2: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
[0x7f18568b6669]
3: (interval_set<inodeno_t, std::map>::erase(inodeno_t,
inodeno_t, std::function<bool (inodeno_t,
inodeno_t)>)+0x2e5) [0x5592e5027885]
4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
9: clone()
0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 ***
Caught signal (Aborted) **
in thread 7f1846675700 thread_name:md_log_replay
ceph version 17.2.7
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*,
int, char const*)+0x18f) [0x7f18568b64fd]
5: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
[0x7f18568b6669]
6: (interval_set<inodeno_t, std::map>::erase(inodeno_t,
inodeno_t, std::function<bool (inodeno_t,
inodeno_t)>)+0x2e5) [0x5592e5027885]
7: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
12: clone()
We normally run with a relatively low debug setting, so
unfortunately I don't think many details of the initial crash
were captured; the MDS log entries before the above (i.e. "-60"
and older) are just beacon messages and _check_auth_rotating
checks.
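For future runs we could raise the MDS debug level before reproducing
the crash, e.g. something along these lines (levels are only a
suggestion):
$ ceph config set mds debug_mds 20
$ ceph config set mds debug_ms 1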
Do you have any recommendations on what actions we could
take to bring our filesystem back into a working state, short
of rebuilding the entire metadata pool? We are quite keen to
bring our backup back into service urgently, as we currently
do not have any accessible backups for our Ceph clusters.
Kindest regards,
Ivan
On 25/06/2024 19:18, Dhairya Parmar wrote:
On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx> wrote:
Hi Dhairya,
Thank you for your rapid reply. I tried recovering the
dentries for the file just before the crash I mentioned,
and then splicing those transactions out of the journal.
That seemed to resolve the issue for that inode but resulted
in the MDS crashing on the next inode in the journal when
performing replay.
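Roughly speaking, the per-inode recovery and splice look like the
following, using rank 0 and the crashing inode (the exact invocation
may differ):
$ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries --inode 1101069090357 summary
$ cephfs-journal-tool --rank=ceph_backup:0 event splice --inode 1101069090357 summary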
The MDS delegates a range of preallocated inodes (in the form of
a set, interval_set<inodeno_t> preallocated_inos) to the
clients, so it could be one untracked inode, some inodes from
the range, or in the worst-case scenario all of them, and
this is something that even `cephfs-journal-tool` cannot
tell you (since we're talking about MDS internals
which aren't exposed to such tools). That is why
you see the "MDS crashing on the next inode in the journal when
performing replay".
An option could be to expose the inode set via some tool or
asok command to identify such inode ranges, but that needs to be
discussed. For now, we're trying to address this in [0]; you
can follow the discussion there.
[0] https://tracker.ceph.com/issues/66251
Removing from the journal all the transactions involving the
directory housing the files that seemed to cause these crashes
only caused the MDS to fail to even start replay.
I've rolled back our journal to the original version
from when the crash first happened, and the entire MDS log for
the crash can be found here:
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
Awesome, this would help us a ton. Apart from this, would it
be possible to send us client logs?
Please let us know if you would like any other log files,
as we can easily induce this crash.
Since you can easily induce the crash, could you please share the
reproducer, i.e. what actions you take in order to hit it?
Kindest regards,
Ivan
On 25/06/2024 09:58, Dhairya Parmar wrote:
Hi Ivan,
This looks to be similar to the issue [0] that we're
already addressing at [1]. Basically, there is some
out-of-sync event that led the client to make use of
inodes that the MDS wasn't aware of/isn't tracking,
and hence the crash. It'd be really helpful if you could
provide us with more logs.
CC @Rishabh Dave <mailto:ridave@xxxxxxxxxx> @Venky
Shankar <mailto:vshankar@xxxxxxxxxx> @Patrick Donnelly
<mailto:pdonnell@xxxxxxxxxx> @Xiubo Li
<mailto:xiubli@xxxxxxxxxx>
[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251
--
Dhairya Parmar
Associate Software Engineer, CephFS
IBM, Inc.
On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx> wrote:
Hello,
We have been experiencing a serious issue with our
CephFS backup cluster
running quincy (version 17.2.7) on a
RHEL8-derivative Linux kernel
(Alma8.9, 4.18.0-513.9.1 kernel): the MDSes
for our filesystem are
constantly in a "replay" or "replay(laggy)" state
and keep crashing.
We have a single-MDS filesystem called
"ceph_backup" with 2 standby
MDSes, along with a 2nd, mostly unused filesystem
"ceph_archive" (this holds
little to no data). We use the "ceph_backup"
filesystem to back up our data, and this is
the one which is currently broken. The Ceph
health outputs currently are:
root@pebbles-s1 14:05 [~]: ceph -s
cluster:
id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
health: HEALTH_WARN
1 filesystem is degraded
insufficient standby MDS daemons
available
1319 pgs not deep-scrubbed in time
1054 pgs not scrubbed in time
services:
mon: 4 daemons, quorum
pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
mgr: pebbles-s2(active, since 36m),
standbys: pebbles-s4,
pebbles-s3, pebbles-s1
mds: 2/2 daemons up
osd: 1380 osds: 1380 up (since 29m), 1379
in (since 3d); 37
remapped pgs
data:
volumes: 1/2 healthy, 1 recovering
pools: 7 pools, 2177 pgs
objects: 3.55G objects, 7.0 PiB
usage: 8.9 PiB used, 14 PiB / 23 PiB avail
pgs: 83133528/30006841533 objects
misplaced (0.277%)
2090 active+clean
47 active+clean+scrubbing+deep
29 active+remapped+backfilling
8 active+remapped+backfill_wait
2 active+clean+scrubbing
1 active+clean+snaptrim
io:
recovery: 1.9 GiB/s, 719 objects/s
root@pebbles-s1 14:09 [~]: ceph fs status
ceph_backup - 0 clients
===========
RANK STATE MDS ACTIVITY DNS INOS
DIRS CAPS
0 replay(laggy) pebbles-s3
0 0 0 0
POOL TYPE USED AVAIL
mds_backup_fs metadata 1255G 2780G
ec82_primary_fs_data data 0 2780G
ec82pool data 8442T 3044T
ceph_archive - 2 clients
============
RANK STATE MDS ACTIVITY DNS INOS
DIRS CAPS
0 active pebbles-s2 Reqs: 0 /s 13.4k
7105 118 2
POOL TYPE USED AVAIL
mds_archive_fs metadata 5184M 2780G
ec83_primary_fs_data data 0 2780G
ec83pool data 138T 2767T
MDS version: ceph version 17.2.7
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy
(stable)
root@pebbles-s1 14:09 [~]: ceph health detail |
head
HEALTH_WARN 1 filesystem is degraded;
insufficient standby MDS
daemons available; 1319 pgs not deep-scrubbed
in time; 1054 pgs not
scrubbed in time
[WRN] FS_DEGRADED: 1 filesystem is degraded
fs ceph_backup is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient
standby MDS daemons
available
have 0; want 1 more
When our cluster first ran after a reboot, Ceph ran
through the 2
standby MDSes, crashing them all, until it reached
the final MDS and is
now stuck in this "replay(laggy)" state. Putting
our MDSes into
debugging mode, we can see that this MDS crashed
when replaying the
journal for a particular inode (this is the same
for all the MDSes and
they all crash on the same object):
...
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.journal
EMetaBlob.replay for [521,head] had [inode
0x1005ba89481
[...539,head]
/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
auth fragtree_t(*^2 00*^3 00000*^
4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4
00110*^4 00111*^4
01*^3 01000*^4 01001*^3 01010*^4 01011*^3
01100*^4 01101*^4 01110*^4
01111*^4 10*^3 10000*^4 10001*^4 10010*^4
10011*^4 10100*^4 10101*^3
10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22
T05:41:10.213700+0100 1281276=1281276+0) n(v12
rc2024-06-22T05:41:10.213700+0100 b1348251683896
1281277=1281276+1)
old_inodes=8 (iversion lock) | dirfrag=416
dirty=1 0x55770a2bdb80]
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.journal
EMetaBlob.replay dir 0x1005ba89481.011011000*
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.journal
EMetaBlob.replay updated dir [dir
0x1005ba89481.011011000*
/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
[2,head] auth v=436385 cv=0/0 state=107374182
4 f(v0 m2024-06-22T05:41:10.213700+0100
2502=2502+0) n(v12
rc2024-06-22T05:41:10.213700+0100 b2120744220
2502=2502+0)
hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.journal
EMetaBlob.replay added (full) [dentry
#0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti
ons_ave_Z124.mrc.teberet7.partial [539,head] auth
NULL (dversion
lock) v=436384 ino=(nil)
state=1610612800|bottomlru | dirty=1
0x557710444500]
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.journal
EMetaBlob.replay added [inode 0x1005cd4fe35
[539,head]
/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_
005006_fractions_ave_Z124.mrc.teberet7.partial auth
v436384 s=0 n(v0
1=1+0) (iversion lock)
cr={99995144=0-4194304@538} 0x557710438680]
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.journal
EMetaBlob.replay noting opened inode [inode
0x1005cd4fe35 [539,head]
/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762
6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
auth
v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
cr={99995144=0-4194304@538} | dirtyparent=1
dirty=1 0x557710438680]
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.journal
EMetaBlob.replay inotable tablev 3112837 <=
table 3112837
2024-06-24T13:44:55.563+0100 7f8811c40700 10
mds.0.journal
EMetaBlob.replay sessionmap v 1560540883, table
1560540882 prealloc
[] used 0x1005cd4fe35
2024-06-24T13:44:55.563+0100 7f8811c40700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
I
n function 'void interval_set<T, C>::erase(T, T,
std::function<bool(T, T)>) [with T = inodeno_t;
C = std::map]'
thread 7f8811c40700 time
2024-06-24T13:44:55.564315+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
568: FAILED ceph_assert(p->first <= start)
ceph version 17.2.7
(b12291d110049b2f35e32e0de30d70e9a4c060d2)
quincy (stable)
1: (ceph::__ceph_assert_fail(char const*,
char const*, int, char
const*)+0x135) [0x7f8821e814a3]
2:
/usr/lib64/ceph/libceph-common.so.2(+0x269669)
[0x7f8821e81669]
3: (interval_set<inodeno_t,
std::map>::erase(inodeno_t, inodeno_t,
std::function<bool (inodeno_t,
inodeno_t)>)+0x2e5) [0x5576f9bb2885]
4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
5: (EUpdate::replay(MDSRank*)+0x61)
[0x5576f9ebbbd1]
6: (MDLog::_replay_thread()+0x7bb)
[0x5576f9e4254b]
7: (MDLog::ReplayThread::entry()+0x11)
[0x5576f9af5041]
8: /lib64/libpthread.so.0(+0x81ca)
[0x7f8820e6f1ca]
9: clone()
I've only included a short section of the crash
(this is the first trace
in the log relating to the crash, with a 10/20
debug_mds option). We
tried deleting the 0x1005cd4fe35 object from the
object store using the
"rados" command, but this did not allow our MDS to
successfully replay.
From my understanding the journal seems okay, as we
didn't run out of
space on our metadata pool, for example, and
"cephfs-journal-tool journal
inspect" doesn't seem to think there is any damage:
root@pebbles-s1 13:58 [~]: cephfs-journal-tool
--rank=ceph_backup:0
journal inspect
Overall journal integrity: OK
root@pebbles-s1 14:04 [~]: cephfs-journal-tool
--rank=ceph_backup:0
event get --inode 1101069090357 summary
Events by type:
OPEN: 1
UPDATE: 3
Errors: 0
root@pebbles-s1 14:05 [~]: cephfs-journal-tool
--rank=ceph_backup:0
event get --inode 1101069090357 list
2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35
UPDATE: (openc)
test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE:
(check_inode_max_size)
test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:15.484871+0100 0x51f97e7344cc
OPEN: ()
FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:15.484921+0100 0x51f97e73493b
UPDATE: (rename)
test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
Does anyone have any advice for
us on how we should
proceed? We were thinking about manually
applying these events
(via "event apply"); failing that, we could
erase the problematic
events with "cephfs-journal-tool
--rank=ceph_backup:0 event splice
--inode 1101069090357". Is this a good idea? We
would rather not rebuild
the entire metadata pool if we can avoid it (once
was enough for us),
as this cluster has ~9 PB of data on it.
Kindest regards,
Ivan Clayson
--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx