Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Hi Dhairya,

Thank you ever so much for having another look at this so quickly. I don't think I have any logs similar to the ones you referenced this time, as my MDSs don't seem to enter the replay stage when they crash (or at least don't now, after I've thrown the old logs away), but those errors do crop up in the earlier logs I shared when the system first crashed.

Kindest regards,

Ivan

On 08/07/2024 14:08, Dhairya Parmar wrote:

Ugh, something went horribly wrong. I've downloaded the MDS logs that contain the assertion failure and it looks related to this [0]. Do you have client logs for this?

The other log that you shared is downloading right now; once that's done and I've gone through it, I'll update you.

[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

    Hi Dhairya,

    Sorry to resurrect this thread again, but unfortunately we still
    have an issue with our filesystem after we attempted to write new
    backups to it.

    We finished the scrub of the filesystem on Friday and ran a repair
    scrub on the 1 directory which had metadata damage. After doing so
    and rebooting, the cluster reported no issues and data was
    accessible again.

    We restarted the backups to run over the weekend and unfortunately
    the filesystem crashed again; the log of the failure is here:
    https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
    We ran the backups on kernel mounts of the filesystem without the
    nowsync option this time to avoid the out-of-sync write problems.

    I've tried resetting the journal again after recovering the
    dentries but unfortunately the filesystem is still in a failed
    state despite setting joinable to true. The log of this crash is
    here:
    https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.

    I'm not sure how to proceed as I can't seem to get any MDS to take
    over the first rank. I would like to do a scrub of the filesystem
    and preferably overwrite the troublesome files with the originals
    on the live filesystem. Do you have any advice on how to make the
    filesystem leave its failed state? I have a backup of the journal
    before I reset it so I can roll back if necessary.
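
    For reference, the rough sequence of commands I ran was (the export
    path here is just an example):

        $ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup-journal.bak
        $ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
        $ cephfs-journal-tool --rank=ceph_backup:0 journal reset
        $ ceph fs set ceph_backup joinable true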

    Here are some details about the filesystem at present:

        root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
          cluster:
            id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
            health: HEALTH_ERR
                    1 filesystem is degraded
                    1 large omap objects
                    1 filesystem is offline
                    1 mds daemon damaged
        nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
        flag(s) set
                    1750 pgs not deep-scrubbed in time
                    1612 pgs not scrubbed in time

          services:
            mon: 4 daemons, quorum
        pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
            mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1,
        pebbles-s3, pebbles-s4
            mds: 1/2 daemons up, 3 standby
            osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d);
        10 remapped pgs
                 flags
        nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

          data:
            volumes: 1/2 healthy, 1 recovering; 1 damaged
            pools:   7 pools, 2177 pgs
            objects: 3.24G objects, 6.7 PiB
            usage:   8.6 PiB used, 14 PiB / 23 PiB avail
            pgs:     11785954/27384310061 objects misplaced (0.043%)
                     2167 active+clean
                     6    active+remapped+backfilling
                     4    active+remapped+backfill_wait

        ceph_backup - 0 clients
        ===========
        RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
         0    failed
                POOL            TYPE     USED  AVAIL
           mds_backup_fs      metadata  1174G  3071G
        ec82_primary_fs_data    data       0   3071G
              ec82pool          data    8085T  4738T
        ceph_archive - 2 clients
        ============
        RANK  STATE      MDS         ACTIVITY     DNS    INOS DIRS   CAPS
         0    active  pebbles-s4  Reqs:    0 /s  13.4k  7105 118      2
                POOL            TYPE     USED  AVAIL
           mds_archive_fs     metadata  5184M  3071G
        ec83_primary_fs_data    data       0   3071G
              ec83pool          data     138T  4307T
        STANDBY MDS
         pebbles-s2
         pebbles-s3
         pebbles-s1
        MDS version: ceph version 17.2.7
        (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
        root@pebbles-s2 11:55 [~]: ceph fs dump
        e2643889
        enable_multiple, ever_enabled_multiple: 1,1
        default compat: compat={},rocompat={},incompat={1=base
        v0.20,2=client writeable ranges,3=default file layouts on
        dirs,4=dir inode in separate object,5=mds uses versioned
        encoding,6=dirfrag is stored in omap,8=no anchor table,9=file
        layout v2,10=snaprealm v2}
        legacy client fscid: 1

        Filesystem 'ceph_backup' (1)
        fs_name    ceph_backup
        epoch    2643888
        flags    12 joinable allow_snaps allow_multimds_snaps
        created    2023-05-19T12:52:36.302135+0100
        modified    2024-07-08T11:17:55.437861+0100
        tableserver    0
        root    0
        session_timeout    60
        session_autoclose    300
        max_file_size    109934182400000
        required_client_features    {}
        last_failure    0
        last_failure_osd_epoch    494515
        compat    compat={},rocompat={},incompat={1=base
        v0.20,2=client writeable ranges,3=default file layouts on
        dirs,4=dir inode in separate object,5=mds uses versioned
        encoding,6=dirfrag is stored in omap,7=mds uses inline
        data,8=no anchor table,9=file layout v2,10=snaprealm v2}
        max_mds    1
        in    0
        up    {}
        failed
        damaged    0
        stopped
        data_pools    [6,3]
        metadata_pool    2
        inline_data    disabled
        balancer
        standby_count_wanted    1


    Kindest regards,

    Ivan

    On 28/06/2024 15:17, Dhairya Parmar wrote:



    On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson
    <ivan@xxxxxxxxxxxxxxxxx> wrote:

        Hi Dhairya,

        I would be more than happy to share our corrupted journal.
        Has the host key changed for drop.ceph.com? The fingerprint I'm
        being sent is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4, which
        is different to the one in our
        /usr/share/ceph/known_hosts_drop.ceph.com.

    Ah, strange. Let me get in touch with folks who might know about
    this; I'll get back to you ASAP.

        Thank you for your advice as well. We've reset our MDS'
        journal and are currently in the process of a full filesystem
        scrub which understandably is taking quite a bit of time but
        seems to be progressing through the objects fine.

    YAY!

        Thank you ever so much for all your help and please do feel
        free to follow up with us if you would like any further
        details about our crash!

    Glad to hear it went well. This bug is being worked on with high
    priority, and once the patch is ready it will be backported.

    The root cause of this issue is the `nowsync` (async dirops) mount
    option being enabled by default in the kclient [0]. This feature
    allows asynchronous creation and deletion of files, optimizing
    performance by avoiding round-trip latency for these system calls.
    However, in very rare cases (like yours :D) it can affect the
    system's consistency and stability. If this kind of optimization is
    not a priority for your workload, I recommend turning it off by
    switching the mount points to `wsync` and also setting the MDS
    config `mds_client_delegate_inos_pct` to `0`, so that you don't end
    up in this situation again (until the bug fix arrives :)).

    [0]
    https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
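
    For example (illustrative only; adjust the monitor address, mount
    point and client name to your environment), the remount and the MDS
    setting would look something like:

        $ mount -t ceph <mon_host>:/ /mnt/cephfs -o name=<client>,wsync
        $ ceph config set mds mds_client_delegate_inos_pct 0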

        Kindest regards,

        Ivan

        On 27/06/2024 12:39, Dhairya Parmar wrote:

        Hi Ivan,

        The solution (which has been successful for us in the past)
        is to reset the journal. This would bring the fs back online
        and return the MDSes to a stable state, but some data would
        be lost—the data in the journal that hasn't been flushed to
        the backing store would be gone. Therefore, you should try
        to flush out as much journal data as possible before
        resetting the journal.

        Here are the steps for this entire process:

        1) Bring the FS offline
        $ ceph fs fail <fs_name>

        2) Recover dentries from journal (run it with every MDS Rank)
        $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event
        recover_dentries summary

        3) Reset the journal (again with every MDS Rank)
        $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset

        4) Bring the FS online
        $ ceph fs set <fs_name> joinable true

        5) Restart the MDSes

        6) Perform scrub to ensure consistency of fs
        $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
        # you could try a recursive scrub maybe `ceph tell
        mds.<fs_name>:0 scrub start / recursive`

        Some important notes to keep in mind:
        * Recovering dentries will take time (generally, rank 0 is
        the most time-consuming, but the rest should be quick).
        * cephfs-journal-tool and the metadata OSDs are bound to use a
        significant amount of CPU. This is because cephfs-journal-tool
        has to read through the journal data and flush it out to the
        backing store, which drives a burst of metadata operations and
        therefore a significant CPU percentage on the OSDs.
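
        If it helps, the scrub from step 6 can be monitored with
        something like:

        $ ceph tell mds.<fs_name>:0 scrub status
        $ ceph tell mds.<fs_name>:0 damage ls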

        Do let me know how this goes.

        On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson
        <ivan@xxxxxxxxxxxxxxxxx> wrote:

            Hi Dhairya,

            We can induce the crash by simply restarting the MDS, and
            the crash seems to happen when an MDS goes from up:standby
            to up:replay. The MDS works through a few files in the log
            before eventually crashing; I've included the logs for this
            here (this is after I imported the backed-up journal, which
            I hope was successful, but please let me know if you
            suspect it wasn't!):
            https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

            With respect to the client logs, are you referring to the
            clients that are writing to the filesystem? We don't
            typically run them in any sort of debug mode, and we have
            quite a few machines running our backup system, but we can
            look from an hour or so before the first MDS crash (though
            I don't know if this is when the de-sync occurred). Here
            are some MDS logs regarding the initial crash on Saturday
            morning, which may be helpful:

                   -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
                monclient: tick
                   -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
                monclient: _check_auth_rotating have uptodate
                secrets (they expire after
                2024-06-22T05:41:13.091556+0100)
                   -57> 2024-06-22T05:41:43.208+0100 7f184de84700  1
                mds.pebbles-s2 Updating MDS map to version 2529650
                from mon.3
                   -56> 2024-06-22T05:41:43.208+0100 7f184de84700  4
                mds.0.purge_queue operator():  data pool 6 not found
                in OSDMap
                   -55> 2024-06-22T05:41:43.208+0100 7f184de84700  4
                mds.0.purge_queue operator():  data pool 3 not found
                in OSDMap
                   -54> 2024-06-22T05:41:43.209+0100 7f184de84700  5
                asok(0x5592e7968000) register_command
                objecter_requests hook 0x5592e78f8800
                   -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10
                monclient: _renew_subs
                   -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10
                monclient: _send_mon_message to mon.pebbles-s4 at
                v2:10.1.5.134:3300/0 <http://10.1.5.134:3300/0>
                   -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10
                log_channel(cluster) update_config to_monitors: true
                to_syslog: false syslog_facility: prio: info
                to_graylog: false graylog_host: 127.0.0.1
                graylog_port: 12201)
                   -50> 2024-06-22T05:41:43.209+0100 7f184de84700  4
                mds.0.purge_queue operator():  data pool 6 not found
                in OSDMap
                   -49> 2024-06-22T05:41:43.209+0100 7f184de84700  4
                mds.0.purge_queue operator():  data pool 3 not found
                in OSDMap
                   -48> 2024-06-22T05:41:43.209+0100 7f184de84700  4
                mds.0.0 apply_blocklist: killed 0, blocklisted
                sessions (0 blocklist entries, 0)
                   -47> 2024-06-22T05:41:43.209+0100 7f184de84700  1
                mds.0.2529650 handle_mds_map i am now mds.0.2529650
                   -46> 2024-06-22T05:41:43.209+0100 7f184de84700  1
                mds.0.2529650 handle_mds_map state change up:standby
                --> up:replay
                   -45> 2024-06-22T05:41:43.209+0100 7f184de84700  5
                mds.beacon.pebbles-s2 set_want_state: up:standby ->
                up:replay
                   -44> 2024-06-22T05:41:43.209+0100 7f184de84700  1
                mds.0.2529650 replay_start
                   -43> 2024-06-22T05:41:43.209+0100 7f184de84700  1
                mds.0.2529650  waiting for osdmap 473739 (which
                blocklists prior instance)
                   -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10
                monclient: _send_mon_message to mon.pebbles-s4 at
                v2:10.1.5.134:3300/0 <http://10.1.5.134:3300/0>
                   -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700  2
                mds.0.cache Memory usage:  total 299012, rss 37624,
                heap 182556, baseline 182556, 0 / 0 inodes have
                caps, 0 caps, 0 caps per inode
                   -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10
                monclient: _renew_subs
                   -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10
                monclient: _send_mon_message to mon.pebbles-s4 at
                v2:10.1.5.134:3300/0 <http://10.1.5.134:3300/0>
                   -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10
                monclient: handle_get_version_reply finishing 1
                version 473739
                   -37> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
                mds.0.2529650 Booting: 0: opening inotable
                   -36> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
                mds.0.2529650 Booting: 0: opening sessionmap
                   -35> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
                mds.0.2529650 Booting: 0: opening mds log
                   -34> 2024-06-22T05:41:43.224+0100 7f1847e78700  5
                mds.0.log open discovering log bounds
                   -33> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
                mds.0.2529650 Booting: 0: opening purge queue (async)
                   -32> 2024-06-22T05:41:43.224+0100 7f1847e78700  4
                mds.0.purge_queue open: opening
                   -31> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
                mds.0.journaler.pq(ro) recover start
                   -30> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
                mds.0.journaler.pq(ro) read_head
                   -29> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
                mds.0.2529650 Booting: 0: loading open file table
                (async)
                   -28> 2024-06-22T05:41:43.224+0100 7f1847e78700  2
                mds.0.2529650 Booting: 0: opening snap table
                   -27> 2024-06-22T05:41:43.224+0100 7f1847677700  4
                mds.0.journalpointer Reading journal pointer
                '400.00000000'
                   -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10
                monclient: get_auth_request con 0x5592e8987000
                auth_method 0
                   -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10
                monclient: get_auth_request con 0x5592e8987c00
                auth_method 0
                   -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1
                mds.0.journaler.pq(ro) _finish_read_head
                loghead(trim 231160676352, expire 231163662875,
                write 231163662875, stream_format 1). probing for
                end of log (from 231163662875)...
                   -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1
                mds.0.journaler.pq(ro) probing for end of the log
                   -22> 2024-06-22T05:41:43.252+0100 7f1847677700  1
                mds.0.journaler.mdlog(ro) recover start
                   -21> 2024-06-22T05:41:43.252+0100 7f1847677700  1
                mds.0.journaler.mdlog(ro) read_head
                   -20> 2024-06-22T05:41:43.252+0100 7f1847677700  4
                mds.0.log Waiting for journal 0x200 to recover...
                   -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10
                monclient: get_auth_request con 0x5592e8bc6000
                auth_method 0
                   -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10
                monclient: get_auth_request con 0x5592e8bc6800
                auth_method 0
                   -17> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
                mds.0.journaler.mdlog(ro) _finish_read_head
                loghead(trim 90131453181952, expire 90131465778558,
                write 90132009715463, stream_format 1).  probing for
                end of log (from 90132009715463)...
                   -16> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
                mds.0.journaler.mdlog(ro) probing for end of the log
                   -15> 2024-06-22T05:41:43.257+0100 7f1847e78700  1
                mds.0.journaler.mdlog(ro) _finish_probe_end
                write_pos = 90132019384791 (header had
                90132009715463). recovered.
                   -14> 2024-06-22T05:41:43.257+0100 7f1847677700  4
                mds.0.log Journal 0x200 recovered.
                   -13> 2024-06-22T05:41:43.257+0100 7f1847677700  4
                mds.0.log Recovered journal 0x200 in format 1
                   -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1
                mds.0.journaler.pq(ro) _finish_probe_end write_pos =
                231163662875 (header had 231163662875). recovered.
                   -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700  4
                mds.0.purge_queue operator(): open complete
                   -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1
                mds.0.journaler.pq(ro) set_writeable
                    -9> 2024-06-22T05:41:43.441+0100 7f1847e78700  2
                mds.0.2529650 Booting: 1: loading/discovering base
                inodes
                    -8> 2024-06-22T05:41:43.441+0100 7f1847e78700  0
                mds.0.cache creating system inode with ino:0x100
                    -7> 2024-06-22T05:41:43.442+0100 7f1847e78700  0
                mds.0.cache creating system inode with ino:0x1
                    -6> 2024-06-22T05:41:43.442+0100 7f1847e78700  2
                mds.0.2529650 Booting: 2: replaying mds log
                    -5> 2024-06-22T05:41:43.442+0100 7f1847e78700  2
                mds.0.2529650 Booting: 2: waiting for purge queue
                recovered
                    -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10
                monclient: tick
                    -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10
                monclient: _check_auth_rotating have uptodate
                secrets (they expire after
                2024-06-22T05:41:14.091638+0100)
                    -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700  2
                mds.0.cache Memory usage:  total 588368, rss 308304,
                heap 207132, baseline 182556, 0 / 15149 inodes have
                caps, 0 caps, 0 caps per inode
                    -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1
                /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                In function 'void interval_set<T, C>::erase(T, T,
                std::function<bool(T, T)>) [with T = inodeno_t; C =
                std::map]' thread 7f1846675700 time
                2024-06-22T05:41:44.643146+0100

                 ceph version 17.2.7
                (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy
                (stable)
                 1: (ceph::__ceph_assert_fail(char const*, char
                const*, int, char const*)+0x135) [0x7f18568b64a3]
                 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
                [0x7f18568b6669]
                 3: (interval_set<inodeno_t,
                std::map>::erase(inodeno_t, inodeno_t,
                std::function<bool (inodeno_t, inodeno_t)>)+0x2e5)
                [0x5592e5027885]
                 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
                MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
                 5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
                 6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
                 7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
                 8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
                 9: clone()

                     0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1
                *** Caught signal (Aborted) **
                 in thread 7f1846675700 thread_name:md_log_replay

                 ceph version 17.2.7
                (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy
                (stable)
                 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0]
                 2: gsignal()
                 3: abort()
                 4: (ceph::__ceph_assert_fail(char const*, char
                const*, int, char const*)+0x18f) [0x7f18568b64fd]
                 5: /usr/lib64/ceph/libceph-common.so.2(+0x269669)
                [0x7f18568b6669]
                 6: (interval_set<inodeno_t,
                std::map>::erase(inodeno_t, inodeno_t,
                std::function<bool (inodeno_t, inodeno_t)>)+0x2e5)
                [0x5592e5027885]
                 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
                MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
                 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
                 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
                 10: (MDLog::ReplayThread::entry()+0x11)
                [0x5592e4f6a041]
                 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
                 12: clone()

            We normally run with a relatively low debug setting, so
            unfortunately I don't think many details of the initial
            crash were captured; the MDS logs before the above (i.e.
            "-60" and older) are just beacon messages and
            _check_auth_rotating checks.

            I was wondering whether you have any recommendations for
            what actions we could take to bring our filesystem back
            into a working state, short of rebuilding the entire
            metadata pool? We are keen to bring our backup system back
            into service urgently, as we currently do not have any
            accessible backups for our Ceph clusters.

            Kindest regards,

            Ivan

            On 25/06/2024 19:18, Dhairya Parmar wrote:



            On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson
            <ivan@xxxxxxxxxxxxxxxxx> wrote:

                Hi Dhairya,

                Thank you for your rapid reply. I tried recovering the
                dentries for the file just before the crash I mentioned
                and then splicing those transactions from the journal,
                which seemed to resolve the issue for that inode but
                resulted in the MDS crashing on the next inode in the
                journal when performing replay.

            The MDS delegates a range of preallocated inodes (in the
            form of a set, interval_set<inodeno_t> preallocated_inos)
            to the clients, so the untracked usage could involve a
            single inode, some of the inodes in the range, or in the
            worst case all of them, and this is something that even
            `cephfs-journal-tool` cannot tell you (since we're talking
            about MDS internals which aren't exposed to such tools).
            That is why you see the MDS crashing on the next inode in
            the journal when performing replay.

            An option could be to expose the inode set via some tool or
            asok command to identify such inode ranges, but that needs
            to be discussed. For now, we're trying to address this in
            [0]; you can follow the discussion there.

            [0] https://tracker.ceph.com/issues/66251

                Removing from the journal all the transactions
                involving the directory housing the files that seemed
                to cause these crashes only caused the MDS to fail to
                even start replay.

                I've rolled back our journal to our original
                version when the crash first happened and the
                entire MDS log for the crash can be found here:
                https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24

            Awesome, this would help us a ton. Apart from this,
            would it be possible to send us client logs?

                Please let us know if you would like any other logs
                file as we can easily induce this crash.

            Since you can easily induce the crash, could you please
            share the reproducer, i.e. the exact actions you take in
            order to hit this?

                Kindest regards,

                Ivan

                On 25/06/2024 09:58, Dhairya Parmar wrote:

                Hi Ivan,

                This looks similar to the issue [0] that we're already
                addressing at [1]. Basically, some out-of-sync event
                led the client to make use of inodes that the MDS
                wasn't aware of / isn't tracking, hence the crash. It'd
                be really helpful if you could provide us with more
                logs.
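
                In case it helps when capturing those, MDS debug
                logging can usually be raised with something like:

                $ ceph config set mds debug_mds 20
                $ ceph config set mds debug_ms 1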

                CC @Rishabh Dave <mailto:ridave@xxxxxxxxxx> @Venky
                Shankar <mailto:vshankar@xxxxxxxxxx> @Patrick
                Donnelly <mailto:pdonnell@xxxxxxxxxx> @Xiubo Li
                <mailto:xiubli@xxxxxxxxxx>

                [0] https://tracker.ceph.com/issues/61009
                [1] https://tracker.ceph.com/issues/66251
                --
                Dhairya Parmar
                Associate Software Engineer, CephFS
                IBM, Inc. <https://www.redhat.com/>

                On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson
                <ivan@xxxxxxxxxxxxxxxxx> wrote:

                    Hello,

                    We have been experiencing a serious issue with
                    our CephFS backup cluster
                    running quincy (version 17.2.7) on a
                    RHEL8-derivative Linux kernel
                    (Alma8.9, 4.18.0-513.9.1 kernel) where our
                    MDSes for our filesystem are
                    constantly in a "replay" or "replay(laggy)"
                    state and keep crashing.

                    We have a single-MDS filesystem called
                    "ceph_backup" with 2 standby MDSes, along with a
                    second, largely unused filesystem "ceph_archive"
                    (this holds little to no data). We use the
                    "ceph_backup" filesystem to back up our data, and
                    this is the one which is currently broken. The Ceph
                    health outputs currently are:

                        root@pebbles-s1 14:05 [~]: ceph -s
                          cluster:
                            id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
                            health: HEALTH_WARN
                                    1 filesystem is degraded
                                    insufficient standby MDS daemons available
                                    1319 pgs not deep-scrubbed in time
                                    1054 pgs not scrubbed in time

                          services:
                            mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
                            mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
                            mds: 2/2 daemons up
                            osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs

                          data:
                            volumes: 1/2 healthy, 1 recovering
                            pools:   7 pools, 2177 pgs
                            objects: 3.55G objects, 7.0 PiB
                            usage:   8.9 PiB used, 14 PiB / 23 PiB avail
                            pgs:     83133528/30006841533 objects misplaced (0.277%)
                                     2090 active+clean
                                     47   active+clean+scrubbing+deep
                                     29   active+remapped+backfilling
                                     8    active+remapped+backfill_wait
                                     2    active+clean+scrubbing
                                     1    active+clean+snaptrim

                          io:
                            recovery: 1.9 GiB/s, 719 objects/s

                        root@pebbles-s1 14:09 [~]: ceph fs status
                        ceph_backup - 0 clients
                        ===========
                        RANK      STATE         MDS       ACTIVITY  DNS  INOS  DIRS  CAPS
                         0    replay(laggy)  pebbles-s3              0     0     0     0
                                POOL            TYPE     USED  AVAIL
                           mds_backup_fs      metadata  1255G  2780G
                        ec82_primary_fs_data    data       0   2780G
                              ec82pool          data    8442T  3044T
                        ceph_archive - 2 clients
                        ============
                        RANK  STATE      MDS        ACTIVITY     DNS    INOS  DIRS  CAPS
                         0    active  pebbles-s2  Reqs:    0 /s  13.4k  7105   118     2
                                POOL            TYPE     USED  AVAIL
                           mds_archive_fs     metadata  5184M  2780G
                        ec83_primary_fs_data    data       0   2780G
                              ec83pool          data     138T  2767T
                        MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

                        root@pebbles-s1 14:09 [~]: ceph health detail | head
                        HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not scrubbed in time
                        [WRN] FS_DEGRADED: 1 filesystem is degraded
                            fs ceph_backup is degraded
                        [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
                            have 0; want 1 more

                    When our cluster first came back up after a reboot,
                    Ceph ran through the 2 standby MDSes, crashing them
                    all, until it reached the final MDS, which is now
                    stuck in this "replay(laggy)" state. Putting our
                    MDSes into debugging mode, we can see that the MDS
                    crashed when replaying the journal for a particular
                    inode (this is the same for all the MDSes and they
                    all crash on the same object):

                        ...
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.journal
                        EMetaBlob.replay for [521,head] had [inode
                    0x1005ba89481
                        [...539,head]
                    /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
                        auth fragtree_t(*^2 00*^3 00000*^
                        4 00001*^3 00010*^4 00011*^4 00100*^4
                    00101*^4 00110*^4 00111*^4
                        01*^3 01000*^4 01001*^3 01010*^4 01011*^3
                    01100*^4 01101*^4 01110*^4
                        01111*^4 10*^3 10000*^4 10001*^4 10010*^4
                    10011*^4 10100*^4 10101*^3
                        10110*^4 10111*^4 11*^6) v10880645 f(v0
                    m2024-06-22
                        T05:41:10.213700+0100 1281276=1281276+0) n(v12
                    rc2024-06-22T05:41:10.213700+0100
                    b1348251683896 1281277=1281276+1)
                        old_inodes=8 (iversion lock) | dirfrag=416
                    dirty=1 0x55770a2bdb80]
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.journal
                        EMetaBlob.replay dir 0x1005ba89481.011011000*
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.journal
                        EMetaBlob.replay updated dir [dir
                    0x1005ba89481.011011000*
                    /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
                        [2,head] auth v=436385 cv=0/0 state=107374182
                        4 f(v0 m2024-06-22T05:41:10.213700+0100
                    2502=2502+0) n(v12
                    rc2024-06-22T05:41:10.213700+0100 b2120744220
                    2502=2502+0)
                        hs=32+33,ss=0+0 dirty=65 | child=1
                    0x55770ebcda80]
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.journal
                        EMetaBlob.replay added (full) [dentry
                    #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti
                    ons_ave_Z124.mrc.teberet7.partial [539,head]
                    auth NULL (dversion
                        lock) v=436384 ino=(nil)
                    state=1610612800|bottomlru | dirty=1
                        0x557710444500]
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.journal
                        EMetaBlob.replay added [inode
                    0x1005cd4fe35 [539,head]
                    /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_
                    005006_fractions_ave_Z124.mrc.teberet7.partial
                    auth v436384 s=0 n(v0
                        1=1+0) (iversion lock)
                    cr={99995144=0-4194304@538} 0x557710438680]
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.journal
                        EMetaBlob.replay noting opened inode
                    [inode 0x1005cd4fe35 [539,head]
                    /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762
                    6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                    auth
                        v436384 DIRTYPARENT s=0 n(v0 1=1+0)
                    (iversion lock)
                    cr={99995144=0-4194304@538} | dirtyparent=1
                    dirty=1 0x557710438680]
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.journal
                        EMetaBlob.replay inotable tablev 3112837
                    <= table 3112837
                    2024-06-24T13:44:55.563+0100 7f8811c40700 10
                    mds.0.journal
                        EMetaBlob.replay sessionmap v 1560540883,
                    table 1560540882 prealloc
                        [] used 0x1005cd4fe35
                    2024-06-24T13:44:55.563+0100 7f8811c40700 -1
                    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                        I
                        n function 'void interval_set<T,
                    C>::erase(T, T,
                    std::function<bool(T, T)>) [with T =
                    inodeno_t; C = std::map]'
                        thread 7f8811c40700 time
                    2024-06-24T13:44:55.564315+0100
                    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                        568: FAILED ceph_assert(p->first <= start)

                          ceph version 17.2.7
                    (b12291d110049b2f35e32e0de30d70e9a4c060d2)
                        quincy (stable)
                          1: (ceph::__ceph_assert_fail(char
                    const*, char const*, int, char
                        const*)+0x135) [0x7f8821e814a3]
                          2:
                    /usr/lib64/ceph/libceph-common.so.2(+0x269669)
                    [0x7f8821e81669]
                          3: (interval_set<inodeno_t,
                    std::map>::erase(inodeno_t, inodeno_t,
                        std::function<bool (inodeno_t,
                    inodeno_t)>)+0x2e5) [0x5576f9bb2885]
                          4: (EMetaBlob::replay(MDSRank*,
                    LogSegment*, int,
                        MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
                          5: (EUpdate::replay(MDSRank*)+0x61)
                    [0x5576f9ebbbd1]
                          6: (MDLog::_replay_thread()+0x7bb)
                    [0x5576f9e4254b]
                          7: (MDLog::ReplayThread::entry()+0x11)
                    [0x5576f9af5041]
                          8: /lib64/libpthread.so.0(+0x81ca)
                    [0x7f8820e6f1ca]
                          9: clone()

                    I've only included a short section of the crash
                    (this is the first trace of the crash in the log,
                    captured with a 10/20 debug_mds setting). We tried
                    deleting the 0x1005cd4fe35 object from the object
                    store using the "rados" command, but this did not
                    allow our MDS to successfully replay.

                    From my understanding the journal itself seems
                    okay, as we didn't, for example, run out of space
                    on our metadata pool, and "cephfs-journal-tool
                    journal inspect" doesn't report any damage:

                        root@pebbles-s1 13:58 [~]:
                    cephfs-journal-tool --rank=ceph_backup:0
                        journal inspect
                        Overall journal integrity: OK
                        root@pebbles-s1 14:04 [~]:
                    cephfs-journal-tool --rank=ceph_backup:0
                        event get --inode 1101069090357 summary
                        Events by type:
                           OPEN: 1
                           UPDATE: 3
                        Errors: 0
                        root@pebbles-s1 14:05 [~]:
                    cephfs-journal-tool --rank=ceph_backup:0
                        event get --inode 1101069090357 list
                    2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35
                    UPDATE: (openc)
                    test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                    2024-06-22T05:41:11.203312+0100 0x51f97d59c848
                    UPDATE:
                        (check_inode_max_size)
                    test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                    test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                    2024-06-22T05:41:15.484871+0100 0x51f97e7344cc
                    OPEN:  ()
                    FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                    2024-06-22T05:41:15.484921+0100 0x51f97e73493b
                    UPDATE: (rename)
                    test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                    test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc

                    I was wondering whether anyone had any advice on
                    how we should proceed? We were thinking about
                    manually applying these events (via "event apply")
                    or, failing that, erasing this problematic event
                    with "cephfs-journal-tool --rank=ceph_backup:0
                    event splice --inode 1101069090357". Is this a good
                    idea? We would rather not rebuild the entire
                    metadata pool if we can avoid it (once was enough
                    for us), as this cluster has ~9 PB of data on it.

                    Kindest regards,

                    Ivan Clayson


--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



