Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> · Tue, 9 Jul 2024 11:14:09 +0100

Hi Dhairya,

I would be more than happy to try and give as many details as possible 
but the slack channel is private and requires my email to have an 
account/ access to it.

Wouldn't taking the discussion about this error to a private channel 
also stop other users who experience this error from learning about how 
and why this happened as  well as possibly not be able to view the 
solution? Would it not be possible to discuss this more publicly for the 
benefit of the other users on the mailing list?

Kindest regards,

Ivan

On 09/07/2024 10:44, Dhairya Parmar wrote:
CAUTION: This email originated from outside of the LMB:
*.-dparmar@xxxxxxxxxx-.*
Do not click links or open attachments unless you recognize the sender 
and know the content is safe.
If you think this is a phishing email, please forward it to 
phishing@xxxxxxxxxxxxxxxxx

--

Hey Ivan,

This is a relatively new MDS crash, so this would require some 
investigation but I was instructed to recommend disaster-recovery 
steps [0] (except session reset) to you to get the FS up again.

This crash is being discussed on upstream CephFS slack channel [1] 
with @Venky Shankar <mailto:vshankar@xxxxxxxxxx> and other CephFS 
devs. I'd encourage you to join the conversation, we can discuss this 
in detail and maybe go through the incident step by step which should 
help analyse the crash better.

[0] 
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
[1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> 
wrote:

    Hi Dhairya,

    Thank you ever so much for having another look at this so quickly.
    I don't think I have any logs similar to the ones you referenced
    this time as my MDSs don't seem to enter the replay stage when
    they crash (or at least don't now after I've thrown the logs away)
    but those errors do crop up in the prior logs I shared when the
    system first crashed.

    Kindest regards,

    Ivan

    On 08/07/2024 14:08, Dhairya Parmar wrote:
    CAUTION: This email originated from outside of the LMB:
    *.-dparmar@xxxxxxxxxx-.*
    Do not click links or open attachments unless you recognize the
    sender and know the content is safe.
    If you think this is a phishing email, please forward it to
    phishing@xxxxxxxxxxxxxxxxx

    --

    Ugh, something went horribly wrong. I've downloaded the MDS logs
    that contain assertion failure and it looks relevant to this [0].
    Do you have client logs for this?

    The other log that you shared is being downloaded right now, once
    that's done and I'm done going through it, I'll update you.

    [0] https://tracker.ceph.com/issues/54546

    On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
    <ivan@xxxxxxxxxxxxxxxxx> wrote:

        Hi Dhairya,

        Sorry to resurrect this thread again, but we still
        unfortunately have an issue with our filesystem after we
        attempted to write new backups to it.

        We finished the scrub of the filesystem on Friday and ran a
        repair scrub on the 1 directory which had metadata damage.
        After doing so and rebooting, the cluster reported no issues
        and data was accessible again.

        We re-started the backups to run over the weekend and
        unfortunately the filesystem crashed again where the log of
        the failure is here:
        https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
        We ran the backups on kernel mounts of the filesystem without
        the nowsync option this time to avoid the out-of-sync write
        problems..

        I've tried resetting the journal again after recovering the
        dentries but unfortunately the filesystem is still in a
        failed state despite setting joinable to true. The log of
        this crash is here:
        https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.

        I'm not sure how to proceed as I can't seem to get any MDS to
        take over the first rank. I would like to do a scrub of the
        filesystem and preferably overwrite the troublesome files
        with the originals on the live filesystem. Do you have any
        advice on how to make the filesystem leave its failed state?
        I have a backup of the journal before I reset it so I can
        roll back if necessary.

        Here are some details about the filesystem at present:

            root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
              cluster:
                id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
                health: HEALTH_ERR
                        1 filesystem is degraded
                        1 large omap objects
                        1 filesystem is offline
                        1 mds daemon damaged
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
            flag(s) set
                        1750 pgs not deep-scrubbed in time
                        1612 pgs not scrubbed in time

              services:
                mon: 4 daemons, quorum
            pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
                mgr: pebbles-s2(active, since 77m), standbys:
            pebbles-s1, pebbles-s3, pebbles-s4
                mds: 1/2 daemons up, 3 standby
                osd: 1380 osds: 1380 up (since 76m), 1379 in (since
            10d); 10 remapped pgs
                     flags
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

              data:
                volumes: 1/2 healthy, 1 recovering; 1 damaged
                pools:   7 pools, 2177 pgs
                objects: 3.24G objects, 6.7 PiB
                usage:   8.6 PiB used, 14 PiB / 23 PiB avail
                pgs:     11785954/27384310061 objects misplaced (0.043%)
                         2167 active+clean
                         6    active+remapped+backfilling
                         4    active+remapped+backfill_wait

            ceph_backup - 0 clients
            ===========
            RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS CAPS
             0 failed
                    POOL            TYPE     USED  AVAIL
               mds_backup_fs      metadata  1174G  3071G
            ec82_primary_fs_data    data       0   3071G
                  ec82pool          data    8085T  4738T
            ceph_archive - 2 clients
            ============
            RANK  STATE      MDS         ACTIVITY     DNS INOS  
            DIRS   CAPS
             0    active  pebbles-s4  Reqs:    0 /s  13.4k 7105   
            118      2
                    POOL            TYPE     USED  AVAIL
               mds_archive_fs     metadata  5184M  3071G
            ec83_primary_fs_data    data       0   3071G
                  ec83pool          data     138T  4307T
            STANDBY MDS
             pebbles-s2
             pebbles-s3
             pebbles-s1
            MDS version: ceph version 17.2.7
            (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
            root@pebbles-s2 11:55 [~]: ceph fs dump
            e2643889
            enable_multiple, ever_enabled_multiple: 1,1
            default compat: compat={},rocompat={},incompat={1=base
            v0.20,2=client writeable ranges,3=default file layouts on
            dirs,4=dir inode in separate object,5=mds uses versioned
            encoding,6=dirfrag is stored in omap,8=no anchor
            table,9=file layout v2,10=snaprealm v2}
            legacy client fscid: 1

            Filesystem 'ceph_backup' (1)
            fs_name    ceph_backup
            epoch    2643888
            flags    12 joinable allow_snaps allow_multimds_snaps
            created    2023-05-19T12:52:36.302135+0100
            modified    2024-07-08T11:17:55.437861+0100
            tableserver    0
            root    0
            session_timeout    60
            session_autoclose    300
            max_file_size    109934182400000
            required_client_features    {}
            last_failure    0
            last_failure_osd_epoch    494515
            compat    compat={},rocompat={},incompat={1=base
            v0.20,2=client writeable ranges,3=default file layouts on
            dirs,4=dir inode in separate object,5=mds uses versioned
            encoding,6=dirfrag is stored in omap,7=mds uses inline
            data,8=no anchor table,9=file layout v2,10=snaprealm v2}
            max_mds    1
            in    0
            up    {}
            failed
            damaged    0
            stopped
            data_pools    [6,3]
            metadata_pool    2
            inline_data    disabled
            balancer
            standby_count_wanted    1

        Kindest regards,

        Ivan

        On 28/06/2024 15:17, Dhairya Parmar wrote:
        CAUTION: This email originated from outside of the LMB:
        *.-dparmar@xxxxxxxxxx-.*
        Do not click links or open attachments unless you recognize
        the sender and know the content is safe.
        If you think this is a phishing email, please forward it to
        phishing@xxxxxxxxxxxxxxxxx

        --

        On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson
        <ivan@xxxxxxxxxxxxxxxxx> wrote:

            Hi Dhairya,

            I would be more than happy to share our corrupted
            journal. Has the host key changed for drop.ceph.com
            <http://drop.ceph.com>? The fingerprint I'm being sent
            is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is
            different to the one in our
            /usr/share/ceph/known_hosts_drop.ceph.com
            <http://known_hosts_drop.ceph.com>.

        Ah, strange. Let me get in touch with folks who might know
        about this, will revert back to you ASAP

            Thank you for your advice as well. We've reset our MDS'
            journal and are currently in the process of a full
            filesystem scrub which understandably is taking quite a
            bit of time but seems to be progressing through the
            objects fine.

        YAY!

            Thank you ever so much for all your help and please do
            feel free to follow up with us if you would like any
            further details about our crash!

        Glad to hear it went well, this bug is being worked on with
        high priority and once the patch is ready, it will be
        backported.

        The root cause of this issue is the `nowsync` (async dirops)
        being enabled by default with kclient [0]. This feature
        allows asynchronous creation and deletion of files,
        optimizing performance by avoiding round-trip latency for
        these system calls. However, in very rare cases (like yours
        :D), it can affect the system's consistency and stability
        hence if this kind of optimization is not a priority for
        your workload, I recommend turning it off by switching
        the mount points to `wsync` and also set the MDS config
        `mds_client_delegate_inos_pct` to `0` so that you don't end
        up in this situation again (until the bug fix arrives :)).

        [0]
        https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902

            Kindest regards,

            Ivan

            On 27/06/2024 12:39, Dhairya Parmar wrote:
            CAUTION: This email originated from outside of the LMB:
            *.-dparmar@xxxxxxxxxx-.*
            Do not click links or open attachments unless you
            recognize the sender and know the content is safe.
            If you think this is a phishing email, please forward
            it to phishing@xxxxxxxxxxxxxxxxx

            --

            Hi Ivan,

            The solution (which has been successful for us in the
            past) is to reset the journal. This would bring the fs
            back online and return the MDSes to a stable state, but
            some data would be lost—the data in the journal that
            hasn't been flushed to the backing store would be gone.
            Therefore, you should try to flush out as much journal
            data as possible before resetting the journal.

            Here are the steps for this entire process:

            1) Bring the FS offline
            $ ceph fs fail <fs_name>

            2) Recover dentries from journal (run it with every MDS
            Rank)
            $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event
            recover_dentries summary

            3) Reset the journal (again with every MDS Rank)
            $ cephfs-journal-tool --rank=<fs_name>:<mds-rank>
            journal reset

            4) Bring the FS online
            $ cephfs fs set <fs_name> joinable true

            5) Restart the MDSes

            6) Perform scrub to ensure consistency of fs
            $ ceph tell mds.<fs_name>:0 scrub start <path>
            [scrubopts] [tag]
            # you could try a recursive scrub maybe `ceph tell
            mds.<fs_name>:0 scrub start / recursive`

            Some important notes to keep in mind:
            * Recovering dentries will take time (generally, rank 0
            is the most time-consuming, but the rest should be quick).
            * cephfs-journal-tool and metadata OSDs are bound to
            use a significant CPU percentage. This is because
            cephfs-journal-tool has to swig the journal data and
            flush it out to the backing store, which also makes the
            metadata operations go rampant, resulting in OSDs
            taking a significant percentage of CPU.

            Do let me know how this goes.

            On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson
            <ivan@xxxxxxxxxxxxxxxxx> wrote:

                Hi Dhairya,

                We can induce the crash by simply restarting the
                MDS and the crash seems to happen when an MDS goes
                from up:standby to up:replay. The MDS works through
                a few files in the log before eventually crashing
                where I've included the logs for this here (this is
                after I imported the backed up journal which I hope
                was successful but please let me know if you
                suspect it wasn't!):
                https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

                With respect to the client logs, are you referring
                to the clients who are writing to the filesystem?
                We don't typically run them in any sort of debug
                mode and we have quite a few machines running our
                backup system but we can look an hour or so before
                the first MDS crash (though I don't know if this is
                when the de-sync occurred). Here are some MDS logs
                with regards to the initial crash on Saturday
                morning though which may be helpful:

                       -59> 2024-06-22T05:41:43.090+0100
                    7f184ce82700 10 monclient: tick
                       -58> 2024-06-22T05:41:43.090+0100
                    7f184ce82700 10 monclient: _check_auth_rotating
                    have uptodate secrets (they expire after
                    2024-06-22T05:41:13.091556+0100)
                       -57> 2024-06-22T05:41:43.208+0100
                    7f184de84700  1 mds.pebbles-s2 Updating MDS map
                    to version 2529650 from mon.3
                       -56> 2024-06-22T05:41:43.208+0100
                    7f184de84700  4 mds.0.purge_queue operator():
                    data pool 6 not found in OSDMap
                       -55> 2024-06-22T05:41:43.208+0100
                    7f184de84700  4 mds.0.purge_queue operator():
                    data pool 3 not found in OSDMap
                       -54> 2024-06-22T05:41:43.209+0100
                    7f184de84700  5 asok(0x5592e7968000)
                    register_command objecter_requests hook
                    0x5592e78f8800
                       -53> 2024-06-22T05:41:43.209+0100
                    7f184de84700 10 monclient: _renew_subs
                       -52> 2024-06-22T05:41:43.209+0100
                    7f184de84700 10 monclient: _send_mon_message to
                    mon.pebbles-s4 at v2:10.1.5.134:3300/0
                    <http://10.1.5.134:3300/0>
                       -51> 2024-06-22T05:41:43.209+0100
                    7f184de84700 10 log_channel(cluster)
                    update_config to_monitors: true to_syslog:
                    false syslog_facility:  prio: info to_graylog:
                    false graylog_host: 127.0.0.1 graylog_port: 12201)
                       -50> 2024-06-22T05:41:43.209+0100
                    7f184de84700  4 mds.0.purge_queue operator():
                    data pool 6 not found in OSDMap
                       -49> 2024-06-22T05:41:43.209+0100
                    7f184de84700  4 mds.0.purge_queue operator():
                    data pool 3 not found in OSDMap
                       -48> 2024-06-22T05:41:43.209+0100
                    7f184de84700  4 mds.0.0 apply_blocklist: killed
                    0, blocklisted sessions (0 blocklist entries, 0)
                       -47> 2024-06-22T05:41:43.209+0100
                    7f184de84700  1 mds.0.2529650 handle_mds_map i
                    am now mds.0.2529650
                       -46> 2024-06-22T05:41:43.209+0100
                    7f184de84700  1 mds.0.2529650 handle_mds_map
                    state change up:standby --> up:replay
                       -45> 2024-06-22T05:41:43.209+0100
                    7f184de84700  5 mds.beacon.pebbles-s2
                    set_want_state: up:standby -> up:replay
                       -44> 2024-06-22T05:41:43.209+0100
                    7f184de84700  1 mds.0.2529650 replay_start
                       -43> 2024-06-22T05:41:43.209+0100
                    7f184de84700  1 mds.0.2529650 waiting for
                    osdmap 473739 (which blocklists prior instance)
                       -42> 2024-06-22T05:41:43.209+0100
                    7f184de84700 10 monclient: _send_mon_message to
                    mon.pebbles-s4 at v2:10.1.5.134:3300/0
                    <http://10.1.5.134:3300/0>
                       -41> 2024-06-22T05:41:43.209+0100
                    7f1849e7c700  2 mds.0.cache Memory usage: 
                    total 299012, rss 37624, heap 182556, baseline
                    182556, 0 / 0 inodes have caps, 0 caps, 0 caps
                    per inode
                       -40> 2024-06-22T05:41:43.224+0100
                    7f184de84700 10 monclient: _renew_subs
                       -39> 2024-06-22T05:41:43.224+0100
                    7f184de84700 10 monclient: _send_mon_message to
                    mon.pebbles-s4 at v2:10.1.5.134:3300/0
                    <http://10.1.5.134:3300/0>
                       -38> 2024-06-22T05:41:43.224+0100
                    7f184de84700 10 monclient:
                    handle_get_version_reply finishing 1 version 473739
                       -37> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 0:
                    opening inotable
                       -36> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 0:
                    opening sessionmap
                       -35> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 0:
                    opening mds log
                       -34> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  5 mds.0.log open discovering log
                    bounds
                       -33> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 0:
                    opening purge queue (async)
                       -32> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  4 mds.0.purge_queue open: opening
                       -31> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  1 mds.0.journaler.pq(ro) recover
                    start
                       -30> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  1 mds.0.journaler.pq(ro) read_head
                       -29> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 0:
                    loading open file table (async)
                       -28> 2024-06-22T05:41:43.224+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 0:
                    opening snap table
                       -27> 2024-06-22T05:41:43.224+0100
                    7f1847677700  4 mds.0.journalpointer Reading
                    journal pointer '400.00000000'
                       -26> 2024-06-22T05:41:43.224+0100
                    7f1850689700 10 monclient: get_auth_request con
                    0x5592e8987000 auth_method 0
                       -25> 2024-06-22T05:41:43.225+0100
                    7f1850e8a700 10 monclient: get_auth_request con
                    0x5592e8987c00 auth_method 0
                       -24> 2024-06-22T05:41:43.252+0100
                    7f1848e7a700  1 mds.0.journaler.pq(ro)
                    _finish_read_head loghead(trim 231160676352,
                    expire 231163662875, write 231163662875,
                    stream_format 1).  probing for end of log (from
                    231163662875)...
                       -23> 2024-06-22T05:41:43.252+0100
                    7f1848e7a700  1 mds.0.journaler.pq(ro) probing
                    for end of the log
                       -22> 2024-06-22T05:41:43.252+0100
                    7f1847677700  1 mds.0.journaler.mdlog(ro)
                    recover start
                       -21> 2024-06-22T05:41:43.252+0100
                    7f1847677700  1 mds.0.journaler.mdlog(ro) read_head
                       -20> 2024-06-22T05:41:43.252+0100
                    7f1847677700  4 mds.0.log Waiting for journal
                    0x200 to recover...
                       -19> 2024-06-22T05:41:43.252+0100
                    7f1850689700 10 monclient: get_auth_request con
                    0x5592e8bc6000 auth_method 0
                       -18> 2024-06-22T05:41:43.253+0100
                    7f185168b700 10 monclient: get_auth_request con
                    0x5592e8bc6800 auth_method 0
                       -17> 2024-06-22T05:41:43.257+0100
                    7f1847e78700  1 mds.0.journaler.mdlog(ro)
                    _finish_read_head loghead(trim 90131453181952,
                    expire 90131465778558, write 90132009715463,
                    stream_format 1).  probing for end of log (from
                    90132009715463)...
                       -16> 2024-06-22T05:41:43.257+0100
                    7f1847e78700  1 mds.0.journaler.mdlog(ro)
                    probing for end of the log
                       -15> 2024-06-22T05:41:43.257+0100
                    7f1847e78700  1 mds.0.journaler.mdlog(ro)
                    _finish_probe_end write_pos = 90132019384791
                    (header had 90132009715463). recovered.
                       -14> 2024-06-22T05:41:43.257+0100
                    7f1847677700  4 mds.0.log Journal 0x200 recovered.
                       -13> 2024-06-22T05:41:43.257+0100
                    7f1847677700  4 mds.0.log Recovered journal
                    0x200 in format 1
                       -12> 2024-06-22T05:41:43.273+0100
                    7f1848e7a700  1 mds.0.journaler.pq(ro)
                    _finish_probe_end write_pos = 231163662875
                    (header had 231163662875). recovered.
                       -11> 2024-06-22T05:41:43.273+0100
                    7f1848e7a700  4 mds.0.purge_queue operator():
                    open complete
                       -10> 2024-06-22T05:41:43.273+0100
                    7f1848e7a700  1 mds.0.journaler.pq(ro)
                    set_writeable
                        -9> 2024-06-22T05:41:43.441+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 1:
                    loading/discovering base inodes
                        -8> 2024-06-22T05:41:43.441+0100
                    7f1847e78700  0 mds.0.cache creating system
                    inode with ino:0x100
                        -7> 2024-06-22T05:41:43.442+0100
                    7f1847e78700  0 mds.0.cache creating system
                    inode with ino:0x1
                        -6> 2024-06-22T05:41:43.442+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 2:
                    replaying mds log
                        -5> 2024-06-22T05:41:43.442+0100
                    7f1847e78700  2 mds.0.2529650 Booting: 2:
                    waiting for purge queue recovered
                        -4> 2024-06-22T05:41:44.090+0100
                    7f184ce82700 10 monclient: tick
                        -3> 2024-06-22T05:41:44.090+0100
                    7f184ce82700 10 monclient: _check_auth_rotating
                    have uptodate secrets (they expire after
                    2024-06-22T05:41:14.091638+0100)
                        -2> 2024-06-22T05:41:44.210+0100
                    7f1849e7c700  2 mds.0.cache Memory usage: 
                    total 588368, rss 308304, heap 207132, baseline
                    182556, 0 / 15149 inodes have caps, 0 caps, 0
                    caps per inode
                        -1> 2024-06-22T05:41:44.642+0100
                    7f1846675700 -1
                    /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                    In function 'void interval_set<T, C>::erase(T,
                    T, std::function<bool(T, T)>) [with T =
                    inodeno_t; C = std::map]' thread 7f1846675700
                    time 2024-06-22T05:41:44.643146+0100

                     ceph version 17.2.7
                    (b12291d110049b2f35e32e0de30d70e9a4c060d2)
                    quincy (stable)
                     1: (ceph::__ceph_assert_fail(char const*, char
                    const*, int, char const*)+0x135) [0x7f18568b64a3]
                     2:
                    /usr/lib64/ceph/libceph-common.so.2(+0x269669)
                    [0x7f18568b6669]
                     3: (interval_set<inodeno_t,
                    std::map>::erase(inodeno_t, inodeno_t,
                    std::function<bool (inodeno_t,
                    inodeno_t)>)+0x2e5) [0x5592e5027885]
                     4: (EMetaBlob::replay(MDSRank*, LogSegment*,
                    int, MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
                     5: (EUpdate::replay(MDSRank*)+0x61)
                    [0x5592e5330bd1]
                     6: (MDLog::_replay_thread()+0x7bb)
                    [0x5592e52b754b]
                     7: (MDLog::ReplayThread::entry()+0x11)
                    [0x5592e4f6a041]
                     8: /lib64/libpthread.so.0(+0x81ca)
                    [0x7f18558a41ca]
                     9: clone()

                         0> 2024-06-22T05:41:44.643+0100
                    7f1846675700 -1 *** Caught signal (Aborted) **
                     in thread 7f1846675700 thread_name:md_log_replay

                     ceph version 17.2.7
                    (b12291d110049b2f35e32e0de30d70e9a4c060d2)
                    quincy (stable)
                     1: /lib64/libpthread.so.0(+0x12cf0)
                    [0x7f18558aecf0]
                     2: gsignal()
                     3: abort()
                     4: (ceph::__ceph_assert_fail(char const*, char
                    const*, int, char const*)+0x18f) [0x7f18568b64fd]
                     5:
                    /usr/lib64/ceph/libceph-common.so.2(+0x269669)
                    [0x7f18568b6669]
                     6: (interval_set<inodeno_t,
                    std::map>::erase(inodeno_t, inodeno_t,
                    std::function<bool (inodeno_t,
                    inodeno_t)>)+0x2e5) [0x5592e5027885]
                     7: (EMetaBlob::replay(MDSRank*, LogSegment*,
                    int, MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
                     8: (EUpdate::replay(MDSRank*)+0x61)
                    [0x5592e5330bd1]
                     9: (MDLog::_replay_thread()+0x7bb)
                    [0x5592e52b754b]
                     10: (MDLog::ReplayThread::entry()+0x11)
                    [0x5592e4f6a041]
                     11: /lib64/libpthread.so.0(+0x81ca)
                    [0x7f18558a41ca]
                     12: clone()

                We have a relatively low debug setting normally so
                I don't think many details of the initial crash
                were captured unfortunately and the MDS logs before
                the above (i.e. "-60" and older) are just beacon
                messages and _check_auth_rotating checks.

                I was wondering whether you have any
                recommendations in terms of what actions we could
                take to bring our filesystem back into a working
                state short of rebuilding the entire metadata pool?
                We are quite keen to bring our backup back into
                service urgently as we currently do not have any
                accessible backups for our Ceph clusters.

                Kindest regards,

                Ivan

                On 25/06/2024 19:18, Dhairya Parmar wrote:
                CAUTION: This email originated from outside of the
                LMB:
                *.-dparmar@xxxxxxxxxx-.*
                Do not click links or open attachments unless you
                recognize the sender and know the content is safe.
                If you think this is a phishing email, please
                forward it to phishing@xxxxxxxxxxxxxxxxx

                --

                On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson
                <ivan@xxxxxxxxxxxxxxxxx> wrote:

                    Hi Dhairya,

                    Thank you for your rapid reply. I tried
                    recovering the dentries for the file just
                    before the crash I mentioned before and then
                    splicing the transactions from the journal
                    which seemed to remove that issue for that
                    inode but resulted in the MDS crashing on the
                    next inode in the journal when performing replay.

                The MDS delegates a range of preallocated inodes
                (in form of a set - interval_set<inodeno_t>
                preallocated_inos) to the clients, so it can be
                one inode that is untracked or some inodes from
                the range or in worst case scenario - ALL, and
                this is something that even the
                `cephfs-journal-tool` would not be able to tell
                (since we're talking about MDS internals which
                aren't exposed to such tools). That is the reason
                why you see "MDS crashing on the next inode in the
                journal when performing replay".

                An option could be to expose the inode set to some
                tool or asok cmd to identify such inodes ranges,
                which needs to be discussed. For now, we're trying
                to address this in [0], you can follow the
                discussion there.

                [0] https://tracker.ceph.com/issues/66251

                    Removing all the transactions involving the
                    directory housing the files that seemed to
                    cause these crashes from the journal only
                    caused the MDS to fail to even start replay.

                    I've rolled back our journal to our original
                    version when the crash first happened and the
                    entire MDS log for the crash can be found
                    here:
                    https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24

                Awesome, this would help us a ton. Apart from
                this, would it be possible to send us client logs?

                    Please let us know if you would like any other
                    logs file as we can easily induce this crash.

                Since you can easily induce the crash, can you
                share the reproducer please i.e. what all action
                you take in order to hit this?

                    Kindest regards,

                    Ivan

                    On 25/06/2024 09:58, Dhairya Parmar wrote:
                    CAUTION: This email originated from outside
                    of the LMB:
                    *.-dparmar@xxxxxxxxxx-.*
                    Do not click links or open attachments unless
                    you recognize the sender and know the content
                    is safe.
                    If you think this is a phishing email, please
                    forward it to phishing@xxxxxxxxxxxxxxxxx

                    --

                    Hi Ivan,

                    This looks to be similar to the issue [0]
                    that we're already addressing at [1]. So
                    basically there is some out-of-sync event
                    that led the client to make use of the inodes
                    that MDS wasn't aware of/isn't tracking and
                    hence the crash. It'd be really helpful if
                    you can provide us more logs.

                    CC @Rishabh Dave <mailto:ridave@xxxxxxxxxx>
                    @Venky Shankar <mailto:vshankar@xxxxxxxxxx>
                    @Patrick Donnelly
                    <mailto:pdonnell@xxxxxxxxxx> @Xiubo Li
                    <mailto:xiubli@xxxxxxxxxx>

                    [0] https://tracker.ceph.com/issues/61009
                    [1] https://tracker.ceph.com/issues/66251
                    --
                    ***Dhairya Parmar*

                    Associate Software Engineer, CephFS

                    <https://www.redhat.com/>IBM, Inc.

                    On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson
                    <ivan@xxxxxxxxxxxxxxxxx> wrote:

                        Hello,

                        We have been experiencing a serious issue
                        with our CephFS backup cluster
                        running quincy (version 17.2.7) on a
                        RHEL8-derivative Linux kernel
                        (Alma8.9, 4.18.0-513.9.1 kernel) where
                        our MDSes for our filesystem are
                        constantly in a "replay" or
                        "replay(laggy)" state and keep crashing.

                        We have a single MDS filesystem called
                        "ceph_backup" with 2 standby
                        MDSes along with a 2nd unused filesystem
                        "ceph_archive" (this holds
                        little to no data) where we are using our
                        "ceph_backup" filesystem to
                        backup our data and this is the one which
                        is currently broken. The Ceph
                        health outputs currently are:

                        root@pebbles-s1 14:05 [~]: ceph -s
                               cluster:
                                 id:
                        e3f7535e-d35f-4a5d-88f0-a1e97abcd631
                                 health: HEALTH_WARN
                                         1 filesystem is degraded
                        insufficient standby MDS daemons available
                        1319 pgs not deep-scrubbed in time
                        1054 pgs not scrubbed in time

                               services:
                                 mon: 4 daemons, quorum
                        pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4
                        (age 36m)
                                 mgr: pebbles-s2(active, since
                        36m), standbys: pebbles-s4,
                            pebbles-s3, pebbles-s1
                                 mds: 2/2 daemons up
                                 osd: 1380 osds: 1380 up (since
                        29m), 1379 in (since 3d); 37
                            remapped pgs

                               data:
                                 volumes: 1/2 healthy, 1 recovering
                                 pools: 7 pools, 2177 pgs
                                 objects: 3.55G objects, 7.0 PiB
                                 usage: 8.9 PiB used, 14 PiB / 23
                        PiB avail
                                 pgs: 83133528/30006841533
                        objects misplaced (0.277%)
                        2090 active+clean
                        47 active+clean+scrubbing+deep
                        29 active+remapped+backfilling
                        8 active+remapped+backfill_wait
                        2 active+clean+scrubbing
                        1 active+clean+snaptrim

                               io:
                                 recovery: 1.9 GiB/s, 719 objects/s

                        root@pebbles-s1 14:09 [~]: ceph fs status
                            ceph_backup - 0 clients
                            ===========
                            RANK STATE MDS ACTIVITY   DNS INOS  
                        DIRS CAPS
                              0 replay(laggy) pebbles-s3 0      0
                        0      0
                        POOL TYPE     USED AVAIL
                        mds_backup_fs metadata  1255G 2780G
                        ec82_primary_fs_data data       0 2780G
                        ec82pool data    8442T 3044T
                            ceph_archive - 2 clients
                            ============
                            RANK STATE MDS ACTIVITY DNS    INOS  
                        DIRS CAPS
                              0    active pebbles-s2 Reqs:    0
                        /s 13.4k  7105    118 2
                        POOL TYPE     USED AVAIL
                        mds_archive_fs metadata  5184M 2780G
                        ec83_primary_fs_data data       0 2780G
                        ec83pool data     138T 2767T
                            MDS version: ceph version 17.2.7
                        (b12291d110049b2f35e32e0de30d70e9a4c060d2)
                        quincy (stable)
                        root@pebbles-s1 14:09 [~]: ceph health
                        detail | head
                            HEALTH_WARN 1 filesystem is degraded;
                        insufficient standby MDS
                            daemons available; 1319 pgs not
                        deep-scrubbed in time; 1054 pgs not
                            scrubbed in time
                            [WRN] FS_DEGRADED: 1 filesystem is
                        degraded
                                 fs ceph_backup is degraded
                            [WRN] MDS_INSUFFICIENT_STANDBY:
                        insufficient standby MDS daemons
                            available
                                 have 0; want 1 more

                        When our cluster first ran after a
                        reboot, Ceph ran through the 2
                        standby MDSes, crashing them all, until
                        it reached the final MDS and is
                        now stuck in this "replay(laggy)" state.
                        Putting our MDSes into
                        debugging mode, we can see that this MDS
                        crashed when replaying the
                        journal for a particular inode (this is
                        the same for all the MDSes and
                        they all crash on the same object):

                            ...
                        2024-06-24T13:44:55.563+0100 7f8811c40700
                        10 mds.0.journal
                        EMetaBlob.replay for [521,head] had
                        [inode 0x1005ba89481
                            [...539,head]
                        /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
                            auth fragtree_t(*^2 00*^3 00000*^
                            4 00001*^3 00010*^4 00011*^4 00100*^4
                        00101*^4 00110*^4 00111*^4
                            01*^3 01000*^4 01001*^3 01010*^4
                        01011*^3 01100*^4 01101*^4 01110*^4
                            01111*^4 10*^3 10000*^4 10001*^4
                        10010*^4 10011*^4 10100*^4 10101*^3
                            10110*^4 10111*^4 11*^6) v10880645
                        f(v0 m2024-06-22
                        T05:41:10.213700+0100 1281276=1281276+0)
                        n(v12
                        rc2024-06-22T05:41:10.213700+0100
                        b1348251683896 1281277=1281276+1)
                            old_inodes=8 (iversion lock) |
                        dirfrag=416 dirty=1 0x55770a2bdb80]
                        2024-06-24T13:44:55.563+0100 7f8811c40700
                        10 mds.0.journal
                        EMetaBlob.replay dir 0x1005ba89481.011011000*
                        2024-06-24T13:44:55.563+0100 7f8811c40700
                        10 mds.0.journal
                        EMetaBlob.replay updated dir [dir
                        0x1005ba89481.011011000*
                        /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
                            [2,head] auth v=436385 cv=0/0
                        state=107374182
                            4 f(v0
                        m2024-06-22T05:41:10.213700+0100
                        2502=2502+0) n(v12
                        rc2024-06-22T05:41:10.213700+0100
                        b2120744220 2502=2502+0)
                        hs=32+33,ss=0+0 dirty=65 | child=1
                        0x55770ebcda80]
                        2024-06-24T13:44:55.563+0100 7f8811c40700
                        10 mds.0.journal
                        EMetaBlob.replay added (full) [dentry
                        #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti
                        ons_ave_Z124.mrc.teberet7.partial
                        [539,head] auth NULL (dversion
                            lock) v=436384 ino=(nil)
                        state=1610612800|bottomlru | dirty=1
                        0x557710444500]
                        2024-06-24T13:44:55.563+0100 7f8811c40700
                        10 mds.0.journal
                        EMetaBlob.replay added [inode
                        0x1005cd4fe35 [539,head]
                        /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_
                        005006_fractions_ave_Z124.mrc.teberet7.partial
                        auth v436384 s=0 n(v0
                            1=1+0) (iversion lock)
                        cr={99995144=0-4194304@538} 0x557710438680]
                        2024-06-24T13:44:55.563+0100 7f8811c40700 10
                        mds.0.cache.ino(0x1005cd4fe35)
                        mark_dirty_parent
                        2024-06-24T13:44:55.563+0100 7f8811c40700
                        10 mds.0.journal
                        EMetaBlob.replay noting opened inode
                        [inode 0x1005cd4fe35 [539,head]
                        /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762
                        6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                        auth
                            v436384 DIRTYPARENT s=0 n(v0 1=1+0)
                        (iversion lock)
                        cr={99995144=0-4194304@538} |
                        dirtyparent=1 dirty=1 0x557710438680]
                        2024-06-24T13:44:55.563+0100 7f8811c40700
                        10 mds.0.journal
                        EMetaBlob.replay inotable tablev 3112837
                        <= table 3112837
                        2024-06-24T13:44:55.563+0100 7f8811c40700
                        10 mds.0.journal
                        EMetaBlob.replay sessionmap v 1560540883,
                        table 1560540882 prealloc
                            [] used 0x1005cd4fe35
                        2024-06-24T13:44:55.563+0100 7f8811c40700 -1
                        /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                            I
                            n function 'void interval_set<T,
                        C>::erase(T, T,
                        std::function<bool(T, T)>) [with T =
                        inodeno_t; C = std::map]'
                            thread 7f8811c40700 time
                        2024-06-24T13:44:55.564315+0100
                        /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                            568: FAILED ceph_assert(p->first <=
                        start)

                              ceph version 17.2.7
                        (b12291d110049b2f35e32e0de30d70e9a4c060d2)
                            quincy (stable)
                              1: (ceph::__ceph_assert_fail(char
                        const*, char const*, int, char
                            const*)+0x135) [0x7f8821e814a3]
                              2:
                        /usr/lib64/ceph/libceph-common.so.2(+0x269669)
                        [0x7f8821e81669]
                              3: (interval_set<inodeno_t,
                        std::map>::erase(inodeno_t, inodeno_t,
                        std::function<bool (inodeno_t,
                        inodeno_t)>)+0x2e5) [0x5576f9bb2885]
                              4: (EMetaBlob::replay(MDSRank*,
                        LogSegment*, int,
                        MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
                              5: (EUpdate::replay(MDSRank*)+0x61)
                        [0x5576f9ebbbd1]
                              6: (MDLog::_replay_thread()+0x7bb)
                        [0x5576f9e4254b]
                              7:
                        (MDLog::ReplayThread::entry()+0x11)
                        [0x5576f9af5041]
                              8: /lib64/libpthread.so.0(+0x81ca)
                        [0x7f8820e6f1ca]
                              9: clone()

                        I've only included a short section of the
                        crash (this is the first trace
                        in the log with regards to the crash with
                        a 10/20 debug_mds option). We
                        tried deleting the 0x1005cd4fe35 object
                        from the object store using the
                        "rados" command but this did not allow
                        our MDS to successfully replay.

                         From my understanding the journal seems
                        okay as we didn't run out of
                        space for example on our metadata pool
                        and "cephfs-journal-tool journal
                        inspect" doesn't seem to think there is
                        any damage:

                        root@pebbles-s1 13:58 [~]:
                        cephfs-journal-tool --rank=ceph_backup:0
                            journal inspect
                            Overall journal integrity: OK
                        root@pebbles-s1 14:04 [~]:
                        cephfs-journal-tool --rank=ceph_backup:0
                            event get --inode 1101069090357 summary
                            Events by type:
                               OPEN: 1
                               UPDATE: 3
                            Errors: 0
                        root@pebbles-s1 14:05 [~]:
                        cephfs-journal-tool --rank=ceph_backup:0
                            event get --inode 1101069090357 list
                        2024-06-22T05:41:10.214635+0100
                        0x51f97d4cfe35 UPDATE:  (openc)
                        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                        2024-06-22T05:41:11.203312+0100
                        0x51f97d59c848 UPDATE:
                        (check_inode_max_size)
                        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                        2024-06-22T05:41:15.484871+0100
                        0x51f97e7344cc OPEN:  ()
                        FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                        2024-06-22T05:41:15.484921+0100
                        0x51f97e73493b UPDATE:  (rename)
                        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                        test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc

                        I was wondering whether anyone had any
                        advice for us on how we should
                        proceed forward? We were thinking about
                        manually applying these events
                        (via "event apply") where failing that we
                        could erase this problematic
                        event with "cephfs-journal-tool
                        --rank=ceph_backup:0 event splice
                        --inode 1101069090357". Is this a good
                        idea? We would rather not rebuild
                        the entire metadata pool if we could
                        avoid it (once was enough for us)
                        as this cluster has ~9 PB of data on it.

                        Kindest regards,

                        Ivan Clayson

                        -- 
                        Ivan Clayson
                        -----------------
                        Scientific Computing Officer
                        Room 2N249
                        Structural Studies
                        MRC Laboratory of Molecular Biology
                        Francis Crick Ave, Cambridge
                        CB2 0QH
                        _______________________________________________
                        ceph-users mailing list -- ceph-users@xxxxxxx
                        To unsubscribe send an email to
                        ceph-users-leave@xxxxxxx

                    -- 
                    Ivan Clayson
                    -----------------
                    Scientific Computing Officer
                    Room 2N249
                    Structural Studies
                    MRC Laboratory of Molecular Biology
                    Francis Crick Ave, Cambridge
                    CB2 0QH

                -- 
                Ivan Clayson
                -----------------
                Scientific Computing Officer
                Room 2N249
                Structural Studies
                MRC Laboratory of Molecular Biology
                Francis Crick Ave, Cambridge
                CB2 0QH

            -- 
            Ivan Clayson
            -----------------
            Scientific Computing Officer
            Room 2N249
            Structural Studies
            MRC Laboratory of Molecular Biology
            Francis Crick Ave, Cambridge
            CB2 0QH

        -- 
        Ivan Clayson
        -----------------
        Scientific Computing Officer
        Room 2N249
        Structural Studies
        MRC Laboratory of Molecular Biology
        Francis Crick Ave, Cambridge
        CB2 0QH

    -- 
    Ivan Clayson
    -----------------
    Scientific Computing Officer
    Room 2N249
    Structural Studies
    MRC Laboratory of Molecular Biology
    Francis Crick Ave, Cambridge
    CB2 0QH

--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx