Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Ivan,

This may be a little off-topic, but if you're still running AlmaLinux
8.9, it's worth noting that CentOS 8 actually reached end-of-life about
two years ago, thanks to CentOS Stream.

Up until this last week I had several AlmaLinux 8 machines running
myself, but somewhere around May IBM Red Hat apparently pulled all of
its CentOS 8 enterprise repositories offline, including Storage and
Ceph, which broke my yum updates.

As far as I'm aware, once you've installed cephadm (whether via yum/dnf
or otherwise) there's no further need for the RPM repos, but losing yum
support certainly doesn't help.

On the upside, it's possible to upgrade-in-place from AlmaLinux 8.9 to
AlmaLinux 9, although it may require temporarily disabling certain OS
services to appease the upgrade process.
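
If memory serves, the in-place route goes through AlmaLinux's
ELevate/leapp tooling, roughly like this (please double-check against
the current ELevate docs before running any of it):

  $ sudo yum install -y http://repo.almalinux.org/elevate/elevate-release-latest-el$(rpm --eval %rhel).noarch.rpm
  $ sudo yum install -y leapp-upgrade leapp-data-almalinux
  $ sudo leapp preupgrade   # review the report and fix any blockers first
  $ sudo leapp upgrade      # then reboot into the upgraded system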

Probably won't solve your problem, but at least you'll be able to move
fairly painlessly to a better-supported platform.

  Best Regards,
     Tim

On Tue, 2024-07-09 at 11:14 +0100, Ivan Clayson wrote:
> Hi Dhairya,
> 
> I would be more than happy to try and give as many details as
> possible, but the Slack channel is private and requires my email to
> have an account/access to it.
> 
> Wouldn't taking the discussion about this error to a private channel
> also stop other users who experience this error from learning how and
> why this happened, as well as possibly being unable to view the
> solution? Would it not be possible to discuss this more publicly for
> the benefit of the other users on the mailing list?
> 
> Kindest regards,
> 
> Ivan
> 
> On 09/07/2024 10:44, Dhairya Parmar wrote:
> > 
> > Hey Ivan,
> > 
> > This is a relatively new MDS crash, so this would require some 
> > investigation but I was instructed to recommend disaster-recovery 
> > steps [0] (except session reset) to you to get the FS up again.
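> > 
> > To be explicit, that means the steps in [0] minus the session table
> > reset; i.e. a step along the lines of the following should be skipped:
> > 
> >     $ cephfs-table-tool all reset session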
> > 
> > This crash is being discussed on the upstream CephFS Slack channel [1]
> > with @Venky Shankar and other CephFS devs. I'd encourage you to join
> > the conversation; we can discuss this in detail and maybe go through
> > the incident step by step, which should help analyse the crash better.
> > 
> > [0] 
> > https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> > [1]
> > https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519
> > 
> > On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson
> > <ivan@xxxxxxxxxxxxxxxxx> 
> > wrote:
> > 
> >     Hi Dhairya,
> > 
> >     Thank you ever so much for having another look at this so quickly.
> >     I don't think I have any logs similar to the ones you referenced
> >     this time, as my MDSes don't seem to enter the replay stage when
> >     they crash (or at least not now, after I've thrown the logs away),
> >     but those errors do crop up in the prior logs I shared when the
> >     system first crashed.
> > 
> >     Kindest regards,
> > 
> >     Ivan
> > 
> >     On 08/07/2024 14:08, Dhairya Parmar wrote:
> > > 
> > >     Ugh, something went horribly wrong. I've downloaded the MDS logs
> > >     that contain the assertion failure and it looks related to [0].
> > >     Do you have client logs for this?
> > > 
> > >     The other log that you shared is being downloaded right now,
> > > once
> > >     that's done and I'm done going through it, I'll update you.
> > > 
> > >     [0] https://tracker.ceph.com/issues/54546
> > > 
> > >     On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
> > >     <ivan@xxxxxxxxxxxxxxxxx> wrote:
> > > 
> > >         Hi Dhairya,
> > > 
> > >         Sorry to resurrect this thread again, but we still
> > >         unfortunately have an issue with our filesystem after we
> > >         attempted to write new backups to it.
> > > 
> > >         We finished the scrub of the filesystem on Friday and ran a
> > >         repair scrub on the 1 directory which had metadata damage.
> > >         After doing so and rebooting, the cluster reported no issues
> > >         and data was accessible again.
> > > 
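> > >         For reference, the repair scrub was along the lines of the
> > >         following (with the real damaged path in place of the
> > >         placeholder):
> > > 
> > >         $ ceph tell mds.ceph_backup:0 scrub start /<damaged-dir> recursive,repair
> > > 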
> > >         We re-started the backups to run over the weekend and
> > >         unfortunately the filesystem crashed again; the log of the
> > >         failure is here:
> > > https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz
> > >         We ran the backups on kernel mounts of the filesystem without
> > >         the nowsync option this time to avoid the out-of-sync write
> > >         problems.
> > > 
> > >         I've tried resetting the journal again after recovering the
> > >         dentries but unfortunately the filesystem is still in a
> > >         failed state despite setting joinable to true. The log of
> > >         this crash is here:
> > > https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708
> > > 
> > >         I'm not sure how to proceed as I can't seem to get any MDS
> > >         to take over the first rank. I would like to do a scrub of
> > >         the filesystem and preferably overwrite the troublesome files
> > >         with the originals on the live filesystem. Do you have any
> > >         advice on how to make the filesystem leave its failed state?
> > >         I have a backup of the journal before I reset it so I can
> > >         roll back if necessary.
> > > 
> > >         Here are some details about the filesystem at present:
> > > 
> > >             root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
> > >               cluster:
> > >                 id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
> > >                 health: HEALTH_ERR
> > >                         1 filesystem is degraded
> > >                         1 large omap objects
> > >                         1 filesystem is offline
> > >                         1 mds daemon damaged
> > >                         nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
> > >                         1750 pgs not deep-scrubbed in time
> > >                         1612 pgs not scrubbed in time
> > > 
> > >               services:
> > >                 mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
> > >                 mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, pebbles-s4
> > >                 mds: 1/2 daemons up, 3 standby
> > >                 osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped pgs
> > >                      flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
> > > 
> > >               data:
> > >                 volumes: 1/2 healthy, 1 recovering; 1 damaged
> > >                 pools:   7 pools, 2177 pgs
> > >                 objects: 3.24G objects, 6.7 PiB
> > >                 usage:   8.6 PiB used, 14 PiB / 23 PiB avail
> > >                 pgs:     11785954/27384310061 objects misplaced (0.043%)
> > >                          2167 active+clean
> > >                          6    active+remapped+backfilling
> > >                          4    active+remapped+backfill_wait
> > > 
> > >             ceph_backup - 0 clients
> > >             ===========
> > >             RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
> > >              0    failed
> > >                     POOL            TYPE     USED  AVAIL
> > >                mds_backup_fs      metadata  1174G  3071G
> > >             ec82_primary_fs_data    data       0   3071G
> > >                   ec82pool          data    8085T  4738T
> > >             ceph_archive - 2 clients
> > >             ============
> > >             RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
> > >              0    active  pebbles-s4  Reqs:    0 /s  13.4k   7105    118      2
> > >                     POOL            TYPE     USED  AVAIL
> > >                mds_archive_fs     metadata  5184M  3071G
> > >             ec83_primary_fs_data    data       0   3071G
> > >                   ec83pool          data     138T  4307T
> > >             STANDBY MDS
> > >              pebbles-s2
> > >              pebbles-s3
> > >              pebbles-s1
> > >             MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> > >             root@pebbles-s2 11:55 [~]: ceph fs dump
> > >             e2643889
> > >             enable_multiple, ever_enabled_multiple: 1,1
> > >             default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > >             legacy client fscid: 1
> > > 
> > >             Filesystem 'ceph_backup' (1)
> > >             fs_name    ceph_backup
> > >             epoch    2643888
> > >             flags    12 joinable allow_snaps allow_multimds_snaps
> > >             created    2023-05-19T12:52:36.302135+0100
> > >             modified    2024-07-08T11:17:55.437861+0100
> > >             tableserver    0
> > >             root    0
> > >             session_timeout    60
> > >             session_autoclose    300
> > >             max_file_size    109934182400000
> > >             required_client_features    {}
> > >             last_failure    0
> > >             last_failure_osd_epoch    494515
> > >             compat    compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > >             max_mds    1
> > >             in    0
> > >             up    {}
> > >             failed
> > >             damaged    0
> > >             stopped
> > >             data_pools    [6,3]
> > >             metadata_pool    2
> > >             inline_data    disabled
> > >             balancer
> > >             standby_count_wanted    1
> > > 
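> > >         Given the "damaged    0" line in the dump above, my (possibly
> > >         wrong) understanding is that rank 0 would first need to be
> > >         marked repaired before a standby can claim it, i.e. something
> > >         along the lines of:
> > > 
> > >             $ ceph mds repaired ceph_backup:0
> > >             $ ceph fs set ceph_backup joinable true
> > > 
> > >         but please correct me if that is the wrong approach here.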
> > > 
> > >         Kindest regards,
> > > 
> > >         Ivan
> > > 
> > >         On 28/06/2024 15:17, Dhairya Parmar wrote:
> > > > 
> > > >         On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson
> > > >         <ivan@xxxxxxxxxxxxxxxxx> wrote:
> > > > 
> > > >             Hi Dhairya,
> > > > 
> > > >             I would be more than happy to share our corrupted
> > > >             journal. Has the host key changed for drop.ceph.com?
> > > >             The fingerprint I'm being sent is
> > > >             7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is
> > > >             different to the one in our
> > > >             /usr/share/ceph/known_hosts_drop.ceph.com.
> > > > 
> > > >         Ah, strange. Let me get in touch with folks who might
> > > > know
> > > >         about this, will revert back to you ASAP
> > > > 
> > > >             Thank you for your advice as well. We've reset our
> > > > MDS'
> > > >             journal and are currently in the process of a full
> > > >             filesystem scrub which understandably is taking
> > > > quite a
> > > >             bit of time but seems to be progressing through the
> > > >             objects fine.
> > > > 
> > > >         YAY!
> > > > 
> > > >             Thank you ever so much for all your help and please
> > > > do
> > > >             feel free to follow up with us if you would like
> > > > any
> > > >             further details about our crash!
> > > > 
> > > >         Glad to hear it went well, this bug is being worked on
> > > > with
> > > >         high priority and once the patch is ready, it will be
> > > >         backported.
> > > > 
> > > >         The root cause of this issue is the `nowsync` (async
> > > >         dirops) option being enabled by default with kclient [0].
> > > >         This feature allows asynchronous creation and deletion of
> > > >         files, optimizing performance by avoiding round-trip
> > > >         latency for these system calls. However, in very rare cases
> > > >         (like yours :D), it can affect the system's consistency and
> > > >         stability, so if this kind of optimization is not a
> > > >         priority for your workload, I recommend turning it off by
> > > >         switching the mount points to `wsync` and also setting the
> > > >         MDS config `mds_client_delegate_inos_pct` to `0` so that
> > > >         you don't end up in this situation again (until the bug fix
> > > >         arrives :)).
> > > > 
> > > >         [0]
> > > >        
> > > > https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
> > > > 
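> > > >         As a rough sketch (adjust the mon host, mount point and
> > > >         client name for your setup), that would look something like:
> > > > 
> > > >         # remount the kernel clients with wsync instead of nowsync
> > > >         $ mount -t ceph <mon-host>:/ /mnt/backup -o name=<client>,wsync
> > > >         # stop the MDS delegating preallocated inode ranges to clients
> > > >         $ ceph config set mds mds_client_delegate_inos_pct 0
> > > > 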
> > > >             Kindest regards,
> > > > 
> > > >             Ivan
> > > > 
> > > >             On 27/06/2024 12:39, Dhairya Parmar wrote:
> > > > > 
> > > > >             Hi Ivan,
> > > > > 
> > > > >             The solution (which has been successful for us in the
> > > > >             past) is to reset the journal. This would bring the fs
> > > > >             back online and return the MDSes to a stable state,
> > > > >             but some data would be lost: the data in the journal
> > > > >             that hasn't been flushed to the backing store would be
> > > > >             gone. Therefore, you should try to flush out as much
> > > > >             journal data as possible before resetting the journal.
> > > > > 
> > > > >             Here are the steps for this entire process:
> > > > > 
> > > > >             1) Bring the FS offline
> > > > >             $ ceph fs fail <fs_name>
> > > > > 
> > > > >             2) Recover dentries from the journal (run it with every MDS rank)
> > > > >             $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries summary
> > > > > 
> > > > >             3) Reset the journal (again with every MDS rank)
> > > > >             $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
> > > > > 
> > > > >             4) Bring the FS online
> > > > >             $ ceph fs set <fs_name> joinable true
> > > > > 
> > > > >             5) Restart the MDSes
> > > > > 
> > > > >             6) Perform a scrub to ensure consistency of the fs
> > > > >             $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
> > > > >             # you could try a recursive scrub, maybe `ceph tell mds.<fs_name>:0 scrub start / recursive`
> > > > > 
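> > > > >             One extra precaution, not strictly part of the steps
> > > > >             above: before resetting in step 3 you can export the
> > > > >             journal so there is something to roll back to, e.g.
> > > > > 
> > > > >             $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal export backup.bin
> > > > >             # and, if a rollback is ever needed:
> > > > >             $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal import backup.bin
> > > > > 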
> > > > >             Some important notes to keep in mind:
> > > > >             * Recovering dentries will take time (generally, rank
> > > > >             0 is the most time-consuming, but the rest should be
> > > > >             quick).
> > > > >             * cephfs-journal-tool and metadata OSDs are bound to
> > > > >             use a significant CPU percentage. This is because
> > > > >             cephfs-journal-tool has to swig the journal data and
> > > > >             flush it out to the backing store, which also makes
> > > > >             the metadata operations go rampant, resulting in OSDs
> > > > >             taking a significant percentage of CPU.
> > > > > 
> > > > >             Do let me know how this goes.
> > > > > 
> > > > >             On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson
> > > > >             <ivan@xxxxxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > >                 Hi Dhairya,
> > > > > 
> > > > >                 We can induce the crash by simply restarting
> > > > > the
> > > > >                 MDS and the crash seems to happen when an MDS
> > > > > goes
> > > > >                 from up:standby to up:replay. The MDS works
> > > > > through
> > > > >                 a few files in the log before eventually
> > > > > crashing
> > > > >                 where I've included the logs for this here
> > > > > (this is
> > > > >                 after I imported the backed up journal which
> > > > > I hope
> > > > >                 was successful but please let me know if you
> > > > >                 suspect it wasn't!):
> > > > >                
> > > > > https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
> > > > > 
> > > > >                 With respect to the client logs, are you
> > > > > referring
> > > > >                 to the clients who are writing to the
> > > > > filesystem?
> > > > >                 We don't typically run them in any sort of
> > > > > debug
> > > > >                 mode and we have quite a few machines running
> > > > > our
> > > > >                 backup system but we can look an hour or so
> > > > > before
> > > > >                 the first MDS crash (though I don't know if
> > > > > this is
> > > > >                 when the de-sync occurred). Here are some MDS
> > > > > logs
> > > > >                 with regards to the initial crash on Saturday
> > > > >                 morning though which may be helpful:
> > > > > 
> > > > >                        -59> 2024-06-22T05:41:43.090+0100
> > > > >                     7f184ce82700 10 monclient: tick
> > > > >                        -58> 2024-06-22T05:41:43.090+0100
> > > > >                     7f184ce82700 10 monclient:
> > > > > _check_auth_rotating
> > > > >                     have uptodate secrets (they expire after
> > > > >                     2024-06-22T05:41:13.091556+0100)
> > > > >                        -57> 2024-06-22T05:41:43.208+0100
> > > > >                     7f184de84700  1 mds.pebbles-s2 Updating
> > > > > MDS map
> > > > >                     to version 2529650 from mon.3
> > > > >                        -56> 2024-06-22T05:41:43.208+0100
> > > > >                     7f184de84700  4 mds.0.purge_queue
> > > > > operator():
> > > > >                     data pool 6 not found in OSDMap
> > > > >                        -55> 2024-06-22T05:41:43.208+0100
> > > > >                     7f184de84700  4 mds.0.purge_queue
> > > > > operator():
> > > > >                     data pool 3 not found in OSDMap
> > > > >                        -54> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  5 asok(0x5592e7968000)
> > > > >                     register_command objecter_requests hook
> > > > >                     0x5592e78f8800
> > > > >                        -53> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700 10 monclient: _renew_subs
> > > > >                        -52> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700 10 monclient:
> > > > > _send_mon_message to
> > > > >                     mon.pebbles-s4 at v2:10.1.5.134:3300/0
> > > > >                     <http://10.1.5.134:3300/0>
> > > > >                        -51> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700 10 log_channel(cluster)
> > > > >                     update_config to_monitors: true
> > > > > to_syslog:
> > > > >                     false syslog_facility:  prio: info
> > > > > to_graylog:
> > > > >                     false graylog_host: 127.0.0.1
> > > > > graylog_port: 12201)
> > > > >                        -50> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  4 mds.0.purge_queue
> > > > > operator():
> > > > >                     data pool 6 not found in OSDMap
> > > > >                        -49> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  4 mds.0.purge_queue
> > > > > operator():
> > > > >                     data pool 3 not found in OSDMap
> > > > >                        -48> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  4 mds.0.0 apply_blocklist:
> > > > > killed
> > > > >                     0, blocklisted sessions (0 blocklist
> > > > > entries, 0)
> > > > >                        -47> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  1 mds.0.2529650
> > > > > handle_mds_map i
> > > > >                     am now mds.0.2529650
> > > > >                        -46> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  1 mds.0.2529650
> > > > > handle_mds_map
> > > > >                     state change up:standby --> up:replay
> > > > >                        -45> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  5 mds.beacon.pebbles-s2
> > > > >                     set_want_state: up:standby -> up:replay
> > > > >                        -44> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  1 mds.0.2529650
> > > > > replay_start
> > > > >                        -43> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700  1 mds.0.2529650 waiting for
> > > > >                     osdmap 473739 (which blocklists prior
> > > > > instance)
> > > > >                        -42> 2024-06-22T05:41:43.209+0100
> > > > >                     7f184de84700 10 monclient:
> > > > > _send_mon_message to
> > > > >                     mon.pebbles-s4 at v2:10.1.5.134:3300/0
> > > > >                     <http://10.1.5.134:3300/0>
> > > > >                        -41> 2024-06-22T05:41:43.209+0100
> > > > >                     7f1849e7c700  2 mds.0.cache Memory
> > > > > usage: 
> > > > >                     total 299012, rss 37624, heap 182556,
> > > > > baseline
> > > > >                     182556, 0 / 0 inodes have caps, 0 caps, 0
> > > > > caps
> > > > >                     per inode
> > > > >                        -40> 2024-06-22T05:41:43.224+0100
> > > > >                     7f184de84700 10 monclient: _renew_subs
> > > > >                        -39> 2024-06-22T05:41:43.224+0100
> > > > >                     7f184de84700 10 monclient:
> > > > > _send_mon_message to
> > > > >                     mon.pebbles-s4 at v2:10.1.5.134:3300/0
> > > > >                     <http://10.1.5.134:3300/0>
> > > > >                        -38> 2024-06-22T05:41:43.224+0100
> > > > >                     7f184de84700 10 monclient:
> > > > >                     handle_get_version_reply finishing 1
> > > > > version 473739
> > > > >                        -37> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 0:
> > > > >                     opening inotable
> > > > >                        -36> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 0:
> > > > >                     opening sessionmap
> > > > >                        -35> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 0:
> > > > >                     opening mds log
> > > > >                        -34> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  5 mds.0.log open
> > > > > discovering log
> > > > >                     bounds
> > > > >                        -33> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 0:
> > > > >                     opening purge queue (async)
> > > > >                        -32> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  4 mds.0.purge_queue open:
> > > > > opening
> > > > >                        -31> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  1 mds.0.journaler.pq(ro)
> > > > > recover
> > > > >                     start
> > > > >                        -30> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  1 mds.0.journaler.pq(ro)
> > > > > read_head
> > > > >                        -29> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 0:
> > > > >                     loading open file table (async)
> > > > >                        -28> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 0:
> > > > >                     opening snap table
> > > > >                        -27> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1847677700  4 mds.0.journalpointer
> > > > > Reading
> > > > >                     journal pointer '400.00000000'
> > > > >                        -26> 2024-06-22T05:41:43.224+0100
> > > > >                     7f1850689700 10 monclient:
> > > > > get_auth_request con
> > > > >                     0x5592e8987000 auth_method 0
> > > > >                        -25> 2024-06-22T05:41:43.225+0100
> > > > >                     7f1850e8a700 10 monclient:
> > > > > get_auth_request con
> > > > >                     0x5592e8987c00 auth_method 0
> > > > >                        -24> 2024-06-22T05:41:43.252+0100
> > > > >                     7f1848e7a700  1 mds.0.journaler.pq(ro)
> > > > >                     _finish_read_head loghead(trim
> > > > > 231160676352,
> > > > >                     expire 231163662875, write 231163662875,
> > > > >                     stream_format 1).  probing for end of log
> > > > > (from
> > > > >                     231163662875)...
> > > > >                        -23> 2024-06-22T05:41:43.252+0100
> > > > >                     7f1848e7a700  1 mds.0.journaler.pq(ro)
> > > > > probing
> > > > >                     for end of the log
> > > > >                        -22> 2024-06-22T05:41:43.252+0100
> > > > >                     7f1847677700  1 mds.0.journaler.mdlog(ro)
> > > > >                     recover start
> > > > >                        -21> 2024-06-22T05:41:43.252+0100
> > > > >                     7f1847677700  1 mds.0.journaler.mdlog(ro)
> > > > > read_head
> > > > >                        -20> 2024-06-22T05:41:43.252+0100
> > > > >                     7f1847677700  4 mds.0.log Waiting for
> > > > > journal
> > > > >                     0x200 to recover...
> > > > >                        -19> 2024-06-22T05:41:43.252+0100
> > > > >                     7f1850689700 10 monclient:
> > > > > get_auth_request con
> > > > >                     0x5592e8bc6000 auth_method 0
> > > > >                        -18> 2024-06-22T05:41:43.253+0100
> > > > >                     7f185168b700 10 monclient:
> > > > > get_auth_request con
> > > > >                     0x5592e8bc6800 auth_method 0
> > > > >                        -17> 2024-06-22T05:41:43.257+0100
> > > > >                     7f1847e78700  1 mds.0.journaler.mdlog(ro)
> > > > >                     _finish_read_head loghead(trim
> > > > > 90131453181952,
> > > > >                     expire 90131465778558, write
> > > > > 90132009715463,
> > > > >                     stream_format 1).  probing for end of log
> > > > > (from
> > > > >                     90132009715463)...
> > > > >                        -16> 2024-06-22T05:41:43.257+0100
> > > > >                     7f1847e78700  1 mds.0.journaler.mdlog(ro)
> > > > >                     probing for end of the log
> > > > >                        -15> 2024-06-22T05:41:43.257+0100
> > > > >                     7f1847e78700  1 mds.0.journaler.mdlog(ro)
> > > > >                     _finish_probe_end write_pos =
> > > > > 90132019384791
> > > > >                     (header had 90132009715463). recovered.
> > > > >                        -14> 2024-06-22T05:41:43.257+0100
> > > > >                     7f1847677700  4 mds.0.log Journal 0x200
> > > > > recovered.
> > > > >                        -13> 2024-06-22T05:41:43.257+0100
> > > > >                     7f1847677700  4 mds.0.log Recovered
> > > > > journal
> > > > >                     0x200 in format 1
> > > > >                        -12> 2024-06-22T05:41:43.273+0100
> > > > >                     7f1848e7a700  1 mds.0.journaler.pq(ro)
> > > > >                     _finish_probe_end write_pos =
> > > > > 231163662875
> > > > >                     (header had 231163662875). recovered.
> > > > >                        -11> 2024-06-22T05:41:43.273+0100
> > > > >                     7f1848e7a700  4 mds.0.purge_queue
> > > > > operator():
> > > > >                     open complete
> > > > >                        -10> 2024-06-22T05:41:43.273+0100
> > > > >                     7f1848e7a700  1 mds.0.journaler.pq(ro)
> > > > >                     set_writeable
> > > > >                         -9> 2024-06-22T05:41:43.441+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 1:
> > > > >                     loading/discovering base inodes
> > > > >                         -8> 2024-06-22T05:41:43.441+0100
> > > > >                     7f1847e78700  0 mds.0.cache creating
> > > > > system
> > > > >                     inode with ino:0x100
> > > > >                         -7> 2024-06-22T05:41:43.442+0100
> > > > >                     7f1847e78700  0 mds.0.cache creating
> > > > > system
> > > > >                     inode with ino:0x1
> > > > >                         -6> 2024-06-22T05:41:43.442+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 2:
> > > > >                     replaying mds log
> > > > >                         -5> 2024-06-22T05:41:43.442+0100
> > > > >                     7f1847e78700  2 mds.0.2529650 Booting: 2:
> > > > >                     waiting for purge queue recovered
> > > > >                         -4> 2024-06-22T05:41:44.090+0100
> > > > >                     7f184ce82700 10 monclient: tick
> > > > >                         -3> 2024-06-22T05:41:44.090+0100
> > > > >                     7f184ce82700 10 monclient:
> > > > > _check_auth_rotating
> > > > >                     have uptodate secrets (they expire after
> > > > >                     2024-06-22T05:41:14.091638+0100)
> > > > >                         -2> 2024-06-22T05:41:44.210+0100
> > > > >                     7f1849e7c700  2 mds.0.cache Memory
> > > > > usage: 
> > > > >                     total 588368, rss 308304, heap 207132,
> > > > > baseline
> > > > >                     182556, 0 / 15149 inodes have caps, 0
> > > > > caps, 0
> > > > >                     caps per inode
> > > > >                         -1> 2024-06-22T05:41:44.642+0100
> > > > >                     7f1846675700 -1
> > > > >                     /home/jenkins-build/build/workspace/ceph-
> > > > > build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos
> > > > > 8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/B
> > > > > UILD/ceph-17.2.7/src/include/interval_set.h:
> > > > >                     In function 'void interval_set<T,
> > > > > C>::erase(T,
> > > > >                     T, std::function<bool(T, T)>) [with T =
> > > > >                     inodeno_t; C = std::map]' thread
> > > > > 7f1846675700
> > > > >                     time 2024-06-22T05:41:44.643146+0100
> > > > > 
> > > > >                      ceph version 17.2.7
> > > > >                    
> > > > > (b12291d110049b2f35e32e0de30d70e9a4c060d2)
> > > > >                     quincy (stable)
> > > > >                      1: (ceph::__ceph_assert_fail(char
> > > > > const*, char
> > > > >                     const*, int, char const*)+0x135)
> > > > > [0x7f18568b64a3]
> > > > >                      2:
> > > > >                     /usr/lib64/ceph/libceph-
> > > > > common.so.2(+0x269669)
> > > > >                     [0x7f18568b6669]
> > > > >                      3: (interval_set<inodeno_t,
> > > > >                     std::map>::erase(inodeno_t, inodeno_t,
> > > > >                     std::function<bool (inodeno_t,
> > > > >                     inodeno_t)>)+0x2e5) [0x5592e5027885]
> > > > >                      4: (EMetaBlob::replay(MDSRank*,
> > > > > LogSegment*,
> > > > >                     int, MDPeerUpdate*)+0x4377)
> > > > > [0x5592e532c7b7]
> > > > >                      5: (EUpdate::replay(MDSRank*)+0x61)
> > > > >                     [0x5592e5330bd1]
> > > > >                      6: (MDLog::_replay_thread()+0x7bb)
> > > > >                     [0x5592e52b754b]
> > > > >                      7: (MDLog::ReplayThread::entry()+0x11)
> > > > >                     [0x5592e4f6a041]
> > > > >                      8: /lib64/libpthread.so.0(+0x81ca)
> > > > >                     [0x7f18558a41ca]
> > > > >                      9: clone()
> > > > > 
> > > > >                          0> 2024-06-22T05:41:44.643+0100
> > > > >                     7f1846675700 -1 *** Caught signal
> > > > > (Aborted) **
> > > > >                      in thread 7f1846675700
> > > > > thread_name:md_log_replay
> > > > > 
> > > > >                      ceph version 17.2.7
> > > > >                    
> > > > > (b12291d110049b2f35e32e0de30d70e9a4c060d2)
> > > > >                     quincy (stable)
> > > > >                      1: /lib64/libpthread.so.0(+0x12cf0)
> > > > >                     [0x7f18558aecf0]
> > > > >                      2: gsignal()
> > > > >                      3: abort()
> > > > >                      4: (ceph::__ceph_assert_fail(char
> > > > > const*, char
> > > > >                     const*, int, char const*)+0x18f)
> > > > > [0x7f18568b64fd]
> > > > >                      5:
> > > > >                     /usr/lib64/ceph/libceph-
> > > > > common.so.2(+0x269669)
> > > > >                     [0x7f18568b6669]
> > > > >                      6: (interval_set<inodeno_t,
> > > > >                     std::map>::erase(inodeno_t, inodeno_t,
> > > > >                     std::function<bool (inodeno_t,
> > > > >                     inodeno_t)>)+0x2e5) [0x5592e5027885]
> > > > >                      7: (EMetaBlob::replay(MDSRank*,
> > > > > LogSegment*,
> > > > >                     int, MDPeerUpdate*)+0x4377)
> > > > > [0x5592e532c7b7]
> > > > >                      8: (EUpdate::replay(MDSRank*)+0x61)
> > > > >                     [0x5592e5330bd1]
> > > > >                      9: (MDLog::_replay_thread()+0x7bb)
> > > > >                     [0x5592e52b754b]
> > > > >                      10: (MDLog::ReplayThread::entry()+0x11)
> > > > >                     [0x5592e4f6a041]
> > > > >                      11: /lib64/libpthread.so.0(+0x81ca)
> > > > >                     [0x7f18558a41ca]
> > > > >                      12: clone()
> > > > > 
> > > > >                 We have a relatively low debug setting normally so
> > > > >                 I don't think many details of the initial crash
> > > > >                 were captured unfortunately and the MDS logs before
> > > > >                 the above (i.e. "-60" and older) are just beacon
> > > > >                 messages and _check_auth_rotating checks.
> > > > > 
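> > > > >                 If it would help, we can bump the MDS debug level
> > > > >                 before reproducing the crash, e.g. something like:
> > > > > 
> > > > >                 $ ceph config set mds debug_mds 20
> > > > >                 $ ceph config set mds debug_journaler 20
> > > > > 
> > > > >                 and then restart an MDS to capture a more verbose
> > > > >                 replay log.
> > > > > 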
> > > > >                 I was wondering whether you have any
> > > > >                 recommendations in terms of what actions we
> > > > > could
> > > > >                 take to bring our filesystem back into a
> > > > > working
> > > > >                 state short of rebuilding the entire metadata
> > > > > pool?
> > > > >                 We are quite keen to bring our backup back
> > > > > into
> > > > >                 service urgently as we currently do not have
> > > > > any
> > > > >                 accessible backups for our Ceph clusters.
> > > > > 
> > > > >                 Kindest regards,
> > > > > 
> > > > >                 Ivan
> > > > > 
> > > > >                 On 25/06/2024 19:18, Dhairya Parmar wrote:
> > > > > > 
> > > > > >                 On Tue, Jun 25, 2024 at 6:38 PM Ivan
> > > > > > Clayson
> > > > > >                 <ivan@xxxxxxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > >                     Hi Dhairya,
> > > > > > 
> > > > > >                     Thank you for your rapid reply. I tried
> > > > > >                     recovering the dentries for the file
> > > > > > just
> > > > > >                     before the crash I mentioned before and
> > > > > > then
> > > > > >                     splicing the transactions from the
> > > > > > journal
> > > > > >                     which seemed to remove that issue for
> > > > > > that
> > > > > >                     inode but resulted in the MDS crashing
> > > > > > on the
> > > > > >                     next inode in the journal when
> > > > > > performing replay.
> > > > > > 
> > > > > >                 The MDS delegates a range of preallocated
> > > > > > inodes
> > > > > >                 (in form of a set - interval_set<inodeno_t>
> > > > > >                 preallocated_inos) to the clients, so it
> > > > > > can be
> > > > > >                 one inode that is untracked or some inodes
> > > > > > from
> > > > > >                 the range or in worst case scenario - ALL,
> > > > > > and
> > > > > >                 this is something that even the
> > > > > >                 `cephfs-journal-tool` would not be able to
> > > > > > tell
> > > > > >                 (since we're talking about MDS internals
> > > > > > which
> > > > > >                 aren't exposed to such tools). That is the
> > > > > > reason
> > > > > >                 why you see "MDS crashing on the next inode
> > > > > > in the
> > > > > >                 journal when performing replay".
> > > > > > 
> > > > > >                 An option could be to expose the inode set
> > > > > > to some
> > > > > >                 tool or asok cmd to identify such inodes
> > > > > > ranges,
> > > > > >                 which needs to be discussed. For now, we're
> > > > > > trying
> > > > > >                 to address this in [0], you can follow the
> > > > > >                 discussion there.
> > > > > > 
> > > > > >                 [0] https://tracker.ceph.com/issues/66251
> > > > > > 
> > > > > >                     Removing all the transactions involving
> > > > > > the
> > > > > >                     directory housing the files that seemed
> > > > > > to
> > > > > >                     cause these crashes from the journal
> > > > > > only
> > > > > >                     caused the MDS to fail to even start
> > > > > > replay.
> > > > > > 
> > > > > >                     I've rolled back our journal to our
> > > > > > original
> > > > > >                     version when the crash first happened
> > > > > > and the
> > > > > >                     entire MDS log for the crash can be
> > > > > > found
> > > > > >                     here:
> > > > > >                    
> > > > > > https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
> > > > > > 
> > > > > >                 Awesome, this would help us a ton. Apart
> > > > > > from
> > > > > >                 this, would it be possible to send us
> > > > > > client logs?
> > > > > > 
> > > > > >                     Please let us know if you would like
> > > > > > any other
> > > > > >                     logs file as we can easily induce this
> > > > > > crash.
> > > > > > 
> > > > > >                 Since you can easily induce the crash, can
> > > > > > you
> > > > > >                 share the reproducer please i.e. what all
> > > > > > action
> > > > > >                 you take in order to hit this?
> > > > > > 
> > > > > >                     Kindest regards,
> > > > > > 
> > > > > >                     Ivan
> > > > > > 
> > > > > >                     On 25/06/2024 09:58, Dhairya Parmar
> > > > > > wrote:
> > > > > > > 
> > > > > > >                     Hi Ivan,
> > > > > > > 
> > > > > > >                     This looks to be similar to the issue
> > > > > > > [0]
> > > > > > >                     that we're already addressing at [1].
> > > > > > > So
> > > > > > >                     basically there is some out-of-sync
> > > > > > > event
> > > > > > >                     that led the client to make use of
> > > > > > > the inodes
> > > > > > >                     that MDS wasn't aware of/isn't
> > > > > > > tracking and
> > > > > > >                     hence the crash. It'd be really
> > > > > > > helpful if
> > > > > > >                     you can provide us more logs.
> > > > > > > 
> > > > > > >                     CC @Rishabh Dave @Venky Shankar
> > > > > > >                     @Patrick Donnelly @Xiubo Li
> > > > > > > 
> > > > > > >                     [0]
> > > > > > > https://tracker.ceph.com/issues/61009
> > > > > > >                     [1]
> > > > > > > https://tracker.ceph.com/issues/66251
> > > > > > >                     --
> > > > > > >                     Dhairya Parmar
> > > > > > > 
> > > > > > >                     Associate Software Engineer, CephFS
> > > > > > > 
> > > > > > >                     IBM, Inc. <https://www.redhat.com/>
> > > > > > > 
> > > > > > > 
> > > > > > >                     On Mon, Jun 24, 2024 at 8:54 PM Ivan
> > > > > > > Clayson
> > > > > > >                     <ivan@xxxxxxxxxxxxxxxxx> wrote:
> > > > > > > 
> > > > > > >                         Hello,
> > > > > > > 
> > > > > > >                         We have been experiencing a
> > > > > > > serious issue
> > > > > > >                         with our CephFS backup cluster
> > > > > > >                         running quincy (version 17.2.7)
> > > > > > > on a
> > > > > > >                         RHEL8-derivative Linux kernel
> > > > > > >                         (Alma8.9, 4.18.0-513.9.1 kernel)
> > > > > > > where
> > > > > > >                         our MDSes for our filesystem are
> > > > > > >                         constantly in a "replay" or
> > > > > > >                         "replay(laggy)" state and keep
> > > > > > > crashing.
> > > > > > > 
> > > > > > >                         We have a single MDS filesystem
> > > > > > > called
> > > > > > >                         "ceph_backup" with 2 standby
> > > > > > >                         MDSes along with a 2nd unused
> > > > > > > filesystem
> > > > > > >                         "ceph_archive" (this holds
> > > > > > >                         little to no data) where we are
> > > > > > > using our
> > > > > > >                         "ceph_backup" filesystem to
> > > > > > >                         backup our data and this is the
> > > > > > > one which
> > > > > > >                         is currently broken. The Ceph
> > > > > > >                         health outputs currently are:
> > > > > > > 
> > > > > > >                         root@pebbles-s1 14:05 [~]: ceph -s
> > > > > > >                           cluster:
> > > > > > >                             id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
> > > > > > >                             health: HEALTH_WARN
> > > > > > >                                     1 filesystem is degraded
> > > > > > >                                     insufficient standby MDS daemons available
> > > > > > >                                     1319 pgs not deep-scrubbed in time
> > > > > > >                                     1054 pgs not scrubbed in time
> > > > > > > 
> > > > > > >                           services:
> > > > > > >                             mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
> > > > > > >                             mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
> > > > > > >                             mds: 2/2 daemons up
> > > > > > >                             osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs
> > > > > > > 
> > > > > > >                           data:
> > > > > > >                             volumes: 1/2 healthy, 1 recovering
> > > > > > >                             pools:   7 pools, 2177 pgs
> > > > > > >                             objects: 3.55G objects, 7.0 PiB
> > > > > > >                             usage:   8.9 PiB used, 14 PiB / 23 PiB avail
> > > > > > >                             pgs:     83133528/30006841533 objects misplaced (0.277%)
> > > > > > >                                      2090 active+clean
> > > > > > >                                      47   active+clean+scrubbing+deep
> > > > > > >                                      29   active+remapped+backfilling
> > > > > > >                                      8    active+remapped+backfill_wait
> > > > > > >                                      2    active+clean+scrubbing
> > > > > > >                                      1    active+clean+snaptrim
> > > > > > > 
> > > > > > >                           io:
> > > > > > >                             recovery: 1.9 GiB/s, 719 objects/s
> > > > > > > 
> > > > > > >                         root@pebbles-s1 14:09 [~]: ceph fs status
> > > > > > >                         ceph_backup - 0 clients
> > > > > > >                         ===========
> > > > > > >                         RANK      STATE         MDS      ACTIVITY  DNS  INOS  DIRS  CAPS
> > > > > > >                          0    replay(laggy)  pebbles-s3             0     0     0     0
> > > > > > >                                 POOL            TYPE     USED  AVAIL
> > > > > > >                            mds_backup_fs      metadata  1255G  2780G
> > > > > > >                         ec82_primary_fs_data    data       0   2780G
> > > > > > >                               ec82pool          data    8442T  3044T
> > > > > > >                         ceph_archive - 2 clients
> > > > > > >                         ============
> > > > > > >                         RANK  STATE      MDS        ACTIVITY     DNS    INOS  DIRS  CAPS
> > > > > > >                          0    active  pebbles-s2  Reqs:    0 /s  13.4k  7105   118     2
> > > > > > >                                 POOL            TYPE     USED  AVAIL
> > > > > > >                            mds_archive_fs     metadata  5184M  2780G
> > > > > > >                         ec83_primary_fs_data    data       0   2780G
> > > > > > >                               ec83pool          data     138T  2767T
> > > > > > >                         MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> > > > > > >                         root@pebbles-s1 14:09 [~]: ceph health detail | head
> > > > > > >                         HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not scrubbed in time
> > > > > > >                         [WRN] FS_DEGRADED: 1 filesystem is degraded
> > > > > > >                             fs ceph_backup is degraded
> > > > > > >                         [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
> > > > > > >                             have 0; want 1 more
> > > > > > > 
> > > > > > >                         When our cluster first ran after
> > > > > > > a
> > > > > > >                         reboot, Ceph ran through the 2
> > > > > > >                         standby MDSes, crashing them all,
> > > > > > > until
> > > > > > >                         it reached the final MDS and is
> > > > > > >                         now stuck in this "replay(laggy)"
> > > > > > > state.
> > > > > > >                         Putting our MDSes into
> > > > > > >                         debugging mode, we can see that
> > > > > > > this MDS
> > > > > > >                         crashed when replaying the
> > > > > > >                         journal for a particular inode
> > > > > > > (this is
> > > > > > >                         the same for all the MDSes and
> > > > > > >                         they all crash on the same
> > > > > > > object):
> > > > > > > 
> > > > > > >                             ...
> > > > > > >                         2024-06-24T13:44:55.563+0100
> > > > > > > 7f8811c40700
> > > > > > >                         10 mds.0.journal
> > > > > > >                         EMetaBlob.replay for [521,head]
> > > > > > > had
> > > > > > >                         [inode 0x1005ba89481
> > > > > > >                             [...539,head]
> > > > > > >                         /cephfs-
> > > > > > > users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3
> > > > > > > /cryolo/test_micrographs/
> > > > > > >                             auth fragtree_t(*^2 00*^3
> > > > > > > 00000*^
> > > > > > >                             4 00001*^3 00010*^4 00011*^4
> > > > > > > 00100*^4
> > > > > > >                         00101*^4 00110*^4 00111*^4
> > > > > > >                             01*^3 01000*^4 01001*^3
> > > > > > > 01010*^4
> > > > > > >                         01011*^3 01100*^4 01101*^4
> > > > > > > 01110*^4
> > > > > > >     01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22T05:41:10.213700+0100 1281276=1281276+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1) old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay dir 0x1005ba89481.011011000*
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ [2,head] auth v=436385 cv=0/0 state=1073741824 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 0x557710444500]
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680]
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680]
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay inotable tablev 3112837 <= table 3112837
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc [] used 0x1005cd4fe35
> > > > > > >     2024-06-24T13:44:55.563+0100 7f8811c40700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
> > > > > > >     /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)
> > > > > > >
> > > > > > >      ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> > > > > > >      1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f8821e814a3]
> > > > > > >      2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
> > > > > > >      3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
> > > > > > >      4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
> > > > > > >      5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
> > > > > > >      6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
> > > > > > >      7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
> > > > > > >      8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
> > > > > > >      9: clone()
> > > > > > > 
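> > > > > > >     In case it helps the investigation: my (possibly incorrect) reading
> > > > > > >     of the trace is that the assert fires while replay erases the
> > > > > > >     just-used inode from the session's preallocated-inode interval set,
> > > > > > >     which the preceding log line reports as empty ("prealloc []
> > > > > > >     used 0x1005cd4fe35"). If it is useful for cross-checking, the MDS
> > > > > > >     inode table and session map can be dumped read-only with the
> > > > > > >     "show" mode of cephfs-table-tool, roughly as below ("all" can be
> > > > > > >     replaced with a specific rank):
> > > > > > >
> > > > > > >         # read-only dumps of the MDS tables ("show" does not modify anything)
> > > > > > >         cephfs-table-tool all show inode > inotable.json
> > > > > > >         cephfs-table-tool all show session > sessionmap.json
> > > > > > >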
> > > > > > >     I've only included a short section of the crash (this is the
> > > > > > >     first trace in the log relating to the crash, captured with
> > > > > > >     debug_mds set to 10/20). We tried deleting the 0x1005cd4fe35
> > > > > > >     object from the object store using the "rados" command, but
> > > > > > >     this did not allow our MDS to replay successfully.
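> > > > > > >
> > > > > > >     For reference, the removal we attempted was roughly along these
> > > > > > >     lines (the pool name is a placeholder for our data pool; CephFS
> > > > > > >     data objects are named <inode-hex>.<block-index>, so inode
> > > > > > >     0x1005cd4fe35 starts at object 1005cd4fe35.00000000):
> > > > > > >
> > > > > > >         # destructive: remove the first data object for inode 0x1005cd4fe35
> > > > > > >         rados -p <data_pool> rm 1005cd4fe35.00000000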
> > > > > > > 
> > > > > > >     From my understanding the journal seems okay: we didn't run
> > > > > > >     out of space on our metadata pool, for example, and
> > > > > > >     "cephfs-journal-tool journal inspect" doesn't report any
> > > > > > >     damage:
> > > > > > > 
> > > > > > >     root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0 journal inspect
> > > > > > >     Overall journal integrity: OK
> > > > > > >     root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 summary
> > > > > > >     Events by type:
> > > > > > >       OPEN: 1
> > > > > > >       UPDATE: 3
> > > > > > >     Errors: 0
> > > > > > >     root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 list
> > > > > > >     2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE:  (openc)
> > > > > > >       test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
> > > > > > >     2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE:  (check_inode_max_size)
> > > > > > >       test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
> > > > > > >       test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
> > > > > > >     2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN:  ()
> > > > > > >       FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
> > > > > > >     2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE:  (rename)
> > > > > > >       test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
> > > > > > >       test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
> > > > > > > 
> > > > > > >     I was wondering whether anyone had any advice on how we
> > > > > > >     should proceed. We were thinking about manually applying
> > > > > > >     these events (via "event apply") or, failing that, erasing
> > > > > > >     the problematic events with "cephfs-journal-tool
> > > > > > >     --rank=ceph_backup:0 event splice --inode 1101069090357".
> > > > > > >     Is this a good idea? We would rather not rebuild the entire
> > > > > > >     metadata pool if we can avoid it (once was enough for us),
> > > > > > >     as this cluster has ~9 PB of data on it.
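> > > > > > >
> > > > > > >     For concreteness, the sequence we had in mind is roughly the
> > > > > > >     following (rank and inode as above); corrections are very
> > > > > > >     welcome:
> > > > > > >
> > > > > > >         # back up the journal before changing anything
> > > > > > >         cephfs-journal-tool --rank=ceph_backup:0 journal export backup.bin
> > > > > > >         # preview the affected events, then splice them out
> > > > > > >         cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 summary
> > > > > > >         cephfs-journal-tool --rank=ceph_backup:0 event splice --inode 1101069090357 summary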
> > > > > > > 
> > > > > > >                         Kindest regards,
> > > > > > > 
> > > > > > >                         Ivan Clayson
> > > > > > > 
> > > > > > >                         -- 
> > > > > > >                         Ivan Clayson
> > > > > > >                         -----------------
> > > > > > >                         Scientific Computing Officer
> > > > > > >                         Room 2N249
> > > > > > >                         Structural Studies
> > > > > > >                         MRC Laboratory of Molecular
> > > > > > > Biology
> > > > > > >                         Francis Crick Ave, Cambridge
> > > > > > >                         CB2 0QH
> > > > > > >                        

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



