Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> · Wed, 10 Jul 2024 10:08:24 +0100

Hi Tim,

Alma8's active support ended in May this year and henceforth there are 
only security updates. But you make a good point and we are moving 
toward Alma9 very shortly!

Whilst we're mentioning distributions, we've had quite a good experience 
with Alma (notwithstanding our current but unrelated troubles) and we 
would recommend it.

Kindest regards,

Ivan

On 09/07/2024 16:19, Tim Holloway wrote:
CAUTION: This email originated from outside of the LMB:
.-timh@xxxxxxxxxxxxx-.
Do not click links or open attachments unless you recognize the sender and know the content is safe.
If you think this is a phishing email, please forward it to phishing@xxxxxxxxxxxxxxxxx
--

Ivan,

This may be a little off-topic, but if you're still running AlmaLinux
8,9, it's worth noting that CentOS 8 actually end-of-lifed about 2
years ago, thanks to CentOS Stream.

Up until this last week, however, I had several AlmaLinux 8 machines
running myself, but apparently somewhere around May IBM Red Hat pulled
all of its CentOS8 enterprise sites offline, including Storage and
Ceph, which broke my yum updates.

While as far as I'm aware, once you've installed cephadm (whether via
yum/dnf or otherwise), there's no further need for the RPM repos,
losing yum support is not helping at the very least.

On the upside, it's possible to upgrade-in-place from AlmaLinux 8.9 to
AlmaLinux 9, although it may require temporarily disabling certain OS
services to appease the upgrade process.

Probably won't solve your problem, but at least you'll be able to move
fairly painlessly to a better-supported platform.

   Best Regards,
      Tim

On Tue, 2024-07-09 at 11:14 +0100, Ivan Clayson wrote:
Hi Dhairya,

I would be more than happy to try and give as many details as
possible
but the slack channel is private and requires my email to have an
account/ access to it.

Wouldn't taking the discussion about this error to a private channel
also stop other users who experience this error from learning about
how
and why this happened as  well as possibly not be able to view the
solution? Would it not be possible to discuss this more publicly for
the
benefit of the other users on the mailing list?

Kindest regards,

Ivan

On 09/07/2024 10:44, Dhairya Parmar wrote:
CAUTION: This email originated from outside of the LMB:
*.-dparmar@xxxxxxxxxx-.*
Do not click links or open attachments unless you recognize the
sender
and know the content is safe.
If you think this is a phishing email, please forward it to
phishing@xxxxxxxxxxxxxxxxx

--

Hey Ivan,

This is a relatively new MDS crash, so this would require some
investigation but I was instructed to recommend disaster-recovery
steps [0] (except session reset) to you to get the FS up again.

This crash is being discussed on upstream CephFS slack channel [1]
with @Venky Shankar <mailto:vshankar@xxxxxxxxxx> and other CephFS
devs. I'd encourage you to join the conversation, we can discuss
this
in detail and maybe go through the incident step by step which
should
help analyse the crash better.

[0]
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
[1]
https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx>
wrote:

     Hi Dhairya,

     Thank you ever so much for having another look at this so
quickly.
     I don't think I have any logs similar to the ones you
referenced
     this time as my MDSs don't seem to enter the replay stage when
     they crash (or at least don't now after I've thrown the logs
away)
     but those errors do crop up in the prior logs I shared when the
     system first crashed.

     Kindest regards,

     Ivan

     On 08/07/2024 14:08, Dhairya Parmar wrote:
     CAUTION: This email originated from outside of the LMB:
     *.-dparmar@xxxxxxxxxx-.*
     Do not click links or open attachments unless you recognize
the
     sender and know the content is safe.
     If you think this is a phishing email, please forward it to
     phishing@xxxxxxxxxxxxxxxxx

     --

     Ugh, something went horribly wrong. I've downloaded the MDS
logs
     that contain assertion failure and it looks relevant to this
[0].
     Do you have client logs for this?

     The other log that you shared is being downloaded right now,
once
     that's done and I'm done going through it, I'll update you.

     [0] https://tracker.ceph.com/issues/54546

     On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
     <ivan@xxxxxxxxxxxxxxxxx> wrote:

         Hi Dhairya,

         Sorry to resurrect this thread again, but we still
         unfortunately have an issue with our filesystem after we
         attempted to write new backups to it.

         We finished the scrub of the filesystem on Friday and ran
a
         repair scrub on the 1 directory which had metadata
damage.
         After doing so and rebooting, the cluster reported no
issues
         and data was accessible again.

         We re-started the backups to run over the weekend and
         unfortunately the filesystem crashed again where the log
of
         the failure is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz
.
         We ran the backups on kernel mounts of the filesystem
without
         the nowsync option this time to avoid the out-of-sync
write
         problems..

         I've tried resetting the journal again after recovering
the
         dentries but unfortunately the filesystem is still in a
         failed state despite setting joinable to true. The log of
         this crash is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708
.

         I'm not sure how to proceed as I can't seem to get any
MDS to
         take over the first rank. I would like to do a scrub of
the
         filesystem and preferably overwrite the troublesome files
         with the originals on the live filesystem. Do you have
any
         advice on how to make the filesystem leave its failed
state?
         I have a backup of the journal before I reset it so I can
         roll back if necessary.

         Here are some details about the filesystem at present:

             root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
               cluster:
                 id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
                 health: HEALTH_ERR
                         1 filesystem is degraded
                         1 large omap objects
                         1 filesystem is offline
                         1 mds daemon damaged
             nobackfill,norebalance,norecover,noscrub,nodeep-
scrub,nosnaptrim
             flag(s) set
                         1750 pgs not deep-scrubbed in time
                         1612 pgs not scrubbed in time

               services:
                 mon: 4 daemons, quorum
             pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
                 mgr: pebbles-s2(active, since 77m), standbys:
             pebbles-s1, pebbles-s3, pebbles-s4
                 mds: 1/2 daemons up, 3 standby
                 osd: 1380 osds: 1380 up (since 76m), 1379 in
(since
             10d); 10 remapped pgs
                      flags
             nobackfill,norebalance,norecover,noscrub,nodeep-
scrub,nosnaptrim

               data:
                 volumes: 1/2 healthy, 1 recovering; 1 damaged
                 pools:   7 pools, 2177 pgs
                 objects: 3.24G objects, 6.7 PiB
                 usage:   8.6 PiB used, 14 PiB / 23 PiB avail
                 pgs:     11785954/27384310061 objects misplaced
(0.043%)
                          2167 active+clean
                          6    active+remapped+backfilling
                          4    active+remapped+backfill_wait

             ceph_backup - 0 clients
             ===========
             RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS CAPS
              0 failed
                     POOL            TYPE     USED  AVAIL
                mds_backup_fs      metadata  1174G  3071G
             ec82_primary_fs_data    data       0   3071G
                   ec82pool          data    8085T  4738T
             ceph_archive - 2 clients
             ============
             RANK  STATE      MDS         ACTIVITY     DNS INOS
             DIRS   CAPS
              0    active  pebbles-s4  Reqs:    0 /s  13.4k
7105
             118      2
                     POOL            TYPE     USED  AVAIL
                mds_archive_fs     metadata  5184M  3071G
             ec83_primary_fs_data    data       0   3071G
                   ec83pool          data     138T  4307T
             STANDBY MDS
              pebbles-s2
              pebbles-s3
              pebbles-s1
             MDS version: ceph version 17.2.7
             (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy
(stable)
             root@pebbles-s2 11:55 [~]: ceph fs dump
             e2643889
             enable_multiple, ever_enabled_multiple: 1,1
             default compat:
compat={},rocompat={},incompat={1=base
             v0.20,2=client writeable ranges,3=default file
layouts on
             dirs,4=dir inode in separate object,5=mds uses
versioned
             encoding,6=dirfrag is stored in omap,8=no anchor
             table,9=file layout v2,10=snaprealm v2}
             legacy client fscid: 1

             Filesystem 'ceph_backup' (1)
             fs_name    ceph_backup
             epoch    2643888
             flags    12 joinable allow_snaps allow_multimds_snaps
             created    2023-05-19T12:52:36.302135+0100
             modified    2024-07-08T11:17:55.437861+0100
             tableserver    0
             root    0
             session_timeout    60
             session_autoclose    300
             max_file_size    109934182400000
             required_client_features    {}
             last_failure    0
             last_failure_osd_epoch    494515
             compat    compat={},rocompat={},incompat={1=base
             v0.20,2=client writeable ranges,3=default file
layouts on
             dirs,4=dir inode in separate object,5=mds uses
versioned
             encoding,6=dirfrag is stored in omap,7=mds uses
inline
             data,8=no anchor table,9=file layout v2,10=snaprealm
v2}
             max_mds    1
             in    0
             up    {}
             failed
             damaged    0
             stopped
             data_pools    [6,3]
             metadata_pool    2
             inline_data    disabled
             balancer
             standby_count_wanted    1

         Kindest regards,

         Ivan

         On 28/06/2024 15:17, Dhairya Parmar wrote:
         CAUTION: This email originated from outside of the LMB:
         *.-dparmar@xxxxxxxxxx-.*
         Do not click links or open attachments unless you
recognize
         the sender and know the content is safe.
         If you think this is a phishing email, please forward
it to
         phishing@xxxxxxxxxxxxxxxxx

         --

         On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson
         <ivan@xxxxxxxxxxxxxxxxx> wrote:

             Hi Dhairya,

             I would be more than happy to share our corrupted
             journal. Has the host key changed for drop.ceph.com
             <http://drop.ceph.com>? The fingerprint I'm being
sent
             is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4
which is
             different to the one in our
             /usr/share/ceph/known_hosts_drop.ceph.com
             <http://known_hosts_drop.ceph.com>.

         Ah, strange. Let me get in touch with folks who might
know
         about this, will revert back to you ASAP

             Thank you for your advice as well. We've reset our
MDS'
             journal and are currently in the process of a full
             filesystem scrub which understandably is taking
quite a
             bit of time but seems to be progressing through the
             objects fine.

         YAY!

             Thank you ever so much for all your help and please
do
             feel free to follow up with us if you would like
any
             further details about our crash!

         Glad to hear it went well, this bug is being worked on
with
         high priority and once the patch is ready, it will be
         backported.

         The root cause of this issue is the `nowsync` (async
dirops)
         being enabled by default with kclient [0]. This feature
         allows asynchronous creation and deletion of files,
         optimizing performance by avoiding round-trip latency
for
         these system calls. However, in very rare cases (like
yours
         :D), it can affect the system's consistency and
stability
         hence if this kind of optimization is not a priority
for
         your workload, I recommend turning it off by switching
         the mount points to `wsync` and also set the MDS config
         `mds_client_delegate_inos_pct` to `0` so that you don't
end
         up in this situation again (until the bug fix arrives
:)).

         [0]

https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902

             Kindest regards,

             Ivan

             On 27/06/2024 12:39, Dhairya Parmar wrote:
             CAUTION: This email originated from outside of
the LMB:
             *.-dparmar@xxxxxxxxxx-.*
             Do not click links or open attachments unless you
             recognize the sender and know the content is
safe.
             If you think this is a phishing email, please
forward
             it to phishing@xxxxxxxxxxxxxxxxx

             --

             Hi Ivan,

             The solution (which has been successful for us in
the
             past) is to reset the journal. This would bring
the fs
             back online and return the MDSes to a stable
state, but
             some data would be lost—the data in the journal
that
             hasn't been flushed to the backing store would be
gone.
             Therefore, you should try to flush out as much
journal
             data as possible before resetting the journal.

             Here are the steps for this entire process:

             1) Bring the FS offline
             $ ceph fs fail <fs_name>

             2) Recover dentries from journal (run it with
every MDS
             Rank)
             $ cephfs-journal-tool --rank=<fs_name>:<mds-rank>
event
             recover_dentries summary

             3) Reset the journal (again with every MDS Rank)
             $ cephfs-journal-tool --rank=<fs_name>:<mds-rank>
             journal reset

             4) Bring the FS online
             $ cephfs fs set <fs_name> joinable true

             5) Restart the MDSes

             6) Perform scrub to ensure consistency of fs
             $ ceph tell mds.<fs_name>:0 scrub start <path>
             [scrubopts] [tag]
             # you could try a recursive scrub maybe `ceph
tell
             mds.<fs_name>:0 scrub start / recursive`

             Some important notes to keep in mind:
             * Recovering dentries will take time (generally,
rank 0
             is the most time-consuming, but the rest should
be quick).
             * cephfs-journal-tool and metadata OSDs are bound
to
             use a significant CPU percentage. This is because
             cephfs-journal-tool has to swig the journal data
and
             flush it out to the backing store, which also
makes the
             metadata operations go rampant, resulting in OSDs
             taking a significant percentage of CPU.

             Do let me know how this goes.

             On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson
             <ivan@xxxxxxxxxxxxxxxxx> wrote:

                 Hi Dhairya,

                 We can induce the crash by simply restarting
the
                 MDS and the crash seems to happen when an MDS
goes
                 from up:standby to up:replay. The MDS works
through
                 a few files in the log before eventually
crashing
                 where I've included the logs for this here
(this is
                 after I imported the backed up journal which
I hope
                 was successful but please let me know if you
                 suspect it wasn't!):

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

                 With respect to the client logs, are you
referring
                 to the clients who are writing to the
filesystem?
                 We don't typically run them in any sort of
debug
                 mode and we have quite a few machines running
our
                 backup system but we can look an hour or so
before
                 the first MDS crash (though I don't know if
this is
                 when the de-sync occurred). Here are some MDS
logs
                 with regards to the initial crash on Saturday
                 morning though which may be helpful:

                        -59> 2024-06-22T05:41:43.090+0100
                     7f184ce82700 10 monclient: tick
                        -58> 2024-06-22T05:41:43.090+0100
                     7f184ce82700 10 monclient:
_check_auth_rotating
                     have uptodate secrets (they expire after
                     2024-06-22T05:41:13.091556+0100)
                        -57> 2024-06-22T05:41:43.208+0100
                     7f184de84700  1 mds.pebbles-s2 Updating
MDS map
                     to version 2529650 from mon.3
                        -56> 2024-06-22T05:41:43.208+0100
                     7f184de84700  4 mds.0.purge_queue
operator():
                     data pool 6 not found in OSDMap
                        -55> 2024-06-22T05:41:43.208+0100
                     7f184de84700  4 mds.0.purge_queue
operator():
                     data pool 3 not found in OSDMap
                        -54> 2024-06-22T05:41:43.209+0100
                     7f184de84700  5 asok(0x5592e7968000)
                     register_command objecter_requests hook
                     0x5592e78f8800
                        -53> 2024-06-22T05:41:43.209+0100
                     7f184de84700 10 monclient: _renew_subs
                        -52> 2024-06-22T05:41:43.209+0100
                     7f184de84700 10 monclient:
_send_mon_message to
                     mon.pebbles-s4 at v2:10.1.5.134:3300/0
                     <http://10.1.5.134:3300/0>
                        -51> 2024-06-22T05:41:43.209+0100
                     7f184de84700 10 log_channel(cluster)
                     update_config to_monitors: true
to_syslog:
                     false syslog_facility:  prio: info
to_graylog:
                     false graylog_host: 127.0.0.1
graylog_port: 12201)
                        -50> 2024-06-22T05:41:43.209+0100
                     7f184de84700  4 mds.0.purge_queue
operator():
                     data pool 6 not found in OSDMap
                        -49> 2024-06-22T05:41:43.209+0100
                     7f184de84700  4 mds.0.purge_queue
operator():
                     data pool 3 not found in OSDMap
                        -48> 2024-06-22T05:41:43.209+0100
                     7f184de84700  4 mds.0.0 apply_blocklist:
killed
                     0, blocklisted sessions (0 blocklist
entries, 0)
                        -47> 2024-06-22T05:41:43.209+0100
                     7f184de84700  1 mds.0.2529650
handle_mds_map i
                     am now mds.0.2529650
                        -46> 2024-06-22T05:41:43.209+0100
                     7f184de84700  1 mds.0.2529650
handle_mds_map
                     state change up:standby --> up:replay
                        -45> 2024-06-22T05:41:43.209+0100
                     7f184de84700  5 mds.beacon.pebbles-s2
                     set_want_state: up:standby -> up:replay
                        -44> 2024-06-22T05:41:43.209+0100
                     7f184de84700  1 mds.0.2529650
replay_start
                        -43> 2024-06-22T05:41:43.209+0100
                     7f184de84700  1 mds.0.2529650 waiting for
                     osdmap 473739 (which blocklists prior
instance)
                        -42> 2024-06-22T05:41:43.209+0100
                     7f184de84700 10 monclient:
_send_mon_message to
                     mon.pebbles-s4 at v2:10.1.5.134:3300/0
                     <http://10.1.5.134:3300/0>
                        -41> 2024-06-22T05:41:43.209+0100
                     7f1849e7c700  2 mds.0.cache Memory
usage:
                     total 299012, rss 37624, heap 182556,
baseline
                     182556, 0 / 0 inodes have caps, 0 caps, 0
caps
                     per inode
                        -40> 2024-06-22T05:41:43.224+0100
                     7f184de84700 10 monclient: _renew_subs
                        -39> 2024-06-22T05:41:43.224+0100
                     7f184de84700 10 monclient:
_send_mon_message to
                     mon.pebbles-s4 at v2:10.1.5.134:3300/0
                     <http://10.1.5.134:3300/0>
                        -38> 2024-06-22T05:41:43.224+0100
                     7f184de84700 10 monclient:
                     handle_get_version_reply finishing 1
version 473739
                        -37> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 0:
                     opening inotable
                        -36> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 0:
                     opening sessionmap
                        -35> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 0:
                     opening mds log
                        -34> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  5 mds.0.log open
discovering log
                     bounds
                        -33> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 0:
                     opening purge queue (async)
                        -32> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  4 mds.0.purge_queue open:
opening
                        -31> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  1 mds.0.journaler.pq(ro)
recover
                     start
                        -30> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  1 mds.0.journaler.pq(ro)
read_head
                        -29> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 0:
                     loading open file table (async)
                        -28> 2024-06-22T05:41:43.224+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 0:
                     opening snap table
                        -27> 2024-06-22T05:41:43.224+0100
                     7f1847677700  4 mds.0.journalpointer
Reading
                     journal pointer '400.00000000'
                        -26> 2024-06-22T05:41:43.224+0100
                     7f1850689700 10 monclient:
get_auth_request con
                     0x5592e8987000 auth_method 0
                        -25> 2024-06-22T05:41:43.225+0100
                     7f1850e8a700 10 monclient:
get_auth_request con
                     0x5592e8987c00 auth_method 0
                        -24> 2024-06-22T05:41:43.252+0100
                     7f1848e7a700  1 mds.0.journaler.pq(ro)
                     _finish_read_head loghead(trim
231160676352,
                     expire 231163662875, write 231163662875,
                     stream_format 1).  probing for end of log
(from
                     231163662875)...
                        -23> 2024-06-22T05:41:43.252+0100
                     7f1848e7a700  1 mds.0.journaler.pq(ro)
probing
                     for end of the log
                        -22> 2024-06-22T05:41:43.252+0100
                     7f1847677700  1 mds.0.journaler.mdlog(ro)
                     recover start
                        -21> 2024-06-22T05:41:43.252+0100
                     7f1847677700  1 mds.0.journaler.mdlog(ro)
read_head
                        -20> 2024-06-22T05:41:43.252+0100
                     7f1847677700  4 mds.0.log Waiting for
journal
                     0x200 to recover...
                        -19> 2024-06-22T05:41:43.252+0100
                     7f1850689700 10 monclient:
get_auth_request con
                     0x5592e8bc6000 auth_method 0
                        -18> 2024-06-22T05:41:43.253+0100
                     7f185168b700 10 monclient:
get_auth_request con
                     0x5592e8bc6800 auth_method 0
                        -17> 2024-06-22T05:41:43.257+0100
                     7f1847e78700  1 mds.0.journaler.mdlog(ro)
                     _finish_read_head loghead(trim
90131453181952,
                     expire 90131465778558, write
90132009715463,
                     stream_format 1).  probing for end of log
(from
                     90132009715463)...
                        -16> 2024-06-22T05:41:43.257+0100
                     7f1847e78700  1 mds.0.journaler.mdlog(ro)
                     probing for end of the log
                        -15> 2024-06-22T05:41:43.257+0100
                     7f1847e78700  1 mds.0.journaler.mdlog(ro)
                     _finish_probe_end write_pos =
90132019384791
                     (header had 90132009715463). recovered.
                        -14> 2024-06-22T05:41:43.257+0100
                     7f1847677700  4 mds.0.log Journal 0x200
recovered.
                        -13> 2024-06-22T05:41:43.257+0100
                     7f1847677700  4 mds.0.log Recovered
journal
                     0x200 in format 1
                        -12> 2024-06-22T05:41:43.273+0100
                     7f1848e7a700  1 mds.0.journaler.pq(ro)
                     _finish_probe_end write_pos =
231163662875
                     (header had 231163662875). recovered.
                        -11> 2024-06-22T05:41:43.273+0100
                     7f1848e7a700  4 mds.0.purge_queue
operator():
                     open complete
                        -10> 2024-06-22T05:41:43.273+0100
                     7f1848e7a700  1 mds.0.journaler.pq(ro)
                     set_writeable
                         -9> 2024-06-22T05:41:43.441+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 1:
                     loading/discovering base inodes
                         -8> 2024-06-22T05:41:43.441+0100
                     7f1847e78700  0 mds.0.cache creating
system
                     inode with ino:0x100
                         -7> 2024-06-22T05:41:43.442+0100
                     7f1847e78700  0 mds.0.cache creating
system
                     inode with ino:0x1
                         -6> 2024-06-22T05:41:43.442+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 2:
                     replaying mds log
                         -5> 2024-06-22T05:41:43.442+0100
                     7f1847e78700  2 mds.0.2529650 Booting: 2:
                     waiting for purge queue recovered
                         -4> 2024-06-22T05:41:44.090+0100
                     7f184ce82700 10 monclient: tick
                         -3> 2024-06-22T05:41:44.090+0100
                     7f184ce82700 10 monclient:
_check_auth_rotating
                     have uptodate secrets (they expire after
                     2024-06-22T05:41:14.091638+0100)
                         -2> 2024-06-22T05:41:44.210+0100
                     7f1849e7c700  2 mds.0.cache Memory
usage:
                     total 588368, rss 308304, heap 207132,
baseline
                     182556, 0 / 15149 inodes have caps, 0
caps, 0
                     caps per inode
                         -1> 2024-06-22T05:41:44.642+0100
                     7f1846675700 -1
                     /home/jenkins-build/build/workspace/ceph-
build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos
8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/B
UILD/ceph-17.2.7/src/include/interval_set.h:
                     In function 'void interval_set<T,
C>::erase(T,
                     T, std::function<bool(T, T)>) [with T =
                     inodeno_t; C = std::map]' thread
7f1846675700
                     time 2024-06-22T05:41:44.643146+0100

                      ceph version 17.2.7

(b12291d110049b2f35e32e0de30d70e9a4c060d2)
                     quincy (stable)
                      1: (ceph::__ceph_assert_fail(char
const*, char
                     const*, int, char const*)+0x135)
[0x7f18568b64a3]
                      2:
                     /usr/lib64/ceph/libceph-
common.so.2(+0x269669)
                     [0x7f18568b6669]
                      3: (interval_set<inodeno_t,
                     std::map>::erase(inodeno_t, inodeno_t,
                     std::function<bool (inodeno_t,
                     inodeno_t)>)+0x2e5) [0x5592e5027885]
                      4: (EMetaBlob::replay(MDSRank*,
LogSegment*,
                     int, MDPeerUpdate*)+0x4377)
[0x5592e532c7b7]
                      5: (EUpdate::replay(MDSRank*)+0x61)
                     [0x5592e5330bd1]
                      6: (MDLog::_replay_thread()+0x7bb)
                     [0x5592e52b754b]
                      7: (MDLog::ReplayThread::entry()+0x11)
                     [0x5592e4f6a041]
                      8: /lib64/libpthread.so.0(+0x81ca)
                     [0x7f18558a41ca]
                      9: clone()

                          0> 2024-06-22T05:41:44.643+0100
                     7f1846675700 -1 *** Caught signal
(Aborted) **
                      in thread 7f1846675700
thread_name:md_log_replay

                      ceph version 17.2.7

(b12291d110049b2f35e32e0de30d70e9a4c060d2)
                     quincy (stable)
                      1: /lib64/libpthread.so.0(+0x12cf0)
                     [0x7f18558aecf0]
                      2: gsignal()
                      3: abort()
                      4: (ceph::__ceph_assert_fail(char
const*, char
                     const*, int, char const*)+0x18f)
[0x7f18568b64fd]
                      5:
                     /usr/lib64/ceph/libceph-
common.so.2(+0x269669)
                     [0x7f18568b6669]
                      6: (interval_set<inodeno_t,
                     std::map>::erase(inodeno_t, inodeno_t,
                     std::function<bool (inodeno_t,
                     inodeno_t)>)+0x2e5) [0x5592e5027885]
                      7: (EMetaBlob::replay(MDSRank*,
LogSegment*,
                     int, MDPeerUpdate*)+0x4377)
[0x5592e532c7b7]
                      8: (EUpdate::replay(MDSRank*)+0x61)
                     [0x5592e5330bd1]
                      9: (MDLog::_replay_thread()+0x7bb)
                     [0x5592e52b754b]
                      10: (MDLog::ReplayThread::entry()+0x11)
                     [0x5592e4f6a041]
                      11: /lib64/libpthread.so.0(+0x81ca)
                     [0x7f18558a41ca]
                      12: clone()

                 We have a relatively low debug setting
normally so
                 I don't think many details of the initial
crash
                 were captured unfortunately and the MDS logs
before
                 the above (i.e. "-60" and older) are just
beacon
                 messages and _check_auth_rotating checks.

                 I was wondering whether you have any
                 recommendations in terms of what actions we
could
                 take to bring our filesystem back into a
working
                 state short of rebuilding the entire metadata
pool?
                 We are quite keen to bring our backup back
into
                 service urgently as we currently do not have
any
                 accessible backups for our Ceph clusters.

                 Kindest regards,

                 Ivan

                 On 25/06/2024 19:18, Dhairya Parmar wrote:
                 CAUTION: This email originated from outside
of the
                 LMB:
                 *.-dparmar@xxxxxxxxxx-.*
                 Do not click links or open attachments
unless you
                 recognize the sender and know the content
is safe.
                 If you think this is a phishing email,
please
                 forward it to phishing@xxxxxxxxxxxxxxxxx

                 --

                 On Tue, Jun 25, 2024 at 6:38 PM Ivan
Clayson
                 <ivan@xxxxxxxxxxxxxxxxx> wrote:

                     Hi Dhairya,

                     Thank you for your rapid reply. I tried
                     recovering the dentries for the file
just
                     before the crash I mentioned before and
then
                     splicing the transactions from the
journal
                     which seemed to remove that issue for
that
                     inode but resulted in the MDS crashing
on the
                     next inode in the journal when
performing replay.

                 The MDS delegates a range of preallocated
inodes
                 (in form of a set - interval_set<inodeno_t>
                 preallocated_inos) to the clients, so it
can be
                 one inode that is untracked or some inodes
from
                 the range or in worst case scenario - ALL,
and
                 this is something that even the
                 `cephfs-journal-tool` would not be able to
tell
                 (since we're talking about MDS internals
which
                 aren't exposed to such tools). That is the
reason
                 why you see "MDS crashing on the next inode
in the
                 journal when performing replay".

                 An option could be to expose the inode set
to some
                 tool or asok cmd to identify such inodes
ranges,
                 which needs to be discussed. For now, we're
trying
                 to address this in [0], you can follow the
                 discussion there.

                 [0] https://tracker.ceph.com/issues/66251

                     Removing all the transactions involving
the
                     directory housing the files that seemed
to
                     cause these crashes from the journal
only
                     caused the MDS to fail to even start
replay.

                     I've rolled back our journal to our
original
                     version when the crash first happened
and the
                     entire MDS log for the crash can be
found
                     here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24

                 Awesome, this would help us a ton. Apart
from
                 this, would it be possible to send us
client logs?

                     Please let us know if you would like
any other
                     logs file as we can easily induce this
crash.

                 Since you can easily induce the crash, can
you
                 share the reproducer please i.e. what all
action
                 you take in order to hit this?

                     Kindest regards,

                     Ivan

                     On 25/06/2024 09:58, Dhairya Parmar
wrote:
                     CAUTION: This email originated from
outside
                     of the LMB:
                     *.-dparmar@xxxxxxxxxx-.*
                     Do not click links or open
attachments unless
                     you recognize the sender and know the
content
                     is safe.
                     If you think this is a phishing
email, please
                     forward it to
phishing@xxxxxxxxxxxxxxxxx

                     --

                     Hi Ivan,

                     This looks to be similar to the issue
[0]
                     that we're already addressing at [1].
So
                     basically there is some out-of-sync
event
                     that led the client to make use of
the inodes
                     that MDS wasn't aware of/isn't
tracking and
                     hence the crash. It'd be really
helpful if
                     you can provide us more logs.

                     CC @Rishabh Dave
<mailto:ridave@xxxxxxxxxx>
                     @Venky Shankar
<mailto:vshankar@xxxxxxxxxx>
                     @Patrick Donnelly
                     <mailto:pdonnell@xxxxxxxxxx> @Xiubo
Li
                     <mailto:xiubli@xxxxxxxxxx>

                     [0]
https://tracker.ceph.com/issues/61009
                     [1]
https://tracker.ceph.com/issues/66251
                     --
                     ***Dhairya Parmar*

                     Associate Software Engineer, CephFS

                     <https://www.redhat.com/>IBM, Inc.

                     On Mon, Jun 24, 2024 at 8:54 PM Ivan
Clayson
                     <ivan@xxxxxxxxxxxxxxxxx> wrote:

                         Hello,

                         We have been experiencing a
serious issue
                         with our CephFS backup cluster
                         running quincy (version 17.2.7)
on a
                         RHEL8-derivative Linux kernel
                         (Alma8.9, 4.18.0-513.9.1 kernel)
where
                         our MDSes for our filesystem are
                         constantly in a "replay" or
                         "replay(laggy)" state and keep
crashing.

                         We have a single MDS filesystem
called
                         "ceph_backup" with 2 standby
                         MDSes along with a 2nd unused
filesystem
                         "ceph_archive" (this holds
                         little to no data) where we are
using our
                         "ceph_backup" filesystem to
                         backup our data and this is the
one which
                         is currently broken. The Ceph
                         health outputs currently are:

                         root@pebbles-s1 14:05 [~]: ceph -
s
                                cluster:
                                  id:
                         e3f7535e-d35f-4a5d-88f0-
a1e97abcd631
                                  health: HEALTH_WARN
                                          1 filesystem is
degraded
                         insufficient standby MDS daemons
available
                         1319 pgs not deep-scrubbed in
time
                         1054 pgs not scrubbed in time

                                services:
                                  mon: 4 daemons, quorum
                         pebbles-s1,pebbles-s2,pebbles-
s3,pebbles-s4
                         (age 36m)
                                  mgr: pebbles-s2(active,
since
                         36m), standbys: pebbles-s4,
                             pebbles-s3, pebbles-s1
                                  mds: 2/2 daemons up
                                  osd: 1380 osds: 1380 up
(since
                         29m), 1379 in (since 3d); 37
                             remapped pgs

                                data:
                                  volumes: 1/2 healthy, 1
recovering
                                  pools: 7 pools, 2177 pgs
                                  objects: 3.55G objects,
7.0 PiB
                                  usage: 8.9 PiB used, 14
PiB / 23
                         PiB avail
                                  pgs:
83133528/30006841533
                         objects misplaced (0.277%)
                         2090 active+clean
                         47 active+clean+scrubbing+deep
                         29 active+remapped+backfilling
                         8 active+remapped+backfill_wait
                         2 active+clean+scrubbing
                         1 active+clean+snaptrim

                                io:
                                  recovery: 1.9 GiB/s, 719
objects/s

                         root@pebbles-s1 14:09 [~]: ceph
fs status
                             ceph_backup - 0 clients
                             ===========
                             RANK STATE MDS ACTIVITY   DNS
INOS
                         DIRS CAPS
                               0 replay(laggy) pebbles-s3
0      0
                         0      0
                         POOL TYPE     USED AVAIL
                         mds_backup_fs metadata  1255G
2780G
                         ec82_primary_fs_data data       0
2780G
                         ec82pool data    8442T 3044T
                             ceph_archive - 2 clients
                             ============
                             RANK STATE MDS ACTIVITY
DNS    INOS
                         DIRS CAPS
                               0    active pebbles-s2
Reqs:    0
                         /s 13.4k  7105    118 2
                         POOL TYPE     USED AVAIL
                         mds_archive_fs metadata  5184M
2780G
                         ec83_primary_fs_data data       0
2780G
                         ec83pool data     138T 2767T
                             MDS version: ceph version
17.2.7

(b12291d110049b2f35e32e0de30d70e9a4c060d2)
                         quincy (stable)
                         root@pebbles-s1 14:09 [~]: ceph
health
                         detail | head
                             HEALTH_WARN 1 filesystem is
degraded;
                         insufficient standby MDS
                             daemons available; 1319 pgs
not
                         deep-scrubbed in time; 1054 pgs
not
                             scrubbed in time
                             [WRN] FS_DEGRADED: 1
filesystem is
                         degraded
                                  fs ceph_backup is
degraded
                             [WRN]
MDS_INSUFFICIENT_STANDBY:
                         insufficient standby MDS daemons
                             available
                                  have 0; want 1 more

                         When our cluster first ran after
a
                         reboot, Ceph ran through the 2
                         standby MDSes, crashing them all,
until
                         it reached the final MDS and is
                         now stuck in this "replay(laggy)"
state.
                         Putting our MDSes into
                         debugging mode, we can see that
this MDS
                         crashed when replaying the
                         journal for a particular inode
(this is
                         the same for all the MDSes and
                         they all crash on the same
object):

                             ...
                         2024-06-24T13:44:55.563+0100
7f8811c40700
                         10 mds.0.journal
                         EMetaBlob.replay for [521,head]
had
                         [inode 0x1005ba89481
                             [...539,head]
                         /cephfs-
users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3
/cryolo/test_micrographs/
                             auth fragtree_t(*^2 00*^3
00000*^
                             4 00001*^3 00010*^4 00011*^4
00100*^4
                         00101*^4 00110*^4 00111*^4
                             01*^3 01000*^4 01001*^3
01010*^4
                         01011*^3 01100*^4 01101*^4
01110*^4
                             01111*^4 10*^3 10000*^4
10001*^4
                         10010*^4 10011*^4 10100*^4
10101*^3
                             10110*^4 10111*^4 11*^6)
v10880645
                         f(v0 m2024-06-22
                         T05:41:10.213700+0100
1281276=1281276+0)
                         n(v12
                         rc2024-06-22T05:41:10.213700+0100
                         b1348251683896 1281277=1281276+1)
                             old_inodes=8 (iversion lock)
|
                         dirfrag=416 dirty=1
0x55770a2bdb80]
                         2024-06-24T13:44:55.563+0100
7f8811c40700
                         10 mds.0.journal
                         EMetaBlob.replay dir
0x1005ba89481.011011000*
                         2024-06-24T13:44:55.563+0100
7f8811c40700
                         10 mds.0.journal
                         EMetaBlob.replay updated dir [dir
                         0x1005ba89481.011011000*
                         /cephfs-
users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3
/cryolo/test_micrographs/
                             [2,head] auth v=436385 cv=0/0
                         state=107374182
                             4 f(v0
                         m2024-06-22T05:41:10.213700+0100
                         2502=2502+0) n(v12
                         rc2024-06-22T05:41:10.213700+0100
                         b2120744220 2502=2502+0)
                         hs=32+33,ss=0+0 dirty=65 |
child=1
                         0x55770ebcda80]
                         2024-06-24T13:44:55.563+0100
7f8811c40700
                         10 mds.0.journal
                         EMetaBlob.replay added (full)
[dentry
                         #0x1/cephfs-
users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3
/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_
27626130_20210628_005006_fracti
                         ons_ave_Z124.mrc.teberet7.partial
                         [539,head] auth NULL (dversion
                             lock) v=436384 ino=(nil)
                         state=1610612800|bottomlru |
dirty=1
                         0x557710444500]
                         2024-06-24T13:44:55.563+0100
7f8811c40700
                         10 mds.0.journal
                         EMetaBlob.replay added [inode
                         0x1005cd4fe35 [539,head]
                         /cephfs-
users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3
/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_
27626130_20210628_

005006_fractions_ave_Z124.mrc.teberet7.partial
                         auth v436384 s=0 n(v0
                             1=1+0) (iversion lock)
                         cr={99995144=0-4194304@538}
0x557710438680]
                         2024-06-24T13:44:55.563+0100
7f8811c40700 10
                         mds.0.cache.ino(0x1005cd4fe35)
                         mark_dirty_parent
                         2024-06-24T13:44:55.563+0100
7f8811c40700
                         10 mds.0.journal
                         EMetaBlob.replay noting opened
inode
                         [inode 0x1005cd4fe35 [539,head]
                         /cephfs-
users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3
/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_
2762

6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.part
ial
                         auth
                             v436384 DIRTYPARENT s=0 n(v0
1=1+0)
                         (iversion lock)
                         cr={99995144=0-4194304@538} |
                         dirtyparent=1 dirty=1
0x557710438680]
                         2024-06-24T13:44:55.563+0100
7f8811c40700
                         10 mds.0.journal
                         EMetaBlob.replay inotable tablev
3112837
                         <= table 3112837
                         2024-06-24T13:44:55.563+0100
7f8811c40700
                         10 mds.0.journal
                         EMetaBlob.replay sessionmap v
1560540883,
                         table 1560540882 prealloc
                             [] used 0x1005cd4fe35
                         2024-06-24T13:44:55.563+0100
7f8811c40700 -1
                         /home/jenkins-
build/build/workspace/ceph-
build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/ce
ntos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/r
pm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                             I
                             n function 'void
interval_set<T,
                         C>::erase(T, T,
                         std::function<bool(T, T)>) [with
T =
                         inodeno_t; C = std::map]'
                             thread 7f8811c40700 time
                         2024-06-24T13:44:55.564315+0100
                         /home/jenkins-
build/build/workspace/ceph-
build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/ce
ntos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/r
pm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
                             568: FAILED ceph_assert(p-
first <=
                         start)

                               ceph version 17.2.7

(b12291d110049b2f35e32e0de30d70e9a4c060d2)
                             quincy (stable)
                               1:
(ceph::__ceph_assert_fail(char
                         const*, char const*, int, char
                             const*)+0x135)
[0x7f8821e814a3]
                               2:
                         /usr/lib64/ceph/libceph-
common.so.2(+0x269669)
                         [0x7f8821e81669]
                               3: (interval_set<inodeno_t,
                         std::map>::erase(inodeno_t,
inodeno_t,
                         std::function<bool (inodeno_t,
                         inodeno_t)>)+0x2e5)
[0x5576f9bb2885]
                               4:
(EMetaBlob::replay(MDSRank*,
                         LogSegment*, int,
                         MDPeerUpdate*)+0x4377)
[0x5576f9eb77b7]
                               5:
(EUpdate::replay(MDSRank*)+0x61)
                         [0x5576f9ebbbd1]
                               6:
(MDLog::_replay_thread()+0x7bb)
                         [0x5576f9e4254b]
                               7:

(MDLog::ReplayThread::entry()+0x11)
                         [0x5576f9af5041]
                               8:
/lib64/libpthread.so.0(+0x81ca)
                         [0x7f8820e6f1ca]
                               9: clone()

                         I've only included a short
section of the
                         crash (this is the first trace
                         in the log with regards to the
crash with
                         a 10/20 debug_mds option). We
                         tried deleting the 0x1005cd4fe35
object
                         from the object store using the
                         "rados" command but this did not
allow
                         our MDS to successfully replay.

                          From my understanding the
journal seems
                         okay as we didn't run out of
                         space for example on our metadata
pool
                         and "cephfs-journal-tool journal
                         inspect" doesn't seem to think
there is
                         any damage:

                         root@pebbles-s1 13:58 [~]:
                         cephfs-journal-tool --
rank=ceph_backup:0
                             journal inspect
                             Overall journal integrity: OK
                         root@pebbles-s1 14:04 [~]:
                         cephfs-journal-tool --
rank=ceph_backup:0
                             event get --inode
1101069090357 summary
                             Events by type:
                                OPEN: 1
                                UPDATE: 3
                             Errors: 0
                         root@pebbles-s1 14:05 [~]:
                         cephfs-journal-tool --
rank=ceph_backup:0
                             event get --inode
1101069090357 list
                         2024-06-22T05:41:10.214635+0100
                         0x51f97d4cfe35 UPDATE:  (openc)

test_micrographs/FoilHole_27649821_Data_27626128_27626130
_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                         2024-06-22T05:41:11.203312+0100
                         0x51f97d59c848 UPDATE:
                         (check_inode_max_size)

test_micrographs/FoilHole_27649821_Data_27626128_27626130
_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial

test_micrographs/FoilHole_27649821_Data_27626128_27626130
_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
                         2024-06-22T05:41:15.484871+0100
                         0x51f97e7344cc OPEN:  ()

FoilHole_27649821_Data_27626128_27626130_20210628_005006_
fractions_ave_Z124.mrc.teberet7.partial
                         2024-06-22T05:41:15.484921+0100
                         0x51f97e73493b UPDATE:  (rename)

test_micrographs/FoilHole_27649821_Data_27626128_27626130
_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial

test_micrographs/FoilHole_27649821_Data_27626128_27626130
_20210628_005006_fractions_ave_Z124.mrc

                         I was wondering whether anyone
had any
                         advice for us on how we should
                         proceed forward? We were thinking
about
                         manually applying these events
                         (via "event apply") where failing
that we
                         could erase this problematic
                         event with "cephfs-journal-tool
                         --rank=ceph_backup:0 event splice
                         --inode 1101069090357". Is this a
good
                         idea? We would rather not rebuild
                         the entire metadata pool if we
could
                         avoid it (once was enough for us)
                         as this cluster has ~9 PB of data
on it.

                         Kindest regards,

                         Ivan Clayson

                         --
                         Ivan Clayson
                         -----------------
                         Scientific Computing Officer
                         Room 2N249
                         Structural Studies
                         MRC Laboratory of Molecular
Biology
                         Francis Crick Ave, Cambridge
                         CB2 0QH

_______________________________________________
                         ceph-users mailing list --
ceph-users@xxxxxxx
                         To unsubscribe send an email to
                         ceph-users-leave@xxxxxxx

                     --
                     Ivan Clayson
                     -----------------
                     Scientific Computing Officer
                     Room 2N249
                     Structural Studies
                     MRC Laboratory of Molecular Biology
                     Francis Crick Ave, Cambridge
                     CB2 0QH

                 --
                 Ivan Clayson
                 -----------------
                 Scientific Computing Officer
                 Room 2N249
                 Structural Studies
                 MRC Laboratory of Molecular Biology
                 Francis Crick Ave, Cambridge
                 CB2 0QH

             --
             Ivan Clayson
             -----------------
             Scientific Computing Officer
             Room 2N249
             Structural Studies
             MRC Laboratory of Molecular Biology
             Francis Crick Ave, Cambridge
             CB2 0QH

         --
         Ivan Clayson
         -----------------
         Scientific Computing Officer
         Room 2N249
         Structural Studies
         MRC Laboratory of Molecular Biology
         Francis Crick Ave, Cambridge
         CB2 0QH

     --
     Ivan Clayson
     -----------------
     Scientific Computing Officer
     Room 2N249
     Structural Studies
     MRC Laboratory of Molecular Biology
     Francis Crick Ave, Cambridge
     CB2 0QH

--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx