Ivan, before resetting the journal, could you take a backup of your journal
using `cephfs-journal-tool export` [0] and send it to us through
`ceph-post-file` [1] or any other means you're comfortable with? (A sketch of
the export and upload commands is included below.)

[0] https://docs.ceph.com/en/latest/cephfs/cephfs-journal-tool/#example-journal-import-export
[1] https://docs.ceph.com/en/latest/man/8/ceph-post-file

On Thu, Jun 27, 2024 at 5:09 PM Dhairya Parmar <dparmar@xxxxxxxxxx> wrote:

> Hi Ivan,
>
> The solution (which has been successful for us in the past) is to reset
> the journal. This would bring the fs back online and return the MDSes to a
> stable state, but some data would be lost: the data in the journal that
> hasn't been flushed to the backing store would be gone. Therefore, you
> should try to flush out as much journal data as possible before resetting
> the journal.
>
> Here are the steps for this entire process:
>
> 1) Bring the FS offline
> $ ceph fs fail <fs_name>
>
> 2) Recover dentries from the journal (run this for every MDS rank)
> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries summary
>
> 3) Reset the journal (again for every MDS rank)
> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
>
> 4) Bring the FS online
> $ ceph fs set <fs_name> joinable true
>
> 5) Restart the MDSes
>
> 6) Perform a scrub to ensure the consistency of the fs
> $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
> # for example, a recursive scrub: `ceph tell mds.<fs_name>:0 scrub start / recursive`
>
> Some important notes to keep in mind:
> * Recovering dentries will take time (generally, rank 0 is the most
> time-consuming, but the rest should be quick).
> * cephfs-journal-tool and the metadata OSDs are bound to use a significant
> amount of CPU. This is because cephfs-journal-tool has to read through the
> journal and flush it out to the backing store, which drives a heavy burst
> of metadata operations and, in turn, high CPU usage on the metadata OSDs.
>
> Do let me know how this goes.
>
> On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
> wrote:
>
>> Hi Dhairya,
>>
>> We can induce the crash by simply restarting the MDS, and the crash seems
>> to happen when an MDS goes from up:standby to up:replay. The MDS works
>> through a few files in the log before eventually crashing; I've included
>> the logs for this here (this is after I imported the backed-up journal,
>> which I hope was successful, but please let me know if you suspect it
>> wasn't!):
>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
>>
>> With respect to the client logs, are you referring to the clients that
>> are writing to the filesystem? We don't typically run them in any sort of
>> debug mode, and we have quite a few machines running our backup system,
>> but we can look an hour or so before the first MDS crash (though I don't
>> know if this is when the de-sync occurred).
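For reference, here is roughly what the export and upload mentioned above
would look like. This is only a sketch: it assumes the affected rank is
ceph_backup:0 (as in your earlier `ceph fs status` output) and the output path
is just an example, so please double-check the exact syntax against [0] and [1].

$ # export the rank 0 journal to a file (read-only, safe to run before any reset)
$ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup-rank0-journal.bin
$ # upload the exported file so the CephFS developers can access it
$ ceph-post-file -d "ceph_backup journal before reset" /root/ceph_backup-rank0-journal.bin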
Here are some MDS logs with regards to the >> initial crash on Saturday morning though which may be helpful: >> >> -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick >> -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: >> _check_auth_rotating have uptodate secrets (they expire after >> 2024-06-22T05:41:13.091556+0100) >> -57> 2024-06-22T05:41:43.208+0100 7f184de84700 1 mds.pebbles-s2 >> Updating MDS map to version 2529650 from mon.3 >> -56> 2024-06-22T05:41:43.208+0100 7f184de84700 4 mds.0.purge_queue >> operator(): data pool 6 not found in OSDMap >> -55> 2024-06-22T05:41:43.208+0100 7f184de84700 4 mds.0.purge_queue >> operator(): data pool 3 not found in OSDMap >> -54> 2024-06-22T05:41:43.209+0100 7f184de84700 5 asok(0x5592e7968000) >> register_command objecter_requests hook 0x5592e78f8800 >> -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >> _renew_subs >> -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >> -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 log_channel(cluster) >> update_config to_monitors: true to_syslog: false syslog_facility: prio: >> info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201) >> -50> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.purge_queue >> operator(): data pool 6 not found in OSDMap >> -49> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.purge_queue >> operator(): data pool 3 not found in OSDMap >> -48> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.0 >> apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0) >> -47> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >> handle_mds_map i am now mds.0.2529650 >> -46> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >> handle_mds_map state change up:standby --> up:replay >> -45> 2024-06-22T05:41:43.209+0100 7f184de84700 5 >> mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay >> -44> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >> replay_start >> -43> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >> waiting for osdmap 473739 (which blocklists prior instance) >> -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >> -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700 2 mds.0.cache Memory >> usage: total 299012, rss 37624, heap 182556, baseline 182556, 0 / 0 inodes >> have caps, 0 caps, 0 caps per inode >> -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >> _renew_subs >> -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >> -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >> handle_get_version_reply finishing 1 version 473739 >> -37> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening inotable >> -36> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening sessionmap >> -35> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening mds log >> -34> 2024-06-22T05:41:43.224+0100 7f1847e78700 5 mds.0.log open >> discovering log bounds >> -33> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening purge queue (async) >> -32> 2024-06-22T05:41:43.224+0100 7f1847e78700 4 mds.0.purge_queue >> open: opening >> -31> 2024-06-22T05:41:43.224+0100 7f1847e78700 1 >> mds.0.journaler.pq(ro) recover start >> -30> 2024-06-22T05:41:43.224+0100 7f1847e78700 1 >> 
mds.0.journaler.pq(ro) read_head >> -29> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: loading open file table (async) >> -28> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening snap table >> -27> 2024-06-22T05:41:43.224+0100 7f1847677700 4 mds.0.journalpointer >> Reading journal pointer '400.00000000' >> -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient: >> get_auth_request con 0x5592e8987000 auth_method 0 >> -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10 monclient: >> get_auth_request con 0x5592e8987c00 auth_method 0 >> -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1 >> mds.0.journaler.pq(ro) _finish_read_head loghead(trim 231160676352, expire >> 231163662875, write 231163662875, stream_format 1). probing for end of log >> (from 231163662875)... >> -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1 >> mds.0.journaler.pq(ro) probing for end of the log >> -22> 2024-06-22T05:41:43.252+0100 7f1847677700 1 >> mds.0.journaler.mdlog(ro) recover start >> -21> 2024-06-22T05:41:43.252+0100 7f1847677700 1 >> mds.0.journaler.mdlog(ro) read_head >> -20> 2024-06-22T05:41:43.252+0100 7f1847677700 4 mds.0.log Waiting >> for journal 0x200 to recover... >> -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10 monclient: >> get_auth_request con 0x5592e8bc6000 auth_method 0 >> -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10 monclient: >> get_auth_request con 0x5592e8bc6800 auth_method 0 >> -17> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >> mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 90131453181952, >> expire 90131465778558, write 90132009715463, stream_format 1). probing for >> end of log (from 90132009715463)... >> -16> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >> mds.0.journaler.mdlog(ro) probing for end of the log >> -15> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >> mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 90132019384791 >> (header had 90132009715463). recovered. >> -14> 2024-06-22T05:41:43.257+0100 7f1847677700 4 mds.0.log Journal >> 0x200 recovered. >> -13> 2024-06-22T05:41:43.257+0100 7f1847677700 4 mds.0.log Recovered >> journal 0x200 in format 1 >> -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1 >> mds.0.journaler.pq(ro) _finish_probe_end write_pos = 231163662875 (header >> had 231163662875). recovered. 
>> -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700 4 mds.0.purge_queue >> operator(): open complete >> -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1 >> mds.0.journaler.pq(ro) set_writeable >> -9> 2024-06-22T05:41:43.441+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 1: loading/discovering base inodes >> -8> 2024-06-22T05:41:43.441+0100 7f1847e78700 0 mds.0.cache creating >> system inode with ino:0x100 >> -7> 2024-06-22T05:41:43.442+0100 7f1847e78700 0 mds.0.cache creating >> system inode with ino:0x1 >> -6> 2024-06-22T05:41:43.442+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 2: replaying mds log >> -5> 2024-06-22T05:41:43.442+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 2: waiting for purge queue recovered >> -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: tick >> -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: >> _check_auth_rotating have uptodate secrets (they expire after >> 2024-06-22T05:41:14.091638+0100) >> -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700 2 mds.0.cache Memory >> usage: total 588368, rss 308304, heap 207132, baseline 182556, 0 / 15149 >> inodes have caps, 0 caps, 0 caps per inode >> -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1 >> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >> In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, >> T)>) [with T = inodeno_t; C = std::map]' thread 7f1846675700 time >> 2024-06-22T05:41:44.643146+0100 >> >> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy >> (stable) >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x135) [0x7f18568b64a3] >> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669] >> 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885] >> 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) >> [0x5592e532c7b7] >> 5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1] >> 6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b] >> 7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041] >> 8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca] >> 9: clone() >> >> 0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 *** Caught signal >> (Aborted) ** >> in thread 7f1846675700 thread_name:md_log_replay >> >> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy >> (stable) >> 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0] >> 2: gsignal() >> 3: abort() >> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x18f) [0x7f18568b64fd] >> 5: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669] >> 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885] >> 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) >> [0x5592e532c7b7] >> 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1] >> 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b] >> 10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041] >> 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca] >> 12: clone() >> >> We have a relatively low debug setting normally so I don't think many >> details of the initial crash were captured unfortunately and the MDS logs >> before the above (i.e. "-60" and older) are just beacon messages and >> _check_auth_rotating checks. 
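Understood; the excerpt above already captures the failed assert, so no
worries about the low debug level. If you do end up restarting an MDS to
reproduce this again, raising the MDS debug levels beforehand would give us
much more context around the replay. A sketch, assuming you use the
centralized config (remember to revert the values once you have the log):

$ ceph config set mds debug_mds 20
$ ceph config set mds debug_ms 1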
>> I was wondering whether you have any recommendations in terms of what
>> actions we could take to bring our filesystem back into a working state,
>> short of rebuilding the entire metadata pool? We are quite keen to bring
>> our backup back into service urgently, as we currently do not have any
>> accessible backups for our Ceph clusters.
>>
>> Kindest regards,
>>
>> Ivan
>>
>> On 25/06/2024 19:18, Dhairya Parmar wrote:
>>
>> On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
>> wrote:
>>
>>> Hi Dhairya,
>>>
>>> Thank you for your rapid reply. I tried recovering the dentries for the
>>> file just before the crash I mentioned before and then splicing the
>>> transactions from the journal, which seemed to remove that issue for that
>>> inode but resulted in the MDS crashing on the next inode in the journal
>>> when performing replay.
>>>
>> The MDS delegates a range of preallocated inodes (in the form of a set,
>> interval_set<inodeno_t> preallocated_inos) to the clients, so the untracked
>> range can be a single inode, several inodes from the range, or in the worst
>> case all of them, and this is something that even `cephfs-journal-tool`
>> would not be able to tell (since we're talking about MDS internals which
>> aren't exposed to such tools). That is the reason why you see the "MDS
>> crashing on the next inode in the journal when performing replay".
>>
>> An option could be to expose the inode set to some tool or asok command to
>> identify such inode ranges, which needs to be discussed. For now, we're
>> trying to address this in [0]; you can follow the discussion there.
>>
>> [0] https://tracker.ceph.com/issues/66251
>>
>>> Removing all the transactions involving the directory housing the files
>>> that seemed to cause these crashes from the journal only caused the MDS to
>>> fail to even start replay.
>>>
>>> I've rolled back our journal to our original version from when the crash
>>> first happened, and the entire MDS log for the crash can be found here:
>>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
>>>
>> Awesome, this would help us a ton. Apart from this, would it be possible
>> to send us client logs?
>>
>>> Please let us know if you would like any other log files, as we can
>>> easily induce this crash.
>>>
>> Since you can easily induce the crash, can you please share the
>> reproducer, i.e. the exact actions you take in order to hit it?
>>
>>> Kindest regards,
>>>
>>> Ivan
>>>
>>> On 25/06/2024 09:58, Dhairya Parmar wrote:
>>>
>>> Hi Ivan,
>>>
>>> This looks to be similar to the issue [0] that we're already addressing
>>> at [1]. So basically there is some out-of-sync event that led the client to
>>> make use of inodes that the MDS wasn't aware of/isn't tracking, and hence
>>> the crash. It'd be really helpful if you could provide us more logs.
>>> >>> CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar >>> <vshankar@xxxxxxxxxx> @Patrick Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li >>> <xiubli@xxxxxxxxxx> >>> >>> [0] https://tracker.ceph.com/issues/61009 >>> [1] https://tracker.ceph.com/issues/66251 >>> -- >>> *Dhairya Parmar* >>> >>> Associate Software Engineer, CephFS >>> >>> <https://www.redhat.com/>IBM, Inc. >>> >>> On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> >>> wrote: >>> >>>> Hello, >>>> >>>> We have been experiencing a serious issue with our CephFS backup >>>> cluster >>>> running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel >>>> (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem are >>>> constantly in a "replay" or "replay(laggy)" state and keep crashing. >>>> >>>> We have a single MDS filesystem called "ceph_backup" with 2 standby >>>> MDSes along with a 2nd unused filesystem "ceph_archive" (this holds >>>> little to no data) where we are using our "ceph_backup" filesystem to >>>> backup our data and this is the one which is currently broken. The Ceph >>>> health outputs currently are: >>>> >>>> root@pebbles-s1 14:05 [~]: ceph -s >>>> cluster: >>>> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631 >>>> health: HEALTH_WARN >>>> 1 filesystem is degraded >>>> insufficient standby MDS daemons available >>>> 1319 pgs not deep-scrubbed in time >>>> 1054 pgs not scrubbed in time >>>> >>>> services: >>>> mon: 4 daemons, quorum >>>> pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m) >>>> mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, >>>> pebbles-s3, pebbles-s1 >>>> mds: 2/2 daemons up >>>> osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 >>>> remapped pgs >>>> >>>> data: >>>> volumes: 1/2 healthy, 1 recovering >>>> pools: 7 pools, 2177 pgs >>>> objects: 3.55G objects, 7.0 PiB >>>> usage: 8.9 PiB used, 14 PiB / 23 PiB avail >>>> pgs: 83133528/30006841533 objects misplaced (0.277%) >>>> 2090 active+clean >>>> 47 active+clean+scrubbing+deep >>>> 29 active+remapped+backfilling >>>> 8 active+remapped+backfill_wait >>>> 2 active+clean+scrubbing >>>> 1 active+clean+snaptrim >>>> >>>> io: >>>> recovery: 1.9 GiB/s, 719 objects/s >>>> >>>> root@pebbles-s1 14:09 [~]: ceph fs status >>>> ceph_backup - 0 clients >>>> =========== >>>> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >>>> 0 replay(laggy) pebbles-s3 0 0 0 0 >>>> POOL TYPE USED AVAIL >>>> mds_backup_fs metadata 1255G 2780G >>>> ec82_primary_fs_data data 0 2780G >>>> ec82pool data 8442T 3044T >>>> ceph_archive - 2 clients >>>> ============ >>>> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >>>> 0 active pebbles-s2 Reqs: 0 /s 13.4k 7105 118 2 >>>> POOL TYPE USED AVAIL >>>> mds_archive_fs metadata 5184M 2780G >>>> ec83_primary_fs_data data 0 2780G >>>> ec83pool data 138T 2767T >>>> MDS version: ceph version 17.2.7 >>>> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) >>>> root@pebbles-s1 14:09 [~]: ceph health detail | head >>>> HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS >>>> daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not >>>> scrubbed in time >>>> [WRN] FS_DEGRADED: 1 filesystem is degraded >>>> fs ceph_backup is degraded >>>> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons >>>> available >>>> have 0; want 1 more >>>> >>>> When our cluster first ran after a reboot, Ceph ran through the 2 >>>> standby MDSes, crashing them all, until it reached the final MDS and is >>>> now stuck in this "replay(laggy)" state. 
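A quick aside on the standbys being chewed through: while a rank keeps failing
in replay like this, you can mark the filesystem non-joinable so the monitors
stop feeding your remaining standby MDSes into the crashing rank, and re-enable
it once you are ready to retry. A sketch using your fs name; nothing else needs
to change:

$ ceph fs set ceph_backup joinable false
$ # ... investigate / recover ...
$ ceph fs set ceph_backup joinable true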
Putting our MDSes into >>>> debugging mode, we can see that this MDS crashed when replaying the >>>> journal for a particular inode (this is the same for all the MDSes and >>>> they all crash on the same object): >>>> >>>> ... >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay for [521,head] had [inode 0x1005ba89481 >>>> [...539,head] >>>> >>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ >>>> auth fragtree_t(*^2 00*^3 00000*^ >>>> 4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4 >>>> 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4 >>>> 01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3 >>>> 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22 >>>> T05:41:10.213700+0100 1281276=1281276+0) n(v12 >>>> rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1) >>>> old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay dir 0x1005ba89481.011011000* >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* >>>> >>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ >>>> [2,head] auth v=436385 cv=0/0 state=107374182 >>>> 4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 >>>> rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) >>>> hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay added (full) [dentry >>>> >>>> #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti >>>> ons_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion >>>> lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 >>>> 0x557710444500] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] >>>> >>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_ >>>> 005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0 >>>> 1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 >>>> mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head] >>>> >>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762 >>>> 6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth >>>> v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) >>>> cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay inotable tablev 3112837 <= table 3112837 >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc >>>> [] used 0x1005cd4fe35 >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 -1 >>>> >>>> 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >>>> I >>>> n function 'void interval_set<T, C>::erase(T, T, >>>> std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' >>>> thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100 >>>> >>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >>>> 568: FAILED ceph_assert(p->first <= start) >>>> >>>> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) >>>> quincy (stable) >>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>>> const*)+0x135) [0x7f8821e814a3] >>>> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669] >>>> 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >>>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885] >>>> 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, >>>> MDPeerUpdate*)+0x4377) [0x5576f9eb77b7] >>>> 5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1] >>>> 6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b] >>>> 7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041] >>>> 8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca] >>>> 9: clone() >>>> >>>> I've only included a short section of the crash (this is the first >>>> trace >>>> in the log with regards to the crash with a 10/20 debug_mds option). We >>>> tried deleting the 0x1005cd4fe35 object from the object store using the >>>> "rados" command but this did not allow our MDS to successfully replay. >>>> >>>> From my understanding the journal seems okay as we didn't run out of >>>> space for example on our metadata pool and "cephfs-journal-tool journal >>>> inspect" doesn't seem to think there is any damage: >>>> >>>> root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0 >>>> journal inspect >>>> Overall journal integrity: OK >>>> root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0 >>>> event get --inode 1101069090357 summary >>>> Events by type: >>>> OPEN: 1 >>>> UPDATE: 3 >>>> Errors: 0 >>>> root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0 >>>> event get --inode 1101069090357 list >>>> 2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: (openc) >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> 2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE: >>>> (check_inode_max_size) >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> 2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN: () >>>> >>>> FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> 2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: (rename) >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc >>>> >>>> I was wondering whether anyone had any advice for us on how we should >>>> proceed forward? 
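The inspect and "event get" commands above are read-only, so they are safe to
re-run at any time. Before any event-level surgery on the journal, though, it
is worth taking an export first so you can always roll back. A sketch reusing
the same rank; the file path is just an example:

$ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup-rank0-journal.bin
$ # and, if you ever need to roll back:
$ cephfs-journal-tool --rank=ceph_backup:0 journal import /root/ceph_backup-rank0-journal.bin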
>>>> We were thinking about manually applying these events (via "event
>>>> apply") or, failing that, erasing this problematic event with
>>>> "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode
>>>> 1101069090357". Is this a good idea? We would rather not rebuild
>>>> the entire metadata pool if we could avoid it (once was enough for us)
>>>> as this cluster has ~9 PB of data on it.
>>>>
>>>> Kindest regards,
>>>>
>>>> Ivan Clayson
>>>>
>>>> --
>>>> Ivan Clayson
>>>> -----------------
>>>> Scientific Computing Officer
>>>> Room 2N249
>>>> Structural Studies
>>>> MRC Laboratory of Molecular Biology
>>>> Francis Crick Ave, Cambridge
>>>> CB2 0QH

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx