Ivan, before resetting the journal, could you take a backup of your journal
using `cephfs-journal-tool export` [0] and send it to us through
`ceph-post-file` [1] or any other means you're comfortable with? (A sketch of
the export and upload commands is included below.)

[0] https://docs.ceph.com/en/latest/cephfs/cephfs-journal-tool/#example-journal-import-export
[1] https://docs.ceph.com/en/latest/man/8/ceph-post-file

On Thu, Jun 27, 2024 at 5:09 PM Dhairya Parmar <dparmar@xxxxxxxxxx> wrote:

> Hi Ivan,
>
> The solution (which has been successful for us in the past) is to reset
> the journal. This would bring the fs back online and return the MDSes to a
> stable state, but some data would be lost: the data in the journal that
> hasn't been flushed to the backing store would be gone. Therefore, you
> should try to flush out as much journal data as possible before resetting
> the journal.
>
> Here are the steps for this entire process:
>
> 1) Bring the FS offline
> $ ceph fs fail <fs_name>
>
> 2) Recover dentries from the journal (run this for every MDS rank)
> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries summary
>
> 3) Reset the journal (again for every MDS rank)
> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
>
> 4) Bring the FS online
> $ ceph fs set <fs_name> joinable true
>
> 5) Restart the MDSes
>
> 6) Perform a scrub to ensure the consistency of the fs
> $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
> # for example, a recursive scrub: `ceph tell mds.<fs_name>:0 scrub start / recursive`
>
> Some important notes to keep in mind:
> * Recovering dentries will take time (generally, rank 0 is the most
> time-consuming, but the rest should be quick).
> * cephfs-journal-tool and the metadata OSDs are bound to use a significant
> amount of CPU. This is because cephfs-journal-tool has to read through the
> journal and flush it out to the backing store, which drives a heavy burst
> of metadata operations and, in turn, high CPU usage on the metadata OSDs.
>
> Do let me know how this goes.
>
> On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
> wrote:
>
>> Hi Dhairya,
>>
>> We can induce the crash by simply restarting the MDS, and the crash seems
>> to happen when an MDS goes from up:standby to up:replay. The MDS works
>> through a few files in the log before eventually crashing; I've included
>> the logs for this here (this is after I imported the backed-up journal,
>> which I hope was successful, but please let me know if you suspect it
>> wasn't!):
>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
>>
>> With respect to the client logs, are you referring to the clients that
>> are writing to the filesystem? We don't typically run them in any sort of
>> debug mode, and we have quite a few machines running our backup system,
>> but we can look an hour or so before the first MDS crash (though I don't
>> know if this is when the de-sync occurred).
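For reference, here is roughly what the export and upload mentioned above
would look like. This is only a sketch: it assumes the affected rank is
ceph_backup:0 (as in your earlier `ceph fs status` output) and the output path
is just an example, so please double-check the exact syntax against [0] and [1].

$ # export the rank 0 journal to a file (read-only, safe to run before any reset)
$ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup-rank0-journal.bin
$ # upload the exported file so the CephFS developers can access it
$ ceph-post-file -d "ceph_backup journal before reset" /root/ceph_backup-rank0-journal.bin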
Here are some MDS logs with regards to the >> initial crash on Saturday morning though which may be helpful: >> >> -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick >> -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: >> _check_auth_rotating have uptodate secrets (they expire after >> 2024-06-22T05:41:13.091556+0100) >> -57> 2024-06-22T05:41:43.208+0100 7f184de84700 1 mds.pebbles-s2 >> Updating MDS map to version 2529650 from mon.3 >> -56> 2024-06-22T05:41:43.208+0100 7f184de84700 4 mds.0.purge_queue >> operator(): data pool 6 not found in OSDMap >> -55> 2024-06-22T05:41:43.208+0100 7f184de84700 4 mds.0.purge_queue >> operator(): data pool 3 not found in OSDMap >> -54> 2024-06-22T05:41:43.209+0100 7f184de84700 5 asok(0x5592e7968000) >> register_command objecter_requests hook 0x5592e78f8800 >> -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >> _renew_subs >> -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >> -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 log_channel(cluster) >> update_config to_monitors: true to_syslog: false syslog_facility: prio: >> info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201) >> -50> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.purge_queue >> operator(): data pool 6 not found in OSDMap >> -49> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.purge_queue >> operator(): data pool 3 not found in OSDMap >> -48> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.0 >> apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0) >> -47> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >> handle_mds_map i am now mds.0.2529650 >> -46> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >> handle_mds_map state change up:standby --> up:replay >> -45> 2024-06-22T05:41:43.209+0100 7f184de84700 5 >> mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay >> -44> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >> replay_start >> -43> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >> waiting for osdmap 473739 (which blocklists prior instance) >> -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >> -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700 2 mds.0.cache Memory >> usage: total 299012, rss 37624, heap 182556, baseline 182556, 0 / 0 inodes >> have caps, 0 caps, 0 caps per inode >> -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >> _renew_subs >> -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >> -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >> handle_get_version_reply finishing 1 version 473739 >> -37> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening inotable >> -36> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening sessionmap >> -35> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening mds log >> -34> 2024-06-22T05:41:43.224+0100 7f1847e78700 5 mds.0.log open >> discovering log bounds >> -33> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening purge queue (async) >> -32> 2024-06-22T05:41:43.224+0100 7f1847e78700 4 mds.0.purge_queue >> open: opening >> -31> 2024-06-22T05:41:43.224+0100 7f1847e78700 1 >> mds.0.journaler.pq(ro) recover start >> -30> 2024-06-22T05:41:43.224+0100 7f1847e78700 1 >> 
mds.0.journaler.pq(ro) read_head >> -29> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: loading open file table (async) >> -28> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 0: opening snap table >> -27> 2024-06-22T05:41:43.224+0100 7f1847677700 4 mds.0.journalpointer >> Reading journal pointer '400.00000000' >> -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient: >> get_auth_request con 0x5592e8987000 auth_method 0 >> -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10 monclient: >> get_auth_request con 0x5592e8987c00 auth_method 0 >> -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1 >> mds.0.journaler.pq(ro) _finish_read_head loghead(trim 231160676352, expire >> 231163662875, write 231163662875, stream_format 1). probing for end of log >> (from 231163662875)... >> -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1 >> mds.0.journaler.pq(ro) probing for end of the log >> -22> 2024-06-22T05:41:43.252+0100 7f1847677700 1 >> mds.0.journaler.mdlog(ro) recover start >> -21> 2024-06-22T05:41:43.252+0100 7f1847677700 1 >> mds.0.journaler.mdlog(ro) read_head >> -20> 2024-06-22T05:41:43.252+0100 7f1847677700 4 mds.0.log Waiting >> for journal 0x200 to recover... >> -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10 monclient: >> get_auth_request con 0x5592e8bc6000 auth_method 0 >> -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10 monclient: >> get_auth_request con 0x5592e8bc6800 auth_method 0 >> -17> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >> mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 90131453181952, >> expire 90131465778558, write 90132009715463, stream_format 1). probing for >> end of log (from 90132009715463)... >> -16> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >> mds.0.journaler.mdlog(ro) probing for end of the log >> -15> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >> mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 90132019384791 >> (header had 90132009715463). recovered. >> -14> 2024-06-22T05:41:43.257+0100 7f1847677700 4 mds.0.log Journal >> 0x200 recovered. >> -13> 2024-06-22T05:41:43.257+0100 7f1847677700 4 mds.0.log Recovered >> journal 0x200 in format 1 >> -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1 >> mds.0.journaler.pq(ro) _finish_probe_end write_pos = 231163662875 (header >> had 231163662875). recovered. 
>> -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700 4 mds.0.purge_queue >> operator(): open complete >> -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1 >> mds.0.journaler.pq(ro) set_writeable >> -9> 2024-06-22T05:41:43.441+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 1: loading/discovering base inodes >> -8> 2024-06-22T05:41:43.441+0100 7f1847e78700 0 mds.0.cache creating >> system inode with ino:0x100 >> -7> 2024-06-22T05:41:43.442+0100 7f1847e78700 0 mds.0.cache creating >> system inode with ino:0x1 >> -6> 2024-06-22T05:41:43.442+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 2: replaying mds log >> -5> 2024-06-22T05:41:43.442+0100 7f1847e78700 2 mds.0.2529650 >> Booting: 2: waiting for purge queue recovered >> -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: tick >> -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: >> _check_auth_rotating have uptodate secrets (they expire after >> 2024-06-22T05:41:14.091638+0100) >> -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700 2 mds.0.cache Memory >> usage: total 588368, rss 308304, heap 207132, baseline 182556, 0 / 15149 >> inodes have caps, 0 caps, 0 caps per inode >> -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1 >> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >> In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, >> T)>) [with T = inodeno_t; C = std::map]' thread 7f1846675700 time >> 2024-06-22T05:41:44.643146+0100 >> >> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy >> (stable) >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x135) [0x7f18568b64a3] >> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669] >> 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885] >> 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) >> [0x5592e532c7b7] >> 5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1] >> 6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b] >> 7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041] >> 8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca] >> 9: clone() >> >> 0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 *** Caught signal >> (Aborted) ** >> in thread 7f1846675700 thread_name:md_log_replay >> >> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy >> (stable) >> 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0] >> 2: gsignal() >> 3: abort() >> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x18f) [0x7f18568b64fd] >> 5: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669] >> 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885] >> 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) >> [0x5592e532c7b7] >> 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1] >> 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b] >> 10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041] >> 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca] >> 12: clone() >> >> We have a relatively low debug setting normally so I don't think many >> details of the initial crash were captured unfortunately and the MDS logs >> before the above (i.e. "-60" and older) are just beacon messages and >> _check_auth_rotating checks. 
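Understood; the excerpt above already captures the failed assert, so no
worries about the low debug level. If you do end up restarting an MDS to
reproduce this again, raising the MDS debug levels beforehand would give us
much more context around the replay. A sketch, assuming you use the
centralized config (remember to revert the values once you have the log):

$ ceph config set mds debug_mds 20
$ ceph config set mds debug_ms 1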
>> I was wondering whether you have any recommendations in terms of what
>> actions we could take to bring our filesystem back into a working state,
>> short of rebuilding the entire metadata pool? We are quite keen to bring
>> our backup back into service urgently, as we currently do not have any
>> accessible backups for our Ceph clusters.
>>
>> Kindest regards,
>>
>> Ivan
>>
>> On 25/06/2024 19:18, Dhairya Parmar wrote:
>>
>> On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
>> wrote:
>>
>>> Hi Dhairya,
>>>
>>> Thank you for your rapid reply. I tried recovering the dentries for the
>>> file just before the crash I mentioned before and then splicing the
>>> transactions from the journal, which seemed to remove that issue for that
>>> inode but resulted in the MDS crashing on the next inode in the journal
>>> when performing replay.
>>>
>> The MDS delegates a range of preallocated inodes (in the form of a set,
>> interval_set<inodeno_t> preallocated_inos) to the clients, so the untracked
>> range can be a single inode, several inodes from the range, or in the worst
>> case all of them, and this is something that even `cephfs-journal-tool`
>> would not be able to tell (since we're talking about MDS internals which
>> aren't exposed to such tools). That is the reason why you see the "MDS
>> crashing on the next inode in the journal when performing replay".
>>
>> An option could be to expose the inode set to some tool or asok command to
>> identify such inode ranges, which needs to be discussed. For now, we're
>> trying to address this in [0]; you can follow the discussion there.
>>
>> [0] https://tracker.ceph.com/issues/66251
>>
>>> Removing all the transactions involving the directory housing the files
>>> that seemed to cause these crashes from the journal only caused the MDS to
>>> fail to even start replay.
>>>
>>> I've rolled back our journal to our original version from when the crash
>>> first happened, and the entire MDS log for the crash can be found here:
>>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
>>>
>> Awesome, this would help us a ton. Apart from this, would it be possible
>> to send us client logs?
>>
>>> Please let us know if you would like any other log files, as we can
>>> easily induce this crash.
>>>
>> Since you can easily induce the crash, can you please share the
>> reproducer, i.e. the exact actions you take in order to hit it?
>>
>>> Kindest regards,
>>>
>>> Ivan
>>>
>>> On 25/06/2024 09:58, Dhairya Parmar wrote:
>>>
>>> Hi Ivan,
>>>
>>> This looks to be similar to the issue [0] that we're already addressing
>>> at [1]. So basically there is some out-of-sync event that led the client to
>>> make use of inodes that the MDS wasn't aware of/isn't tracking, and hence
>>> the crash. It'd be really helpful if you could provide us more logs.
>>> >>> CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar >>> <vshankar@xxxxxxxxxx> @Patrick Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li >>> <xiubli@xxxxxxxxxx> >>> >>> [0] https://tracker.ceph.com/issues/61009 >>> [1] https://tracker.ceph.com/issues/66251 >>> -- >>> *Dhairya Parmar* >>> >>> Associate Software Engineer, CephFS >>> >>> <https://www.redhat.com/>IBM, Inc. >>> >>> On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> >>> wrote: >>> >>>> Hello, >>>> >>>> We have been experiencing a serious issue with our CephFS backup >>>> cluster >>>> running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel >>>> (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem are >>>> constantly in a "replay" or "replay(laggy)" state and keep crashing. >>>> >>>> We have a single MDS filesystem called "ceph_backup" with 2 standby >>>> MDSes along with a 2nd unused filesystem "ceph_archive" (this holds >>>> little to no data) where we are using our "ceph_backup" filesystem to >>>> backup our data and this is the one which is currently broken. The Ceph >>>> health outputs currently are: >>>> >>>> root@pebbles-s1 14:05 [~]: ceph -s >>>> cluster: >>>> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631 >>>> health: HEALTH_WARN >>>> 1 filesystem is degraded >>>> insufficient standby MDS daemons available >>>> 1319 pgs not deep-scrubbed in time >>>> 1054 pgs not scrubbed in time >>>> >>>> services: >>>> mon: 4 daemons, quorum >>>> pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m) >>>> mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, >>>> pebbles-s3, pebbles-s1 >>>> mds: 2/2 daemons up >>>> osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 >>>> remapped pgs >>>> >>>> data: >>>> volumes: 1/2 healthy, 1 recovering >>>> pools: 7 pools, 2177 pgs >>>> objects: 3.55G objects, 7.0 PiB >>>> usage: 8.9 PiB used, 14 PiB / 23 PiB avail >>>> pgs: 83133528/30006841533 objects misplaced (0.277%) >>>> 2090 active+clean >>>> 47 active+clean+scrubbing+deep >>>> 29 active+remapped+backfilling >>>> 8 active+remapped+backfill_wait >>>> 2 active+clean+scrubbing >>>> 1 active+clean+snaptrim >>>> >>>> io: >>>> recovery: 1.9 GiB/s, 719 objects/s >>>> >>>> root@pebbles-s1 14:09 [~]: ceph fs status >>>> ceph_backup - 0 clients >>>> =========== >>>> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >>>> 0 replay(laggy) pebbles-s3 0 0 0 0 >>>> POOL TYPE USED AVAIL >>>> mds_backup_fs metadata 1255G 2780G >>>> ec82_primary_fs_data data 0 2780G >>>> ec82pool data 8442T 3044T >>>> ceph_archive - 2 clients >>>> ============ >>>> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >>>> 0 active pebbles-s2 Reqs: 0 /s 13.4k 7105 118 2 >>>> POOL TYPE USED AVAIL >>>> mds_archive_fs metadata 5184M 2780G >>>> ec83_primary_fs_data data 0 2780G >>>> ec83pool data 138T 2767T >>>> MDS version: ceph version 17.2.7 >>>> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) >>>> root@pebbles-s1 14:09 [~]: ceph health detail | head >>>> HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS >>>> daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not >>>> scrubbed in time >>>> [WRN] FS_DEGRADED: 1 filesystem is degraded >>>> fs ceph_backup is degraded >>>> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons >>>> available >>>> have 0; want 1 more >>>> >>>> When our cluster first ran after a reboot, Ceph ran through the 2 >>>> standby MDSes, crashing them all, until it reached the final MDS and is >>>> now stuck in this "replay(laggy)" state. 
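A quick aside on the standbys being chewed through: while a rank keeps failing
in replay like this, you can mark the filesystem non-joinable so the monitors
stop feeding your remaining standby MDSes into the crashing rank, and re-enable
it once you are ready to retry. A sketch using your fs name; nothing else needs
to change:

$ ceph fs set ceph_backup joinable false
$ # ... investigate / recover ...
$ ceph fs set ceph_backup joinable true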
Putting our MDSes into >>>> debugging mode, we can see that this MDS crashed when replaying the >>>> journal for a particular inode (this is the same for all the MDSes and >>>> they all crash on the same object): >>>> >>>> ... >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay for [521,head] had [inode 0x1005ba89481 >>>> [...539,head] >>>> >>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ >>>> auth fragtree_t(*^2 00*^3 00000*^ >>>> 4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4 >>>> 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4 >>>> 01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3 >>>> 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22 >>>> T05:41:10.213700+0100 1281276=1281276+0) n(v12 >>>> rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1) >>>> old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay dir 0x1005ba89481.011011000* >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* >>>> >>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ >>>> [2,head] auth v=436385 cv=0/0 state=107374182 >>>> 4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 >>>> rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) >>>> hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay added (full) [dentry >>>> >>>> #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti >>>> ons_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion >>>> lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 >>>> 0x557710444500] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] >>>> >>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_ >>>> 005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0 >>>> 1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 >>>> mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head] >>>> >>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762 >>>> 6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth >>>> v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) >>>> cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680] >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay inotable tablev 3112837 <= table 3112837 >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>> EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc >>>> [] used 0x1005cd4fe35 >>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 -1 >>>> >>>> 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >>>> I >>>> n function 'void interval_set<T, C>::erase(T, T, >>>> std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' >>>> thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100 >>>> >>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >>>> 568: FAILED ceph_assert(p->first <= start) >>>> >>>> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) >>>> quincy (stable) >>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>>> const*)+0x135) [0x7f8821e814a3] >>>> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669] >>>> 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >>>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885] >>>> 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, >>>> MDPeerUpdate*)+0x4377) [0x5576f9eb77b7] >>>> 5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1] >>>> 6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b] >>>> 7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041] >>>> 8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca] >>>> 9: clone() >>>> >>>> I've only included a short section of the crash (this is the first >>>> trace >>>> in the log with regards to the crash with a 10/20 debug_mds option). We >>>> tried deleting the 0x1005cd4fe35 object from the object store using the >>>> "rados" command but this did not allow our MDS to successfully replay. >>>> >>>> From my understanding the journal seems okay as we didn't run out of >>>> space for example on our metadata pool and "cephfs-journal-tool journal >>>> inspect" doesn't seem to think there is any damage: >>>> >>>> root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0 >>>> journal inspect >>>> Overall journal integrity: OK >>>> root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0 >>>> event get --inode 1101069090357 summary >>>> Events by type: >>>> OPEN: 1 >>>> UPDATE: 3 >>>> Errors: 0 >>>> root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0 >>>> event get --inode 1101069090357 list >>>> 2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: (openc) >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> 2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE: >>>> (check_inode_max_size) >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> 2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN: () >>>> >>>> FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> 2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: (rename) >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>> >>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc >>>> >>>> I was wondering whether anyone had any advice for us on how we should >>>> proceed forward? 
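The inspect and "event get" commands above are read-only, so they are safe to
re-run at any time. Before any event-level surgery on the journal, though, it
is worth taking an export first so you can always roll back. A sketch reusing
the same rank; the file path is just an example:

$ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup-rank0-journal.bin
$ # and, if you ever need to roll back:
$ cephfs-journal-tool --rank=ceph_backup:0 journal import /root/ceph_backup-rank0-journal.bin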
>>>> We were thinking about manually applying these events (via "event
>>>> apply") or, failing that, erasing this problematic event with
>>>> "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode
>>>> 1101069090357". Is this a good idea? We would rather not rebuild
>>>> the entire metadata pool if we could avoid it (once was enough for us)
>>>> as this cluster has ~9 PB of data on it.
>>>>
>>>> Kindest regards,
>>>>
>>>> Ivan Clayson
>>>>
>>>> --
>>>> Ivan Clayson
>>>> -----------------
>>>> Scientific Computing Officer
>>>> Room 2N249
>>>> Structural Studies
>>>> MRC Laboratory of Molecular Biology
>>>> Francis Crick Ave, Cambridge
>>>> CB2 0QH

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx