Hey Ivan,

This is a relatively new MDS crash, so it will require some investigation, but I was instructed to recommend the disaster-recovery steps [0] (except session reset) to you to get the FS up again. This crash is being discussed on the upstream CephFS Slack channel [1] with @Venky Shankar <vshankar@xxxxxxxxxx> and other CephFS devs. I'd encourage you to join the conversation; we can discuss this in detail and maybe go through the incident step by step, which should help analyse the crash better.

[0] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
[1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
> Hi Dhairya,
>
> Thank you ever so much for having another look at this so quickly. I don't think I have any logs similar to the ones you referenced this time, as my MDSs don't seem to enter the replay stage when they crash (or at least don't now after I've thrown the logs away), but those errors do crop up in the prior logs I shared when the system first crashed.
>
> Kindest regards,
>
> Ivan
> On 08/07/2024 14:08, Dhairya Parmar wrote:
>
> CAUTION: This email originated from outside of the LMB:
> *.-dparmar@xxxxxxxxxx-.*
> Do not click links or open attachments unless you recognize the sender and know the content is safe.
> If you think this is a phishing email, please forward it to phishing@xxxxxxxxxxxxxxxxx
>
> --
> Ugh, something went horribly wrong. I've downloaded the MDS logs that contain the assertion failure and it looks relevant to this [0]. Do you have client logs for this?
>
> The other log that you shared is being downloaded right now; once that's done and I'm done going through it, I'll update you.
> [0] https://tracker.ceph.com/issues/54546
>
> On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
>> Hi Dhairya,
>>
>> Sorry to resurrect this thread again, but we still unfortunately have an issue with our filesystem after we attempted to write new backups to it.
>>
>> We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and data was accessible again.
>>
>> We re-started the backups to run over the weekend and unfortunately the filesystem crashed again; the log of the failure is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. We ran the backups on kernel mounts of the filesystem without the nowsync option this time to avoid the out-of-sync write problems.
>>
>> I've tried resetting the journal again after recovering the dentries but unfortunately the filesystem is still in a failed state despite setting joinable to true. The log of this crash is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.
>>
>> I'm not sure how to proceed as I can't seem to get any MDS to take over the first rank. I would like to do a scrub of the filesystem and preferably overwrite the troublesome files with the originals on the live filesystem. Do you have any advice on how to make the filesystem leave its failed state? I have a backup of the journal before I reset it so I can roll back if necessary.
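[Running kernel mounts without nowsync, as described above, is just a mount-option change. A sketch for reference — the mountpoint, client name, and fs name here are illustrative, and older kernels spell `fs=` as `mds_namespace=`:]

```shell
# Mount the kernel CephFS client with synchronous dirops (wsync).
# Mountpoint, client name, and fs name are illustrative.
mount -t ceph :/ /mnt/backup -o name=backup,fs=ceph_backup,wsync

# Or as an fstab entry:
# :/   /mnt/backup   ceph   name=backup,fs=ceph_backup,wsync,_netdev   0 0
```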
>> >> Here are some details about the filesystem at present: >> >> root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status >> cluster: >> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631 >> health: HEALTH_ERR >> 1 filesystem is degraded >> 1 large omap objects >> 1 filesystem is offline >> 1 mds daemon damaged >> >> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set >> 1750 pgs not deep-scrubbed in time >> 1612 pgs not scrubbed in time >> >> services: >> mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 >> (age 50m) >> mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, >> pebbles-s4 >> mds: 1/2 daemons up, 3 standby >> osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped >> pgs >> flags >> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim >> >> data: >> volumes: 1/2 healthy, 1 recovering; 1 damaged >> pools: 7 pools, 2177 pgs >> objects: 3.24G objects, 6.7 PiB >> usage: 8.6 PiB used, 14 PiB / 23 PiB avail >> pgs: 11785954/27384310061 objects misplaced (0.043%) >> 2167 active+clean >> 6 active+remapped+backfilling >> 4 active+remapped+backfill_wait >> >> ceph_backup - 0 clients >> =========== >> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >> 0 failed >> POOL TYPE USED AVAIL >> mds_backup_fs metadata 1174G 3071G >> ec82_primary_fs_data data 0 3071G >> ec82pool data 8085T 4738T >> ceph_archive - 2 clients >> ============ >> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >> 0 active pebbles-s4 Reqs: 0 /s 13.4k 7105 118 2 >> POOL TYPE USED AVAIL >> mds_archive_fs metadata 5184M 3071G >> ec83_primary_fs_data data 0 3071G >> ec83pool data 138T 4307T >> STANDBY MDS >> pebbles-s2 >> pebbles-s3 >> pebbles-s1 >> MDS version: ceph version 17.2.7 >> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) >> root@pebbles-s2 11:55 [~]: ceph fs dump >> e2643889 >> enable_multiple, ever_enabled_multiple: 1,1 >> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client >> writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> legacy client fscid: 1
>>
>> Filesystem 'ceph_backup' (1)
>> fs_name ceph_backup
>> epoch 2643888
>> flags 12 joinable allow_snaps allow_multimds_snaps
>> created 2023-05-19T12:52:36.302135+0100
>> modified 2024-07-08T11:17:55.437861+0100
>> tableserver 0
>> root 0
>> session_timeout 60
>> session_autoclose 300
>> max_file_size 109934182400000
>> required_client_features {}
>> last_failure 0
>> last_failure_osd_epoch 494515
>> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> max_mds 1
>> in 0
>> up {}
>> failed
>> damaged 0
>> stopped
>> data_pools [6,3]
>> metadata_pool 2
>> inline_data disabled
>> balancer
>> standby_count_wanted 1
>>
>> Kindest regards,
>>
>> Ivan
>> On 28/06/2024 15:17, Dhairya Parmar wrote:
>>
>>
>> On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
>>> Hi Dhairya,
>>>
>>> I would be more than happy to share our corrupted journal. Has the host key changed for drop.ceph.com? The fingerprint I'm being sent is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is different to the one in our /usr/share/ceph/known_hosts_drop.ceph.com.
>>>
>> Ah, strange. Let me get in touch with folks who might know about this, will revert back to you ASAP.
>>
>>> Thank you for your advice as well.
>>> We've reset our MDS' journal and are currently in the process of a full filesystem scrub, which understandably is taking quite a bit of time but seems to be progressing through the objects fine.
>>>
>> YAY!
>>
>>> Thank you ever so much for all your help and please do feel free to follow up with us if you would like any further details about our crash!
>>>
>> Glad to hear it went well. This bug is being worked on with high priority and once the patch is ready, it will be backported.
>>
>> The root cause of this issue is `nowsync` (async dirops) being enabled by default with the kclient [0]. This feature allows asynchronous creation and deletion of files, optimizing performance by avoiding round-trip latency for these system calls. However, in very rare cases (like yours :D), it can affect the system's consistency and stability. Hence, if this kind of optimization is not a priority for your workload, I recommend turning it off by switching the mount points to `wsync` and also setting the MDS config `mds_client_delegate_inos_pct` to `0`, so that you don't end up in this situation again (until the bug fix arrives :)).
>>
>> [0] https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
>>
>>> Kindest regards,
>>>
>>> Ivan
>>> On 27/06/2024 12:39, Dhairya Parmar wrote:
>>>
>>> Hi Ivan,
>>>
>>> The solution (which has been successful for us in the past) is to reset the journal. This would bring the fs back online and return the MDSes to a stable state, but some data would be lost: the data in the journal that hasn't been flushed to the backing store would be gone.
>>> Therefore, you should try to flush out as much journal data as possible before resetting the journal.
>>>
>>> Here are the steps for this entire process:
>>>
>>> 1) Bring the FS offline
>>> $ ceph fs fail <fs_name>
>>>
>>> 2) Recover dentries from the journal (run it for every MDS rank)
>>> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries summary
>>>
>>> 3) Reset the journal (again for every MDS rank)
>>> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
>>>
>>> 4) Bring the FS online
>>> $ ceph fs set <fs_name> joinable true
>>>
>>> 5) Restart the MDSes
>>>
>>> 6) Perform a scrub to ensure consistency of the fs
>>> $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
>>> # you could try a recursive scrub, e.g. `ceph tell mds.<fs_name>:0 scrub start / recursive`
>>>
>>> Some important notes to keep in mind:
>>> * Recovering dentries will take time (generally, rank 0 is the most time-consuming, but the rest should be quick).
>>> * cephfs-journal-tool and the metadata OSDs are bound to use a significant percentage of CPU. This is because cephfs-journal-tool has to ingest the journal data and flush it out to the backing store, which also makes the metadata operations go rampant, resulting in OSDs taking a significant percentage of CPU.
>>>
>>> Do let me know how this goes.
>>>
>>> On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
>>>> Hi Dhairya,
>>>>
>>>> We can induce the crash by simply restarting the MDS and the crash seems to happen when an MDS goes from up:standby to up:replay.
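[The recovery steps quoted above can be collected into a dry-run script for review. The fs name `ceph_backup` and single rank 0 are taken from this thread; the `run` wrapper only echoes each command, so nothing destructive happens until the echo is dropped:]

```shell
# Dry-run sketch of the recovery steps above (fs name "ceph_backup",
# single active rank 0, as in this thread). "run" only prints each
# command; remove the echo to execute for real.
FS=ceph_backup
RANKS="0"

run() { echo "+ $*"; }

run ceph fs fail "$FS"
for r in $RANKS; do
  run cephfs-journal-tool --rank="$FS:$r" event recover_dentries summary
  run cephfs-journal-tool --rank="$FS:$r" journal reset
done
run ceph fs set "$FS" joinable true
# restart the MDS daemons, then scrub:
run ceph tell "mds.$FS:0" scrub start / recursive
```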
The MDS >>>> works through a few files in the log before eventually crashing where I've >>>> included the logs for this here (this is after I imported the backed up >>>> journal which I hope was successful but please let me know if you suspect >>>> it wasn't!): >>>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log >>>> >>>> With respect to the client logs, are you referring to the clients who >>>> are writing to the filesystem? We don't typically run them in any sort of >>>> debug mode and we have quite a few machines running our backup system but >>>> we can look an hour or so before the first MDS crash (though I don't know >>>> if this is when the de-sync occurred). Here are some MDS logs with regards >>>> to the initial crash on Saturday morning though which may be helpful: >>>> >>>> -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick >>>> -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: >>>> _check_auth_rotating have uptodate secrets (they expire after >>>> 2024-06-22T05:41:13.091556+0100) >>>> -57> 2024-06-22T05:41:43.208+0100 7f184de84700 1 mds.pebbles-s2 >>>> Updating MDS map to version 2529650 from mon.3 >>>> -56> 2024-06-22T05:41:43.208+0100 7f184de84700 4 mds.0.purge_queue >>>> operator(): data pool 6 not found in OSDMap >>>> -55> 2024-06-22T05:41:43.208+0100 7f184de84700 4 mds.0.purge_queue >>>> operator(): data pool 3 not found in OSDMap >>>> -54> 2024-06-22T05:41:43.209+0100 7f184de84700 5 >>>> asok(0x5592e7968000) register_command objecter_requests hook 0x5592e78f8800 >>>> -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >>>> _renew_subs >>>> -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >>>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >>>> -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 >>>> log_channel(cluster) update_config to_monitors: true to_syslog: false >>>> syslog_facility: prio: info to_graylog: false graylog_host: 
127.0.0.1 >>>> graylog_port: 12201) >>>> -50> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.purge_queue >>>> operator(): data pool 6 not found in OSDMap >>>> -49> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.purge_queue >>>> operator(): data pool 3 not found in OSDMap >>>> -48> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.0 >>>> apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0) >>>> -47> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >>>> handle_mds_map i am now mds.0.2529650 >>>> -46> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >>>> handle_mds_map state change up:standby --> up:replay >>>> -45> 2024-06-22T05:41:43.209+0100 7f184de84700 5 >>>> mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay >>>> -44> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >>>> replay_start >>>> -43> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >>>> waiting for osdmap 473739 (which blocklists prior instance) >>>> -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >>>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >>>> -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700 2 mds.0.cache Memory >>>> usage: total 299012, rss 37624, heap 182556, baseline 182556, 0 / 0 inodes >>>> have caps, 0 caps, 0 caps per inode >>>> -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >>>> _renew_subs >>>> -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >>>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >>>> -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >>>> handle_get_version_reply finishing 1 version 473739 >>>> -37> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 0: opening inotable >>>> -36> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 0: opening sessionmap >>>> -35> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 0: opening mds log >>>> -34> 
2024-06-22T05:41:43.224+0100 7f1847e78700 5 mds.0.log open >>>> discovering log bounds >>>> -33> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 0: opening purge queue (async) >>>> -32> 2024-06-22T05:41:43.224+0100 7f1847e78700 4 mds.0.purge_queue >>>> open: opening >>>> -31> 2024-06-22T05:41:43.224+0100 7f1847e78700 1 >>>> mds.0.journaler.pq(ro) recover start >>>> -30> 2024-06-22T05:41:43.224+0100 7f1847e78700 1 >>>> mds.0.journaler.pq(ro) read_head >>>> -29> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 0: loading open file table (async) >>>> -28> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 0: opening snap table >>>> -27> 2024-06-22T05:41:43.224+0100 7f1847677700 4 >>>> mds.0.journalpointer Reading journal pointer '400.00000000' >>>> -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient: >>>> get_auth_request con 0x5592e8987000 auth_method 0 >>>> -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10 monclient: >>>> get_auth_request con 0x5592e8987c00 auth_method 0 >>>> -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1 >>>> mds.0.journaler.pq(ro) _finish_read_head loghead(trim 231160676352, expire >>>> 231163662875, write 231163662875, stream_format 1). probing for end of log >>>> (from 231163662875)... >>>> -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1 >>>> mds.0.journaler.pq(ro) probing for end of the log >>>> -22> 2024-06-22T05:41:43.252+0100 7f1847677700 1 >>>> mds.0.journaler.mdlog(ro) recover start >>>> -21> 2024-06-22T05:41:43.252+0100 7f1847677700 1 >>>> mds.0.journaler.mdlog(ro) read_head >>>> -20> 2024-06-22T05:41:43.252+0100 7f1847677700 4 mds.0.log Waiting >>>> for journal 0x200 to recover... 
>>>> -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10 monclient: >>>> get_auth_request con 0x5592e8bc6000 auth_method 0 >>>> -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10 monclient: >>>> get_auth_request con 0x5592e8bc6800 auth_method 0 >>>> -17> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >>>> mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 90131453181952, >>>> expire 90131465778558, write 90132009715463, stream_format 1). probing for >>>> end of log (from 90132009715463)... >>>> -16> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >>>> mds.0.journaler.mdlog(ro) probing for end of the log >>>> -15> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >>>> mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 90132019384791 >>>> (header had 90132009715463). recovered. >>>> -14> 2024-06-22T05:41:43.257+0100 7f1847677700 4 mds.0.log Journal >>>> 0x200 recovered. >>>> -13> 2024-06-22T05:41:43.257+0100 7f1847677700 4 mds.0.log >>>> Recovered journal 0x200 in format 1 >>>> -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1 >>>> mds.0.journaler.pq(ro) _finish_probe_end write_pos = 231163662875 (header >>>> had 231163662875). recovered. 
>>>> -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700 4 mds.0.purge_queue >>>> operator(): open complete >>>> -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1 >>>> mds.0.journaler.pq(ro) set_writeable >>>> -9> 2024-06-22T05:41:43.441+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 1: loading/discovering base inodes >>>> -8> 2024-06-22T05:41:43.441+0100 7f1847e78700 0 mds.0.cache >>>> creating system inode with ino:0x100 >>>> -7> 2024-06-22T05:41:43.442+0100 7f1847e78700 0 mds.0.cache >>>> creating system inode with ino:0x1 >>>> -6> 2024-06-22T05:41:43.442+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 2: replaying mds log >>>> -5> 2024-06-22T05:41:43.442+0100 7f1847e78700 2 mds.0.2529650 >>>> Booting: 2: waiting for purge queue recovered >>>> -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: tick >>>> -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: >>>> _check_auth_rotating have uptodate secrets (they expire after >>>> 2024-06-22T05:41:14.091638+0100) >>>> -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700 2 mds.0.cache Memory >>>> usage: total 588368, rss 308304, heap 207132, baseline 182556, 0 / 15149 >>>> inodes have caps, 0 caps, 0 caps per inode >>>> -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1 >>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >>>> In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, >>>> T)>) [with T = inodeno_t; C = std::map]' thread 7f1846675700 time >>>> 2024-06-22T05:41:44.643146+0100 >>>> >>>> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy >>>> (stable) >>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>>> const*)+0x135) [0x7f18568b64a3] >>>> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669] >>>> 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >>>> 
std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885] >>>> 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, >>>> MDPeerUpdate*)+0x4377) [0x5592e532c7b7] >>>> 5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1] >>>> 6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b] >>>> 7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041] >>>> 8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca] >>>> 9: clone() >>>> >>>> 0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 *** Caught signal >>>> (Aborted) ** >>>> in thread 7f1846675700 thread_name:md_log_replay >>>> >>>> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy >>>> (stable) >>>> 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0] >>>> 2: gsignal() >>>> 3: abort() >>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>>> const*)+0x18f) [0x7f18568b64fd] >>>> 5: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669] >>>> 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >>>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885] >>>> 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, >>>> MDPeerUpdate*)+0x4377) [0x5592e532c7b7] >>>> 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1] >>>> 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b] >>>> 10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041] >>>> 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca] >>>> 12: clone() >>>> >>>> We have a relatively low debug setting normally so I don't think many >>>> details of the initial crash were captured unfortunately and the MDS logs >>>> before the above (i.e. "-60" and older) are just beacon messages and >>>> _check_auth_rotating checks. >>>> >>>> I was wondering whether you have any recommendations in terms of what >>>> actions we could take to bring our filesystem back into a working state >>>> short of rebuilding the entire metadata pool? 
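[Since the initial crash above was captured at a relatively low debug setting, raising MDS verbosity before reproducing the crash would capture more detail. A sketch — the subsystems and levels shown are common choices for MDS debugging, not something prescribed in this thread:]

```shell
# Raise MDS log verbosity before inducing the crash
# (levels are illustrative; 20 is very chatty).
ceph config set mds debug_mds 20
ceph config set mds debug_journaler 20

# ... reproduce the crash, collect the log, then revert:
ceph config set mds debug_mds 1/5
ceph config set mds debug_journaler 0/5
```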
>>>> We are quite keen to bring our backup back into service urgently as we currently do not have any accessible backups for our Ceph clusters.
>>>>
>>>> Kindest regards,
>>>>
>>>> Ivan
>>>> On 25/06/2024 19:18, Dhairya Parmar wrote:
>>>>
>>>>
>>>> On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
>>>>> Hi Dhairya,
>>>>>
>>>>> Thank you for your rapid reply. I tried recovering the dentries for the file just before the crash I mentioned before and then splicing the transactions from the journal, which seemed to remove that issue for that inode but resulted in the MDS crashing on the next inode in the journal when performing replay.
>>>>>
>>>> The MDS delegates a range of preallocated inodes (in the form of a set, interval_set<inodeno_t> preallocated_inos) to the clients, so it can be one inode that is untracked, or some inodes from the range, or in the worst-case scenario ALL of them, and this is something that even the `cephfs-journal-tool` would not be able to tell (since we're talking about MDS internals which aren't exposed to such tools). That is the reason why you see "MDS crashing on the next inode in the journal when performing replay".
>>>>
>>>> An option could be to expose the inode set to some tool or asok cmd to identify such inode ranges, which needs to be discussed. For now, we're trying to address this in [0]; you can follow the discussion there.
>>>> [0] https://tracker.ceph.com/issues/66251
>>>>
>>>>> Removing all the transactions involving the directory housing the files that seemed to cause these crashes from the journal only caused the MDS to fail to even start replay.
>>>>>
>>>>> I've rolled back our journal to our original version from when the crash first happened and the entire MDS log for the crash can be found here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
>>>>>
>>>> Awesome, this would help us a ton. Apart from this, would it be possible to send us client logs?
>>>>
>>>>> Please let us know if you would like any other log files as we can easily induce this crash.
>>>>>
>>>> Since you can easily induce the crash, can you share the reproducer please, i.e. what actions you take in order to hit it?
>>>>
>>>>> Kindest regards,
>>>>>
>>>>> Ivan
>>>>> On 25/06/2024 09:58, Dhairya Parmar wrote:
>>>>>
>>>>> Hi Ivan,
>>>>>
>>>>> This looks to be similar to the issue [0] that we're already addressing at [1]. So basically there is some out-of-sync event that led the client to make use of inodes that the MDS wasn't aware of/isn't tracking, and hence the crash. It'd be really helpful if you can provide us more logs.
>>>>> CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar <vshankar@xxxxxxxxxx> @Patrick Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li <xiubli@xxxxxxxxxx>
>>>>>
>>>>> [0] https://tracker.ceph.com/issues/61009
>>>>> [1] https://tracker.ceph.com/issues/66251
>>>>> --
>>>>> Dhairya Parmar
>>>>> Associate Software Engineer, CephFS
>>>>> IBM, Inc.
>>>>>
>>>>> On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> We have been experiencing a serious issue with our CephFS backup cluster running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem are constantly in a "replay" or "replay(laggy)" state and keep crashing.
>>>>>>
>>>>>> We have a single-MDS filesystem called "ceph_backup" with 2 standby MDSes, along with a 2nd unused filesystem "ceph_archive" (this holds little to no data); we are using our "ceph_backup" filesystem to back up our data and this is the one which is currently broken.
The >>>>>> Ceph >>>>>> health outputs currently are: >>>>>> >>>>>> root@pebbles-s1 14:05 [~]: ceph -s >>>>>> cluster: >>>>>> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631 >>>>>> health: HEALTH_WARN >>>>>> 1 filesystem is degraded >>>>>> insufficient standby MDS daemons available >>>>>> 1319 pgs not deep-scrubbed in time >>>>>> 1054 pgs not scrubbed in time >>>>>> >>>>>> services: >>>>>> mon: 4 daemons, quorum >>>>>> pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m) >>>>>> mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, >>>>>> pebbles-s3, pebbles-s1 >>>>>> mds: 2/2 daemons up >>>>>> osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 >>>>>> remapped pgs >>>>>> >>>>>> data: >>>>>> volumes: 1/2 healthy, 1 recovering >>>>>> pools: 7 pools, 2177 pgs >>>>>> objects: 3.55G objects, 7.0 PiB >>>>>> usage: 8.9 PiB used, 14 PiB / 23 PiB avail >>>>>> pgs: 83133528/30006841533 objects misplaced (0.277%) >>>>>> 2090 active+clean >>>>>> 47 active+clean+scrubbing+deep >>>>>> 29 active+remapped+backfilling >>>>>> 8 active+remapped+backfill_wait >>>>>> 2 active+clean+scrubbing >>>>>> 1 active+clean+snaptrim >>>>>> >>>>>> io: >>>>>> recovery: 1.9 GiB/s, 719 objects/s >>>>>> >>>>>> root@pebbles-s1 14:09 [~]: ceph fs status >>>>>> ceph_backup - 0 clients >>>>>> =========== >>>>>> RANK STATE MDS ACTIVITY DNS INOS DIRS >>>>>> CAPS >>>>>> 0 replay(laggy) pebbles-s3 0 0 0 0 >>>>>> POOL TYPE USED AVAIL >>>>>> mds_backup_fs metadata 1255G 2780G >>>>>> ec82_primary_fs_data data 0 2780G >>>>>> ec82pool data 8442T 3044T >>>>>> ceph_archive - 2 clients >>>>>> ============ >>>>>> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >>>>>> 0 active pebbles-s2 Reqs: 0 /s 13.4k 7105 118 2 >>>>>> POOL TYPE USED AVAIL >>>>>> mds_archive_fs metadata 5184M 2780G >>>>>> ec83_primary_fs_data data 0 2780G >>>>>> ec83pool data 138T 2767T >>>>>> MDS version: ceph version 17.2.7 >>>>>> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) >>>>>> root@pebbles-s1 14:09 [~]: ceph health detail | head 
>>>>>> HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS >>>>>> daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs >>>>>> not >>>>>> scrubbed in time >>>>>> [WRN] FS_DEGRADED: 1 filesystem is degraded >>>>>> fs ceph_backup is degraded >>>>>> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons >>>>>> available >>>>>> have 0; want 1 more >>>>>> >>>>>> When our cluster first ran after a reboot, Ceph ran through the 2 >>>>>> standby MDSes, crashing them all, until it reached the final MDS and >>>>>> is >>>>>> now stuck in this "replay(laggy)" state. Putting our MDSes into >>>>>> debugging mode, we can see that this MDS crashed when replaying the >>>>>> journal for a particular inode (this is the same for all the MDSes >>>>>> and >>>>>> they all crash on the same object): >>>>>> >>>>>> ... >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>>> EMetaBlob.replay for [521,head] had [inode 0x1005ba89481 >>>>>> [...539,head] >>>>>> >>>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ >>>>>> auth fragtree_t(*^2 00*^3 00000*^ >>>>>> 4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4 >>>>>> 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 >>>>>> 01110*^4 >>>>>> 01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 >>>>>> 10101*^3 >>>>>> 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22 >>>>>> T05:41:10.213700+0100 1281276=1281276+0) n(v12 >>>>>> rc2024-06-22T05:41:10.213700+0100 b1348251683896 >>>>>> 1281277=1281276+1) >>>>>> old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80] >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>>> EMetaBlob.replay dir 0x1005ba89481.011011000* >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>>> EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* >>>>>> >>>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ 
>>>>>> [2,head] auth v=436385 cv=0/0 state=107374182 >>>>>> 4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 >>>>>> rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) >>>>>> hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80] >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>>> EMetaBlob.replay added (full) [dentry >>>>>> >>>>>> #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti >>>>>> ons_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion >>>>>> lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 >>>>>> 0x557710444500] >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>>> EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] >>>>>> >>>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_ >>>>>> 005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 >>>>>> n(v0 >>>>>> 1=1+0) (iversion lock) cr={99995144=0-4194304@538} >>>>>> 0x557710438680] >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 >>>>>> mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>>> EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 >>>>>> [539,head] >>>>>> >>>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762 >>>>>> 6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth >>>>>> v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) >>>>>> cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 >>>>>> 0x557710438680] >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>>> EMetaBlob.replay inotable tablev 3112837 <= table 3112837 >>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>>> EMetaBlob.replay sessionmap v 1560540883, 
>>>>>> table 1560540882 prealloc [] used 0x1005cd4fe35
>>>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 -1
>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>>>>>> In function 'void interval_set<T, C>::erase(T, T,
>>>>>> std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
>>>>>> thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>>>>>> 568: FAILED ceph_assert(p->first <= start)
>>>>>>
>>>>>> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2)
>>>>>> quincy (stable)
>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>>> const*)+0x135) [0x7f8821e814a3]
>>>>>> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
>>>>>> 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t,
>>>>>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
>>>>>> 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
>>>>>> MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
>>>>>> 5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
>>>>>> 6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
>>>>>> 7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
>>>>>> 8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
>>>>>> 9: clone()
>>>>>>
>>>>>> I've only included a short section of the crash (this is the first
>>>>>> trace in the log with regard to the crash, with a 10/20 debug_mds
>>>>>> option). We tried deleting the 0x1005cd4fe35 object from the object
>>>>>> store using the "rados" command but this did not allow our MDS to
>>>>>> successfully replay.
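[For anyone following along: the MDS log identifies the inode in hex (0x1005cd4fe35), while the cephfs-journal-tool commands quoted below take the same inode in decimal (1101069090357). The two forms can be converted with plain printf in any POSIX shell:]

```shell
# Convert between the hex inode shown in MDS logs and the decimal form
# that cephfs-journal-tool's --inode option expects.
printf '%d\n' 0x1005cd4fe35      # hex -> decimal: 1101069090357
printf '0x%x\n' 1101069090357    # decimal -> hex: 0x1005cd4fe35
```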
>>>>>>
>>>>>> From my understanding the journal seems okay, as we didn't, for
>>>>>> example, run out of space on our metadata pool, and
>>>>>> "cephfs-journal-tool journal inspect" doesn't report any damage:
>>>>>>
>>>>>> root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0
>>>>>> journal inspect
>>>>>> Overall journal integrity: OK
>>>>>> root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0
>>>>>> event get --inode 1101069090357 summary
>>>>>> Events by type:
>>>>>>   OPEN: 1
>>>>>>   UPDATE: 3
>>>>>> Errors: 0
>>>>>> root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0
>>>>>> event get --inode 1101069090357 list
>>>>>> 2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: (openc)
>>>>>>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>> 2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE: (check_inode_max_size)
>>>>>>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>> 2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN: ()
>>>>>>   FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>> 2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: (rename)
>>>>>>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>>>>>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
>>>>>>
>>>>>> I was wondering whether anyone had any advice for us on how we
>>>>>> should proceed forward?
>>>>>> We were thinking about manually applying these events (via "event
>>>>>> apply") and, failing that, erasing this problematic event with
>>>>>> "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode
>>>>>> 1101069090357". Is this a good idea? We would rather not rebuild the
>>>>>> entire metadata pool if we can avoid it (once was enough for us) as
>>>>>> this cluster has ~9 PB of data on it.
>>>>>>
>>>>>> Kindest regards,
>>>>>>
>>>>>> Ivan Clayson
>>>>>>
>>>>>> --
>>>>>> Ivan Clayson
>>>>>> -----------------
>>>>>> Scientific Computing Officer
>>>>>> Room 2N249
>>>>>> Structural Studies
>>>>>> MRC Laboratory of Molecular Biology
>>>>>> Francis Crick Ave, Cambridge
>>>>>> CB2 0QH
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
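[For reference, the expert-mode recovery sequence pointed to as [0] earlier in the thread is sketched below, with the rank from this thread (ceph_backup:0) filled in. This is an outline under the assumption that the linked disaster-recovery-experts documentation is read in full first; every step after the export is destructive, and the session-table reset step from those docs is deliberately omitted, per the advice at the top of the thread.]

```shell
# Sketch only -- not a prescription. Assumes rank 0 of the ceph_backup
# filesystem, as in this thread. Read the disaster-recovery-experts docs
# ([0] above) before running anything; steps 1-2 are destructive.

# 0. Keep a rollback copy of the journal first.
cephfs-journal-tool --rank=ceph_backup:0 journal export backup.bin

# 1. Flush whatever dentries/inodes can be salvaged from journal events
#    into the backing store.
cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary

# 2. Truncate the journal; events not recovered above are lost.
cephfs-journal-tool --rank=ceph_backup:0 journal reset

# (The docs also list "cephfs-table-tool all reset session" at this
#  point, which is the step being skipped per the advice in this thread.)

# 3. Mark the rank repaired and let standby MDS daemons join again.
ceph mds repaired ceph_backup:0
ceph fs set ceph_backup joinable true
```

Since a pre-reset copy of the journal already exists in this case, step 0 is mainly a reminder that the export makes the later destructive steps reversible.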