Hi Tim,

Alma8's active support ended in May this year, and from now on it receives only security updates. But you make a good point, and we are moving to Alma9 very shortly!

Whilst we're mentioning distributions, we've had quite a good experience with Alma (notwithstanding our current but unrelated troubles) and we would recommend it.

Kindest regards,

Ivan

On 09/07/2024 16:19, Tim Holloway wrote:
CAUTION: This email originated from outside of the LMB: .-timh@xxxxxxxxxxxxx-. Do not click links or open attachments unless you recognize the sender and know the content is safe. If you think this is a phishing email, please forward it to phishing@xxxxxxxxxxxxxxxxx

--

Ivan,

This may be a little off-topic, but if you're still running AlmaLinux 8.9, it's worth noting that CentOS 8 actually end-of-lifed about 2 years ago, thanks to CentOS Stream.

Up until this last week, however, I had several AlmaLinux 8 machines running myself, but apparently somewhere around May IBM Red Hat pulled all of its CentOS8 enterprise sites offline, including Storage and Ceph, which broke my yum updates.

As far as I'm aware, once you've installed cephadm (whether via yum/dnf or otherwise), there's no further need for the RPM repos, but losing yum support is not helping, to say the least.

On the upside, it's possible to upgrade in place from AlmaLinux 8.9 to AlmaLinux 9, although it may require temporarily disabling certain OS services to appease the upgrade process.

Probably won't solve your problem, but at least you'll be able to move fairly painlessly to a better-supported platform.

Best Regards,
Tim

On Tue, 2024-07-09 at 11:14 +0100, Ivan Clayson wrote:

Hi Dhairya,

I would be more than happy to try and give as many details as possible, but the slack channel is private and requires my email to have an account/access to it.

Wouldn't taking the discussion about this error to a private channel also stop other users who experience this error from learning how and why this happened, and possibly from being able to view the solution? Would it not be possible to discuss this more publicly for the benefit of the other users on the mailing list?

Kindest regards,

Ivan

On 09/07/2024 10:44, Dhairya Parmar wrote:

Hey Ivan,

This is a relatively new MDS crash, so this would require some investigation, but I was instructed to recommend disaster-recovery steps [0] (except session reset) to you to get the FS up again.

This crash is being discussed on the upstream CephFS slack channel [1] with @Venky Shankar <vshankar@xxxxxxxxxx> and other CephFS devs. I'd encourage you to join the conversation; we can discuss this in detail and maybe go through the incident step by step, which should help analyse the crash better.

[0] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
[1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

Hi Dhairya,

Thank you ever so much for having another look at this so quickly. I don't think I have any logs similar to the ones you referenced this time, as my MDSs don't seem to enter the replay stage when they crash (or at least don't now after I've thrown the logs away), but those errors do crop up in the prior logs I shared when the system first crashed.

Kindest regards,

Ivan

On 08/07/2024 14:08, Dhairya Parmar wrote:

Ugh, something went horribly wrong. I've downloaded the MDS logs that contain the assertion failure and it looks relevant to this [0]. Do you have client logs for this?

The other log that you shared is being downloaded right now; once that's done and I've gone through it, I'll update you.
[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

Hi Dhairya,

Sorry to resurrect this thread again, but we unfortunately still have an issue with our filesystem after we attempted to write new backups to it.

We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and data was accessible again.

We restarted the backups to run over the weekend and unfortunately the filesystem crashed again; the log of the failure is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. We ran the backups on kernel mounts of the filesystem without the nowsync option this time to avoid the out-of-sync write problems.

I've tried resetting the journal again after recovering the dentries, but unfortunately the filesystem is still in a failed state despite setting joinable to true. The log of this crash is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.

I'm not sure how to proceed as I can't seem to get any MDS to take over the first rank. I would like to do a scrub of the filesystem and preferably overwrite the troublesome files with the originals on the live filesystem. Do you have any advice on how to make the filesystem leave its failed state? I have a backup of the journal from before I reset it so I can roll back if necessary.
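
For readers following along, the sequence referred to in this message (backing up the journal, recovering dentries, resetting the journal, marking the filesystem joinable) might be sketched roughly as below. This is only an illustration under stated assumptions: the function name `print_recovery_plan` and the backup path are hypothetical, the function prints the commands rather than running them, and it should not be read as the exact procedure used on this cluster.

```shell
#!/usr/bin/env bash
# Sketch only: prints the recovery commands discussed in this thread, it does
# NOT execute them. The function name and the backup path are hypothetical.
print_recovery_plan() {
    local fs_name="$1"   # e.g. ceph_backup
    local rank="$2"      # MDS rank, e.g. 0
    local backup="$3"    # where to keep a copy of the journal (assumed path)
    cat <<EOF
ceph fs fail ${fs_name}
cephfs-journal-tool --rank=${fs_name}:${rank} journal export ${backup}
cephfs-journal-tool --rank=${fs_name}:${rank} event recover_dentries summary
cephfs-journal-tool --rank=${fs_name}:${rank} journal reset
ceph fs set ${fs_name} joinable true
EOF
}

# Review the plan before running anything by hand.
print_recovery_plan ceph_backup 0 /root/ceph_backup-journal.bin
```

Printing the plan first keeps the destructive step (`journal reset`) behind a manual review, and the exported file is what a later `journal import` would roll back to.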
Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
            1 filesystem is degraded
            1 large omap objects
            1 filesystem is offline
            1 mds daemon damaged
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
            1750 pgs not deep-scrubbed in time
            1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped pgs
         flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs:     11785954/27384310061 objects misplaced (0.043%)
             2167 active+clean
             6    active+remapped+backfilling
             4    active+remapped+backfill_wait

ceph_backup - 0 clients
===========
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
POOL                  TYPE      USED   AVAIL
mds_backup_fs         metadata  1174G  3071G
ec82_primary_fs_data  data      0      3071G
ec82pool              data      8085T  4738T

ceph_archive - 2 clients
============
RANK  STATE   MDS         ACTIVITY    DNS    INOS  DIRS  CAPS
 0    active  pebbles-s4  Reqs: 0 /s  13.4k  7105  118   2
POOL                  TYPE      USED   AVAIL
mds_archive_fs        metadata  5184M  3071G
ec83_primary_fs_data  data      0      3071G
ec83pool              data      138T   4307T

STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

root@pebbles-s2 11:55 [~]: ceph fs dump
e2643889
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'ceph_backup' (1)
fs_name ceph_backup
epoch 2643888
flags 12 joinable allow_snaps allow_multimds_snaps
created 2023-05-19T12:52:36.302135+0100
modified 2024-07-08T11:17:55.437861+0100
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 109934182400000
required_client_features {}
last_failure 0
last_failure_osd_epoch 494515
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {}
failed
damaged 0
stopped
data_pools [6,3]
metadata_pool 2
inline_data disabled
balancer
standby_count_wanted 1

Kindest regards,

Ivan

On 28/06/2024 15:17, Dhairya Parmar wrote:

On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

Hi Dhairya,

I would be more than happy to share our corrupted journal. Has the host key changed for drop.ceph.com? The fingerprint I'm being sent is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4, which is different to the one in our /usr/share/ceph/known_hosts_drop.ceph.com.

Ah, strange. Let me get in touch with folks who might know about this; I will get back to you ASAP.

Thank you for your advice as well. We've reset our MDS' journal and are currently in the process of a full filesystem scrub, which understandably is taking quite a bit of time but seems to be progressing through the objects fine.

YAY!

Thank you ever so much for all your help and please do feel free to follow up with us if you would like any further details about our crash!

Glad to hear it went well. This bug is being worked on with high priority, and once the patch is ready it will be backported.

The root cause of this issue is `nowsync` (async dirops) being enabled by default in the kclient [0]. This feature allows asynchronous creation and deletion of files, optimizing performance by avoiding round-trip latency for these system calls. However, in very rare cases (like yours :D), it can affect the system's consistency and stability. Hence, if this kind of optimization is not a priority for your workload, I recommend turning it off by switching the mount points to `wsync` and also setting the MDS config `mds_client_delegate_inos_pct` to `0` so that you don't end up in this situation again (until the bug fix arrives :)).

[0] https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902

Kindest regards,

Ivan

On 27/06/2024 12:39, Dhairya Parmar wrote:

Hi Ivan,

The solution (which has been successful for us in the past) is to reset the journal. This would bring the fs back online and return the MDSes to a stable state, but some data would be lost: whatever is in the journal and hasn't yet been flushed to the backing store would be gone. Therefore, you should try to flush out as much journal data as possible before resetting the journal.

Here are the steps for this entire process:

1) Bring the FS offline
   $ ceph fs fail <fs_name>
2) Recover dentries from the journal (run it for every MDS rank)
   $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries summary
3) Reset the journal (again for every MDS rank)
   $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
4) Bring the FS online
   $ ceph fs set <fs_name> joinable true
5) Restart the MDSes
6) Perform a scrub to ensure consistency of the fs
   $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
   # you could try a recursive scrub, e.g. `ceph tell mds.<fs_name>:0 scrub start / recursive`

Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the most time-consuming, but the rest should be quick).
* cephfs-journal-tool and the metadata OSDs are bound to use a significant CPU percentage. This is because cephfs-journal-tool has to swig the journal data and flush it out to the backing store, which also makes the metadata operations go rampant, resulting in OSDs taking a significant percentage of CPU.

Do let me know how this goes.

On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

Hi Dhairya,

We can induce the crash by simply restarting the MDS, and the crash seems to happen when an MDS goes from up:standby to up:replay. The MDS works through a few files in the log before eventually crashing; I've included the logs for this here (this is after I imported the backed-up journal, which I hope was successful, but please let me know if you suspect it wasn't!): https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

With respect to the client logs, are you referring to the clients who are writing to the filesystem?
We don't typically run them in any sort of debug mode and we have quite a few machines running our backup system, but we can look an hour or so before the first MDS crash (though I don't know if this is when the de-sync occurred). Here are some MDS logs with regards to the initial crash on Saturday morning though, which may be helpful:

   -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
   -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-22T05:41:13.091556+0100)
   -57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2 Updating MDS map to version 2529650 from mon.3
   -56> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 6 not found in OSDMap
   -55> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
   -54> 2024-06-22T05:41:43.209+0100 7f184de84700  5 asok(0x5592e7968000) register_command objecter_requests hook 0x5592e78f8800
   -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _renew_subs
   -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
   -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
   -50> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 6 not found in OSDMap
   -49> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
   -48> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.0 apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
   -47> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 handle_mds_map i am now mds.0.2529650
   -46> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 handle_mds_map state change up:standby --> up:replay
   -45> 2024-06-22T05:41:43.209+0100 7f184de84700  5 mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay
   -44> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 replay_start
   -43> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 waiting for osdmap 473739 (which blocklists prior instance)
   -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
   -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700  2 mds.0.cache Memory usage: total 299012, rss 37624, heap 182556, baseline 182556, 0 / 0 inodes have caps, 0 caps, 0 caps per inode
   -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: _renew_subs
   -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
   -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: handle_get_version_reply finishing 1 version 473739
   -37> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening inotable
   -36> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening sessionmap
   -35> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening mds log
   -34> 2024-06-22T05:41:43.224+0100 7f1847e78700  5 mds.0.log open discovering log bounds
   -33> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening purge queue (async)
   -32> 2024-06-22T05:41:43.224+0100 7f1847e78700  4 mds.0.purge_queue open: opening
   -31> 2024-06-22T05:41:43.224+0100 7f1847e78700  1 mds.0.journaler.pq(ro) recover start
   -30> 2024-06-22T05:41:43.224+0100 7f1847e78700  1 mds.0.journaler.pq(ro) read_head
   -29> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: loading open file table (async)
   -28> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening snap table
   -27> 2024-06-22T05:41:43.224+0100 7f1847677700  4 mds.0.journalpointer Reading journal pointer '400.00000000'
   -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient: get_auth_request con 0x5592e8987000 auth_method 0
   -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10 monclient: get_auth_request con 0x5592e8987c00 auth_method 0
   -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 231160676352, expire 231163662875, write 231163662875, stream_format 1). probing for end of log (from 231163662875)...
   -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1 mds.0.journaler.pq(ro) probing for end of the log
   -22> 2024-06-22T05:41:43.252+0100 7f1847677700  1 mds.0.journaler.mdlog(ro) recover start
   -21> 2024-06-22T05:41:43.252+0100 7f1847677700  1 mds.0.journaler.mdlog(ro) read_head
   -20> 2024-06-22T05:41:43.252+0100 7f1847677700  4 mds.0.log Waiting for journal 0x200 to recover...
   -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10 monclient: get_auth_request con 0x5592e8bc6000 auth_method 0
   -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10 monclient: get_auth_request con 0x5592e8bc6800 auth_method 0
   -17> 2024-06-22T05:41:43.257+0100 7f1847e78700  1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 90131453181952, expire 90131465778558, write 90132009715463, stream_format 1). probing for end of log (from 90132009715463)...
   -16> 2024-06-22T05:41:43.257+0100 7f1847e78700  1 mds.0.journaler.mdlog(ro) probing for end of the log
   -15> 2024-06-22T05:41:43.257+0100 7f1847e78700  1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 90132019384791 (header had 90132009715463). recovered.
   -14> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log Journal 0x200 recovered.
   -13> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log Recovered journal 0x200 in format 1
   -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 231163662875 (header had 231163662875). recovered.
   -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700  4 mds.0.purge_queue operator(): open complete
   -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1 mds.0.journaler.pq(ro) set_writeable
    -9> 2024-06-22T05:41:43.441+0100 7f1847e78700  2 mds.0.2529650 Booting: 1: loading/discovering base inodes
    -8> 2024-06-22T05:41:43.441+0100 7f1847e78700  0 mds.0.cache creating system inode with ino:0x100
    -7> 2024-06-22T05:41:43.442+0100 7f1847e78700  0 mds.0.cache creating system inode with ino:0x1
    -6> 2024-06-22T05:41:43.442+0100 7f1847e78700  2 mds.0.2529650 Booting: 2: replaying mds log
    -5> 2024-06-22T05:41:43.442+0100 7f1847e78700  2 mds.0.2529650 Booting: 2: waiting for purge queue recovered
    -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: tick
    -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-22T05:41:14.091638+0100)
    -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700  2 mds.0.cache Memory usage: total 588368, rss 308304, heap 207132, baseline 182556, 0 / 15149 inodes have caps, 0 caps, 0 caps per inode
    -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f1846675700 time 2024-06-22T05:41:44.643146+0100

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f18568b64a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669]
 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885]
 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
 5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
 6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
 7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
 9: clone()

     0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 *** Caught signal (Aborted) **
 in thread 7f1846675700 thread_name:md_log_replay

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f18568b64fd]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669]
 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885]
 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
 10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
 12: clone()

We have a relatively low debug setting normally, so I don't think many details of the initial crash were captured unfortunately, and the MDS logs before the above (i.e. "-60" and older) are just beacon messages and _check_auth_rotating checks.

I was wondering whether you have any recommendations in terms of what actions we could take to bring our filesystem back into a working state short of rebuilding the entire metadata pool? We are quite keen to bring our backup back into service urgently as we currently do not have any accessible backups for our Ceph clusters.

Kindest regards,

Ivan

On 25/06/2024 19:18, Dhairya Parmar wrote:

On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

Hi Dhairya,

Thank you for your rapid reply. I tried recovering the dentries for the file just before the crash I mentioned before and then splicing the transactions from the journal, which seemed to remove that issue for that inode but resulted in the MDS crashing on the next inode in the journal when performing replay.

The MDS delegates a range of preallocated inodes (in the form of a set, interval_set<inodeno_t> preallocated_inos) to the clients, so it can be one inode that is untracked, or some inodes from the range, or in the worst-case scenario ALL of them, and this is something that even `cephfs-journal-tool` would not be able to tell (since we're talking about MDS internals which aren't exposed to such tools). That is the reason why you see "MDS crashing on the next inode in the journal when performing replay".

An option could be to expose the inode set to some tool or asok cmd to identify such inode ranges, which needs to be discussed. For now, we're trying to address this in [0]; you can follow the discussion there.

[0] https://tracker.ceph.com/issues/66251

Removing all the transactions involving the directory housing the files that seemed to cause these crashes from the journal only caused the MDS to fail to even start replay. I've rolled back our journal to our original version when the crash first happened and the entire MDS log for the crash can be found here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24

Awesome, this would help us a ton. Apart from this, would it be possible to send us client logs?

Please let us know if you would like any other log files as we can easily induce this crash.

Since you can easily induce the crash, can you share the reproducer please, i.e. what actions you take in order to hit this?
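
As an aside for readers, the failure mode described in the reply above (replay trying to remove an inode from a set of preallocated ranges that does not contain it) can be illustrated with a toy model. This is emphatically not Ceph code: the function `erase_inode`, the `start:len` range notation, and the example range are all hypothetical, and the real check lives inside `interval_set::erase` in the MDS.

```shell
#!/usr/bin/env bash
# Toy model only, NOT Ceph code. Ranges are hypothetical "start:len" pairs
# standing in for the preallocated inode ranges the MDS tracks per session.
erase_inode() {
    local ino=$(( $1 )); shift
    local r start len
    for r in "$@"; do
        start=$(( ${r%%:*} ))
        len=$(( ${r##*:} ))
        if (( ino >= start && ino < start + len )); then
            printf 'ok: inode 0x%x is inside a tracked range\n' "$ino"
            return 0
        fi
    done
    # The real MDS aborts on an assertion at this point; the toy just fails.
    printf 'assert: inode 0x%x is not in any tracked range\n' "$ino" >&2
    return 1
}

# Tracked case: the inode falls inside a (made-up) preallocated range.
erase_inode 0x1005cd4fe35 0x1005cd4fe00:256
# Untracked case, mirroring "prealloc [] used 0x1005cd4fe35" in the log.
erase_inode 0x1005cd4fe35 || echo "replay would abort here"
```

The second call has no ranges at all, which is the situation the journal entry in this thread reports, so splicing one event out only moves the failure to the next untracked inode.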

Kindest regards,

Ivan

On 25/06/2024 09:58, Dhairya Parmar wrote:

Hi Ivan,

This looks to be similar to the issue [0] that we're already addressing at [1]. Basically, some out-of-sync event led the client to make use of inodes that the MDS wasn't aware of/isn't tracking, hence the crash. It'd be really helpful if you could provide us with more logs.

CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar <vshankar@xxxxxxxxxx> @Patrick Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li <xiubli@xxxxxxxxxx>

[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251

--
Dhairya Parmar
Associate Software Engineer, CephFS
IBM, Inc.

On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

Hello,

We have been experiencing a serious issue with our CephFS backup cluster running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel (Alma8.9, 4.18.0-513.9.1 kernel), where the MDSes for our filesystem are constantly in a "replay" or "replay(laggy)" state and keep crashing.

We have a single-MDS filesystem called "ceph_backup" with 2 standby MDSes, along with a 2nd unused filesystem "ceph_archive" (this holds little to no data). We use the "ceph_backup" filesystem to back up our data, and this is the one which is currently broken.
The Ceph health outputs currently are:

root@pebbles-s1 14:05 [~]: ceph -s
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1319 pgs not deep-scrubbed in time
            1054 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
    mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
    mds: 2/2 daemons up
    osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs

  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   7 pools, 2177 pgs
    objects: 3.55G objects, 7.0 PiB
    usage:   8.9 PiB used, 14 PiB / 23 PiB avail
    pgs:     83133528/30006841533 objects misplaced (0.277%)
             2090 active+clean
             47   active+clean+scrubbing+deep
             29   active+remapped+backfilling
             8    active+remapped+backfill_wait
             2    active+clean+scrubbing
             1    active+clean+snaptrim

  io:
    recovery: 1.9 GiB/s, 719 objects/s

root@pebbles-s1 14:09 [~]: ceph fs status
ceph_backup - 0 clients
===========
RANK  STATE          MDS         ACTIVITY  DNS  INOS  DIRS  CAPS
 0    replay(laggy)  pebbles-s3            0    0     0     0
POOL                  TYPE      USED   AVAIL
mds_backup_fs         metadata  1255G  2780G
ec82_primary_fs_data  data      0      2780G
ec82pool              data      8442T  3044T

ceph_archive - 2 clients
============
RANK  STATE   MDS         ACTIVITY    DNS    INOS  DIRS  CAPS
 0    active  pebbles-s2  Reqs: 0 /s  13.4k  7105  118   2
POOL                  TYPE      USED   AVAIL
mds_archive_fs        metadata  5184M  2780G
ec83_primary_fs_data  data      0      2780G
ec83pool              data      138T   2767T

MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

root@pebbles-s1 14:09 [~]: ceph health detail | head
HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not scrubbed in time
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ceph_backup is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more

When our cluster first ran after a reboot, Ceph ran through the 2 standby MDSes, crashing them all, until it reached the final MDS and is now stuck in this "replay(laggy)" state. Putting our MDSes into debugging mode, we can see that this MDS crashed when replaying the journal for a particular inode (this is the same for all the MDSes and they all crash on the same object):

...
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay for [521,head] had [inode 0x1005ba89481 [...539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ auth fragtree_t(*^2 00*^3 00000*^4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4 01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22T05:41:10.213700+0100 1281276=1281276+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1) old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay dir 0x1005ba89481.011011000*
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ [2,head] auth v=436385 cv=0/0 state=1073741824 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 0x557710444500]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay inotable tablev 3112837 <= table 3112837
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc [] used 0x1005cd4fe35
2024-06-24T13:44:55.563+0100 7f8811c40700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f8821e814a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
 5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
 6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
 7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
 9: clone()

I've only included a short section of the crash (this is the first trace in the log with regards to the crash with a 10/20 debug_mds option). We tried deleting the 0x1005cd4fe35 object from the object store using the "rados" command but this did not allow our MDS to successfully replay.

From my understanding the journal seems okay, as we didn't run out of space on our metadata pool, for example, and "cephfs-journal-tool journal inspect" doesn't seem to think there is any damage:

root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0 journal inspect
Overall journal integrity: OK
root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 summary
Events by type:
  OPEN: 1
  UPDATE: 3
Errors: 0
root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 list
2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: (openc)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE: (check_inode_max_size)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN: ()
  FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: (rename)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc

I was wondering whether anyone had any advice for us on how we should proceed? We were thinking about manually applying these events (via "event apply") and, failing that, erasing this problematic event with "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode 1101069090357". Is this a good idea? We would rather not rebuild the entire metadata pool if we can avoid it (once was enough for us), as this cluster has ~9 PB of data on it.

Kindest regards,

Ivan Clayson

--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
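
A note for readers cross-referencing the output above: the value passed to "--inode" (1101069090357) is the decimal form of the inode that the MDS log prints in hexadecimal (0x1005cd4fe35). A minimal Python sketch of that conversion follows; the helper names are illustrative only and are not part of any Ceph tool.

```python
def inode_hex_to_dec(hex_ino: str) -> int:
    """Convert an inode number as printed in MDS logs (e.g. '0x1005cd4fe35') to decimal."""
    # int() with base 16 accepts an optional '0x' prefix.
    return int(hex_ino, 16)


def inode_dec_to_hex(dec_ino: int) -> str:
    """Convert a decimal inode number (as given to --inode) back to the hex form seen in logs."""
    return hex(dec_ino)


if __name__ == "__main__":
    print(inode_hex_to_dec("0x1005cd4fe35"))   # 1101069090357
    print(inode_dec_to_hex(1101069090357))     # 0x1005cd4fe35
```

This confirms that the journal events listed for inode 1101069090357 and the inode named in the replay crash are the same object.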
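
For readers unfamiliar with the assertion in the trace: the failing call is interval_set::erase, and the check that fails requires the interval located for the erased range to begin at or before the start of that range. The Python below is a simplified, hypothetical model of that kind of check. The class and method names are made up for illustration, it is not Ceph's implementation, and it makes no claim about why the MDS reached this state. It only shows that asking a set of ranges to erase a value it does not hold trips an assertion of this shape.

```python
import bisect


class ToyIntervalSet:
    """Toy model: non-overlapping [start, start + length) ranges keyed by start."""

    def __init__(self):
        self._starts = []  # sorted list of interval starts
        self._lens = {}    # start -> length

    def insert(self, start, length=1):
        # Simplification: assumes the new range does not overlap an existing one.
        bisect.insort(self._starts, start)
        self._lens[start] = length

    def _find_inc(self, start):
        """Return the start of the interval containing `start`, else the next one after it, else None."""
        i = bisect.bisect_right(self._starts, start)
        if i > 0:
            s = self._starts[i - 1]
            if s + self._lens[s] > start:
                return s
        return self._starts[i] if i < len(self._starts) else None

    def contains(self, value):
        p = self._find_inc(value)
        return p is not None and p <= value

    def erase(self, start, length=1):
        p = self._find_inc(start)
        assert p is not None, "no interval at or after start"
        # Analogue of the check seen in the trace: the located interval
        # must begin at or before the range being erased.
        assert p <= start, "p->first <= start"
        before = start - p
        plen = self._lens[p]
        assert plen >= before + length, "range extends past interval"
        after = plen - before - length
        # Drop the old interval and re-add whatever remains on either side.
        self._starts.remove(p)
        del self._lens[p]
        if before:
            self.insert(p, before)
        if after:
            self.insert(start + length, after)


if __name__ == "__main__":
    free = ToyIntervalSet()
    free.insert(100, 10)      # holds 100..109
    free.erase(105)           # fine: 105 is in the set
    try:
        free.erase(50)        # 50 was never in the set
    except AssertionError as exc:
        print("assertion failed:", exc)
```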