Ugh, something went horribly wrong. I've downloaded the MDS log that contains the assertion failure and it looks related to this [0]. Do you have client logs for this? The other log that you shared is downloading right now; once that's done and I've gone through it, I'll update you.

[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

> Hi Dhairya,
>
> Sorry to resurrect this thread again, but unfortunately we still have an issue with our filesystem after we attempted to write new backups to it.
>
> We finished the scrub of the filesystem on Friday and ran a repair scrub on the one directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and the data was accessible again.
>
> We restarted the backups over the weekend and unfortunately the filesystem crashed again; the log of the failure is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. We ran the backups on kernel mounts of the filesystem without the nowsync option this time to avoid the out-of-sync write problems.
>
> I've tried resetting the journal again after recovering the dentries, but unfortunately the filesystem is still in a failed state despite setting joinable to true. The log of this crash is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708
>
> I'm not sure how to proceed as I can't seem to get any MDS to take over the first rank. I would like to do a scrub of the filesystem and preferably overwrite the troublesome files with the originals on the live filesystem. Do you have any advice on how to make the filesystem leave its failed state? I have a backup of the journal from before I reset it, so I can roll back if necessary.
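>
> In case it's useful, this is roughly what the remount and the journal backup looked like; the exact mount line and paths here are illustrative rather than verbatim:
>
> $ mount -t ceph <mon-host>:/ /mnt/ceph_backup -o name=<client-name>,wsync
> $ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup-journal-20240708.bin
>
> and I assume rolling that backup back in, if we need to, would just be:
>
> $ cephfs-journal-tool --rank=ceph_backup:0 journal import /root/ceph_backup-journal-20240708.bin
>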
> > Here are some details about the filesystem at present: > > root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status > cluster: > id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631 > health: HEALTH_ERR > 1 filesystem is degraded > 1 large omap objects > 1 filesystem is offline > 1 mds daemon damaged > > nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set > 1750 pgs not deep-scrubbed in time > 1612 pgs not scrubbed in time > > services: > mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 > (age 50m) > mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, > pebbles-s4 > mds: 1/2 daemons up, 3 standby > osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped > pgs > flags > nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim > > data: > volumes: 1/2 healthy, 1 recovering; 1 damaged > pools: 7 pools, 2177 pgs > objects: 3.24G objects, 6.7 PiB > usage: 8.6 PiB used, 14 PiB / 23 PiB avail > pgs: 11785954/27384310061 objects misplaced (0.043%) > 2167 active+clean > 6 active+remapped+backfilling > 4 active+remapped+backfill_wait > > ceph_backup - 0 clients > =========== > RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS > 0 failed > POOL TYPE USED AVAIL > mds_backup_fs metadata 1174G 3071G > ec82_primary_fs_data data 0 3071G > ec82pool data 8085T 4738T > ceph_archive - 2 clients > ============ > RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS > 0 active pebbles-s4 Reqs: 0 /s 13.4k 7105 118 2 > POOL TYPE USED AVAIL > mds_archive_fs metadata 5184M 3071G > ec83_primary_fs_data data 0 3071G > ec83pool data 138T 4307T > STANDBY MDS > pebbles-s2 > pebbles-s3 > pebbles-s1 > MDS version: ceph version 17.2.7 > (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) > root@pebbles-s2 11:55 [~]: ceph fs dump > e2643889 > enable_multiple, ever_enabled_multiple: 1,1 > default compat: compat={},rocompat={},incompat={1=base v0.20,2=client > writeable ranges,3=default file layouts on dirs,4=dir inode in separate > object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no > anchor table,9=file layout v2,10=snaprealm v2} > legacy client fscid: 1 > > Filesystem 'ceph_backup' (1) > fs_name ceph_backup > epoch 2643888 > flags 12 joinable allow_snaps allow_multimds_snaps > created 2023-05-19T12:52:36.302135+0100 > modified 2024-07-08T11:17:55.437861+0100 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 109934182400000 > required_client_features {} > last_failure 0 > last_failure_osd_epoch 494515 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable > ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline > data,8=no anchor table,9=file layout v2,10=snaprealm v2} > max_mds 1 > in 0 > up {} > failed > damaged 0 > stopped > data_pools [6,3] > metadata_pool 2 > inline_data disabled > balancer > standby_count_wanted 1 > > > Kindest regards, > > Ivan > On 28/06/2024 15:17, Dhairya Parmar wrote: > > CAUTION: This email originated from outside of the LMB: > *.-dparmar@xxxxxxxxxx-.* > Do not click links or open attachments unless you recognize the sender and > know the content is safe. > If you think this is a phishing email, please forward it to > phishing@xxxxxxxxxxxxxxxxx > > > -- > > > On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> > wrote: > >> Hi Dhairya, >> >> I would be more than happy to share our corrupted journal. Has the host >> key changed for drop.ceph.com? 
The fingerprint I'm being sent is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4, which is different from the one in our /usr/share/ceph/known_hosts_drop.ceph.com.
>>
> Ah, strange. Let me get in touch with folks who might know about this; I'll get back to you ASAP.
>
>> Thank you for your advice as well. We've reset our MDS's journal and are currently in the process of a full filesystem scrub, which understandably is taking quite a bit of time but seems to be progressing through the objects fine.
>>
> YAY!
>
>> Thank you ever so much for all your help and please do feel free to follow up with us if you would like any further details about our crash!
>>
> Glad to hear it went well. This bug is being worked on with high priority and once the patch is ready, the fix will be backported.
>
> The root cause of this issue is `nowsync` (async dirops) being enabled by default in the kernel client [0]. This feature allows asynchronous creation and deletion of files, optimizing performance by avoiding round-trip latency for these system calls. However, in very rare cases (like yours :D), it can affect the system's consistency and stability, so if this kind of optimization is not a priority for your workload, I recommend turning it off by switching the mount points to `wsync` and also setting the MDS config `mds_client_delegate_inos_pct` to `0` so that you don't end up in this situation again (until the bug fix arrives :)).
>
> [0] https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
>
>> Kindest regards,
>>
>> Ivan
>> On 27/06/2024 12:39, Dhairya Parmar wrote:
>>
>> Hi Ivan,
>>
>> The solution (which has been successful for us in the past) is to reset the journal. This would bring the fs back online and return the MDSes to a stable state, but some data would be lost: the data in the journal that hasn't been flushed to the backing store would be gone. Therefore, you should try to flush out as much journal data as possible before resetting the journal.
>>
>> Here are the steps for this entire process:
>>
>> 1) Bring the FS offline
>> $ ceph fs fail <fs_name>
>>
>> 2) Recover dentries from the journal (run it for every MDS rank)
>> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries summary
>>
>> 3) Reset the journal (again for every MDS rank)
>> $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
>>
>> 4) Bring the FS online
>> $ ceph fs set <fs_name> joinable true
>>
>> 5) Restart the MDSes
>>
>> 6) Perform a scrub to ensure consistency of the fs
>> $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
>> # you could try a recursive scrub, e.g. `ceph tell mds.<fs_name>:0 scrub start / recursive`
>>
>> Some important notes to keep in mind:
>> * Recovering dentries will take time (generally, rank 0 is the most time-consuming, but the rest should be quick).
>> * cephfs-journal-tool and the metadata OSDs are bound to use a significant percentage of CPU. This is because cephfs-journal-tool has to read through the whole journal and flush it out to the backing store, which also generates a burst of metadata operations, so the OSDs hosting the metadata pool will be kept busy.
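>>
>> For your setup, where ceph_backup has a single active MDS (rank 0), the whole sequence would look roughly like this; treat it as a sketch, where the journal export is only an optional safety copy and its path is an example:
>>
>> $ ceph fs fail ceph_backup
>> $ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup-journal.bin
>> $ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
>> $ cephfs-journal-tool --rank=ceph_backup:0 journal reset
>> $ ceph fs set ceph_backup joinable true
>> # restart the MDS daemons, then scrub
>> $ ceph tell mds.ceph_backup:0 scrub start / recursive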
>> >> Do let me know how this goes. >> >> On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> >> wrote: >> >>> Hi Dhairya, >>> >>> We can induce the crash by simply restarting the MDS and the crash seems >>> to happen when an MDS goes from up:standby to up:replay. The MDS works >>> through a few files in the log before eventually crashing where I've >>> included the logs for this here (this is after I imported the backed up >>> journal which I hope was successful but please let me know if you suspect >>> it wasn't!): >>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log >>> >>> With respect to the client logs, are you referring to the clients who >>> are writing to the filesystem? We don't typically run them in any sort of >>> debug mode and we have quite a few machines running our backup system but >>> we can look an hour or so before the first MDS crash (though I don't know >>> if this is when the de-sync occurred). Here are some MDS logs with regards >>> to the initial crash on Saturday morning though which may be helpful: >>> >>> -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick >>> -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: >>> _check_auth_rotating have uptodate secrets (they expire after >>> 2024-06-22T05:41:13.091556+0100) >>> -57> 2024-06-22T05:41:43.208+0100 7f184de84700 1 mds.pebbles-s2 >>> Updating MDS map to version 2529650 from mon.3 >>> -56> 2024-06-22T05:41:43.208+0100 7f184de84700 4 mds.0.purge_queue >>> operator(): data pool 6 not found in OSDMap >>> -55> 2024-06-22T05:41:43.208+0100 7f184de84700 4 mds.0.purge_queue >>> operator(): data pool 3 not found in OSDMap >>> -54> 2024-06-22T05:41:43.209+0100 7f184de84700 5 >>> asok(0x5592e7968000) register_command objecter_requests hook 0x5592e78f8800 >>> -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >>> _renew_subs >>> -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >>> -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 >>> log_channel(cluster) update_config to_monitors: true to_syslog: false >>> syslog_facility: prio: info to_graylog: false graylog_host: 127.0.0.1 >>> graylog_port: 12201) >>> -50> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.purge_queue >>> operator(): data pool 6 not found in OSDMap >>> -49> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.purge_queue >>> operator(): data pool 3 not found in OSDMap >>> -48> 2024-06-22T05:41:43.209+0100 7f184de84700 4 mds.0.0 >>> apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0) >>> -47> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >>> handle_mds_map i am now mds.0.2529650 >>> -46> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >>> handle_mds_map state change up:standby --> up:replay >>> -45> 2024-06-22T05:41:43.209+0100 7f184de84700 5 >>> mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay >>> -44> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >>> replay_start >>> -43> 2024-06-22T05:41:43.209+0100 7f184de84700 1 mds.0.2529650 >>> waiting for osdmap 473739 (which blocklists prior instance) >>> -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: >>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >>> -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700 2 mds.0.cache Memory >>> usage: total 299012, rss 37624, heap 182556, baseline 182556, 0 / 0 inodes >>> have caps, 0 caps, 0 caps per inode >>> -40> 
2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >>> _renew_subs >>> -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0 >>> -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: >>> handle_get_version_reply finishing 1 version 473739 >>> -37> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 0: opening inotable >>> -36> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 0: opening sessionmap >>> -35> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 0: opening mds log >>> -34> 2024-06-22T05:41:43.224+0100 7f1847e78700 5 mds.0.log open >>> discovering log bounds >>> -33> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 0: opening purge queue (async) >>> -32> 2024-06-22T05:41:43.224+0100 7f1847e78700 4 mds.0.purge_queue >>> open: opening >>> -31> 2024-06-22T05:41:43.224+0100 7f1847e78700 1 >>> mds.0.journaler.pq(ro) recover start >>> -30> 2024-06-22T05:41:43.224+0100 7f1847e78700 1 >>> mds.0.journaler.pq(ro) read_head >>> -29> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 0: loading open file table (async) >>> -28> 2024-06-22T05:41:43.224+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 0: opening snap table >>> -27> 2024-06-22T05:41:43.224+0100 7f1847677700 4 >>> mds.0.journalpointer Reading journal pointer '400.00000000' >>> -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient: >>> get_auth_request con 0x5592e8987000 auth_method 0 >>> -25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10 monclient: >>> get_auth_request con 0x5592e8987c00 auth_method 0 >>> -24> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1 >>> mds.0.journaler.pq(ro) _finish_read_head loghead(trim 231160676352, expire >>> 231163662875, write 231163662875, stream_format 1). probing for end of log >>> (from 231163662875)... >>> -23> 2024-06-22T05:41:43.252+0100 7f1848e7a700 1 >>> mds.0.journaler.pq(ro) probing for end of the log >>> -22> 2024-06-22T05:41:43.252+0100 7f1847677700 1 >>> mds.0.journaler.mdlog(ro) recover start >>> -21> 2024-06-22T05:41:43.252+0100 7f1847677700 1 >>> mds.0.journaler.mdlog(ro) read_head >>> -20> 2024-06-22T05:41:43.252+0100 7f1847677700 4 mds.0.log Waiting >>> for journal 0x200 to recover... >>> -19> 2024-06-22T05:41:43.252+0100 7f1850689700 10 monclient: >>> get_auth_request con 0x5592e8bc6000 auth_method 0 >>> -18> 2024-06-22T05:41:43.253+0100 7f185168b700 10 monclient: >>> get_auth_request con 0x5592e8bc6800 auth_method 0 >>> -17> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >>> mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 90131453181952, >>> expire 90131465778558, write 90132009715463, stream_format 1). probing for >>> end of log (from 90132009715463)... >>> -16> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >>> mds.0.journaler.mdlog(ro) probing for end of the log >>> -15> 2024-06-22T05:41:43.257+0100 7f1847e78700 1 >>> mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 90132019384791 >>> (header had 90132009715463). recovered. >>> -14> 2024-06-22T05:41:43.257+0100 7f1847677700 4 mds.0.log Journal >>> 0x200 recovered. >>> -13> 2024-06-22T05:41:43.257+0100 7f1847677700 4 mds.0.log Recovered >>> journal 0x200 in format 1 >>> -12> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1 >>> mds.0.journaler.pq(ro) _finish_probe_end write_pos = 231163662875 (header >>> had 231163662875). recovered. 
>>> -11> 2024-06-22T05:41:43.273+0100 7f1848e7a700 4 mds.0.purge_queue >>> operator(): open complete >>> -10> 2024-06-22T05:41:43.273+0100 7f1848e7a700 1 >>> mds.0.journaler.pq(ro) set_writeable >>> -9> 2024-06-22T05:41:43.441+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 1: loading/discovering base inodes >>> -8> 2024-06-22T05:41:43.441+0100 7f1847e78700 0 mds.0.cache >>> creating system inode with ino:0x100 >>> -7> 2024-06-22T05:41:43.442+0100 7f1847e78700 0 mds.0.cache >>> creating system inode with ino:0x1 >>> -6> 2024-06-22T05:41:43.442+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 2: replaying mds log >>> -5> 2024-06-22T05:41:43.442+0100 7f1847e78700 2 mds.0.2529650 >>> Booting: 2: waiting for purge queue recovered >>> -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: tick >>> -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: >>> _check_auth_rotating have uptodate secrets (they expire after >>> 2024-06-22T05:41:14.091638+0100) >>> -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700 2 mds.0.cache Memory >>> usage: total 588368, rss 308304, heap 207132, baseline 182556, 0 / 15149 >>> inodes have caps, 0 caps, 0 caps per inode >>> -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1 >>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >>> In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, >>> T)>) [with T = inodeno_t; C = std::map]' thread 7f1846675700 time >>> 2024-06-22T05:41:44.643146+0100 >>> >>> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy >>> (stable) >>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>> const*)+0x135) [0x7f18568b64a3] >>> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669] >>> 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885] >>> 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, >>> MDPeerUpdate*)+0x4377) [0x5592e532c7b7] >>> 5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1] >>> 6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b] >>> 7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041] >>> 8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca] >>> 9: clone() >>> >>> 0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 *** Caught signal >>> (Aborted) ** >>> in thread 7f1846675700 thread_name:md_log_replay >>> >>> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy >>> (stable) >>> 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0] >>> 2: gsignal() >>> 3: abort() >>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>> const*)+0x18f) [0x7f18568b64fd] >>> 5: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669] >>> 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, >>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885] >>> 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, >>> MDPeerUpdate*)+0x4377) [0x5592e532c7b7] >>> 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1] >>> 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b] >>> 10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041] >>> 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca] >>> 12: clone() >>> >>> We have a relatively low debug setting normally so I don't think many >>> details of the initial crash were captured unfortunately and the MDS logs >>> before the above (i.e. 
"-60" and older) are just beacon messages and >>> _check_auth_rotating checks. >>> >>> I was wondering whether you have any recommendations in terms of what >>> actions we could take to bring our filesystem back into a working state >>> short of rebuilding the entire metadata pool? We are quite keen to bring >>> our backup back into service urgently as we currently do not have any >>> accessible backups for our Ceph clusters. >>> >>> Kindest regards, >>> >>> Ivan >>> On 25/06/2024 19:18, Dhairya Parmar wrote: >>> >>> CAUTION: This email originated from outside of the LMB: >>> *.-dparmar@xxxxxxxxxx-.* >>> Do not click links or open attachments unless you recognize the sender >>> and know the content is safe. >>> If you think this is a phishing email, please forward it to >>> phishing@xxxxxxxxxxxxxxxxx >>> >>> >>> -- >>> >>> >>> On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> >>> wrote: >>> >>>> Hi Dhairya, >>>> >>>> Thank you for your rapid reply. I tried recovering the dentries for the >>>> file just before the crash I mentioned before and then splicing the >>>> transactions from the journal which seemed to remove that issue for that >>>> inode but resulted in the MDS crashing on the next inode in the >>>> journal when performing replay. >>>> >>> The MDS delegates a range of preallocated inodes (in form of a set - >>> interval_set<inodeno_t> preallocated_inos) to the clients, so it can be one >>> inode that is untracked or some inodes from the range or in worst case >>> scenario - ALL, and this is something that even the `cephfs-journal-tool` >>> would not be able to tell (since we're talking about MDS internals which >>> aren't exposed to such tools). That is the reason why you see "MDS crashing >>> on the next inode in the journal when performing replay". >>> >>> An option could be to expose the inode set to some tool or asok cmd to >>> identify such inodes ranges, which needs to be discussed. For now, we're >>> trying to address this in [0], you can follow the discussion there. >>> >>> [0] https://tracker.ceph.com/issues/66251 >>> >>>> Removing all the transactions involving the directory housing the files >>>> that seemed to cause these crashes from the journal only caused the MDS to >>>> fail to even start replay. >>>> >>> I've rolled back our journal to our original version when the crash >>>> first happened and the entire MDS log for the crash can be found here: >>>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24 >>>> >>> Awesome, this would help us a ton. Apart from this, would it be possible >>> to send us client logs? >>> >>>> Please let us know if you would like any other logs file as we can >>>> easily induce this crash. >>>> >>> Since you can easily induce the crash, can you share the reproducer >>> please i.e. what all action you take in order to hit this? >>> >>>> Kindest regards, >>>> >>>> Ivan >>>> On 25/06/2024 09:58, Dhairya Parmar wrote: >>>> >>>> CAUTION: This email originated from outside of the LMB: >>>> *.-dparmar@xxxxxxxxxx-.* >>>> Do not click links or open attachments unless you recognize the sender >>>> and know the content is safe. >>>> If you think this is a phishing email, please forward it to >>>> phishing@xxxxxxxxxxxxxxxxx >>>> >>>> >>>> -- >>>> Hi Ivan, >>>> >>>> This looks to be similar to the issue [0] that we're already addressing >>>> at [1]. 
So basically there is some out-of-sync event that led the client to >>>> make use of the inodes that MDS wasn't aware of/isn't tracking and hence >>>> the crash. It'd be really helpful if you can provide us more logs. >>>> >>>> CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar >>>> <vshankar@xxxxxxxxxx> @Patrick Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li >>>> <xiubli@xxxxxxxxxx> >>>> >>>> [0] https://tracker.ceph.com/issues/61009 >>>> [1] https://tracker.ceph.com/issues/66251 >>>> -- >>>> *Dhairya Parmar* >>>> >>>> Associate Software Engineer, CephFS >>>> >>>> <https://www.redhat.com/>IBM, Inc. >>>> >>>> On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> >>>> wrote: >>>> >>>>> Hello, >>>>> >>>>> We have been experiencing a serious issue with our CephFS backup >>>>> cluster >>>>> running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel >>>>> (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem >>>>> are >>>>> constantly in a "replay" or "replay(laggy)" state and keep crashing. >>>>> >>>>> We have a single MDS filesystem called "ceph_backup" with 2 standby >>>>> MDSes along with a 2nd unused filesystem "ceph_archive" (this holds >>>>> little to no data) where we are using our "ceph_backup" filesystem to >>>>> backup our data and this is the one which is currently broken. The >>>>> Ceph >>>>> health outputs currently are: >>>>> >>>>> root@pebbles-s1 14:05 [~]: ceph -s >>>>> cluster: >>>>> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631 >>>>> health: HEALTH_WARN >>>>> 1 filesystem is degraded >>>>> insufficient standby MDS daemons available >>>>> 1319 pgs not deep-scrubbed in time >>>>> 1054 pgs not scrubbed in time >>>>> >>>>> services: >>>>> mon: 4 daemons, quorum >>>>> pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m) >>>>> mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, >>>>> pebbles-s3, pebbles-s1 >>>>> mds: 2/2 daemons up >>>>> osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 >>>>> remapped pgs >>>>> >>>>> data: >>>>> volumes: 1/2 healthy, 1 recovering >>>>> pools: 7 pools, 2177 pgs >>>>> objects: 3.55G objects, 7.0 PiB >>>>> usage: 8.9 PiB used, 14 PiB / 23 PiB avail >>>>> pgs: 83133528/30006841533 objects misplaced (0.277%) >>>>> 2090 active+clean >>>>> 47 active+clean+scrubbing+deep >>>>> 29 active+remapped+backfilling >>>>> 8 active+remapped+backfill_wait >>>>> 2 active+clean+scrubbing >>>>> 1 active+clean+snaptrim >>>>> >>>>> io: >>>>> recovery: 1.9 GiB/s, 719 objects/s >>>>> >>>>> root@pebbles-s1 14:09 [~]: ceph fs status >>>>> ceph_backup - 0 clients >>>>> =========== >>>>> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >>>>> 0 replay(laggy) pebbles-s3 0 0 0 0 >>>>> POOL TYPE USED AVAIL >>>>> mds_backup_fs metadata 1255G 2780G >>>>> ec82_primary_fs_data data 0 2780G >>>>> ec82pool data 8442T 3044T >>>>> ceph_archive - 2 clients >>>>> ============ >>>>> RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS >>>>> 0 active pebbles-s2 Reqs: 0 /s 13.4k 7105 118 2 >>>>> POOL TYPE USED AVAIL >>>>> mds_archive_fs metadata 5184M 2780G >>>>> ec83_primary_fs_data data 0 2780G >>>>> ec83pool data 138T 2767T >>>>> MDS version: ceph version 17.2.7 >>>>> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) >>>>> root@pebbles-s1 14:09 [~]: ceph health detail | head >>>>> HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS >>>>> daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not >>>>> scrubbed in time >>>>> [WRN] FS_DEGRADED: 1 filesystem is degraded >>>>> fs ceph_backup is degraded >>>>> [WRN] 
MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons >>>>> available >>>>> have 0; want 1 more >>>>> >>>>> When our cluster first ran after a reboot, Ceph ran through the 2 >>>>> standby MDSes, crashing them all, until it reached the final MDS and >>>>> is >>>>> now stuck in this "replay(laggy)" state. Putting our MDSes into >>>>> debugging mode, we can see that this MDS crashed when replaying the >>>>> journal for a particular inode (this is the same for all the MDSes and >>>>> they all crash on the same object): >>>>> >>>>> ... >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>> EMetaBlob.replay for [521,head] had [inode 0x1005ba89481 >>>>> [...539,head] >>>>> >>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ >>>>> auth fragtree_t(*^2 00*^3 00000*^ >>>>> 4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4 >>>>> 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 >>>>> 01110*^4 >>>>> 01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 >>>>> 10101*^3 >>>>> 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22 >>>>> T05:41:10.213700+0100 1281276=1281276+0) n(v12 >>>>> rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1) >>>>> old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80] >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>> EMetaBlob.replay dir 0x1005ba89481.011011000* >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>> EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* >>>>> >>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ >>>>> [2,head] auth v=436385 cv=0/0 state=107374182 >>>>> 4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 >>>>> rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) >>>>> hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80] >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>> EMetaBlob.replay added (full) [dentry >>>>> >>>>> #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti >>>>> ons_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion >>>>> lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 >>>>> 0x557710444500] >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>> EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] >>>>> >>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_ >>>>> 005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 >>>>> n(v0 >>>>> 1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680] >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 >>>>> mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>> EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 >>>>> [539,head] >>>>> >>>>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762 >>>>> 6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth >>>>> v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) >>>>> cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 >>>>> 0x557710438680] >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal >>>>> EMetaBlob.replay inotable tablev 3112837 <= table 3112837 >>>>> 2024-06-24T13:44:55.563+0100 
7f8811c40700 10 mds.0.journal >>>>> EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc >>>>> [] used 0x1005cd4fe35 >>>>> 2024-06-24T13:44:55.563+0100 7f8811c40700 -1 >>>>> >>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >>>>> I >>>>> n function 'void interval_set<T, C>::erase(T, T, >>>>> std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' >>>>> thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100 >>>>> >>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: >>>>> 568: FAILED ceph_assert(p->first <= start) >>>>> >>>>> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) >>>>> quincy (stable) >>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>>>> const*)+0x135) [0x7f8821e814a3] >>>>> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) >>>>> [0x7f8821e81669] >>>>> 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, >>>>> inodeno_t, >>>>> std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885] >>>>> 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, >>>>> MDPeerUpdate*)+0x4377) [0x5576f9eb77b7] >>>>> 5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1] >>>>> 6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b] >>>>> 7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041] >>>>> 8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca] >>>>> 9: clone() >>>>> >>>>> I've only included a short section of the crash (this is the first >>>>> trace >>>>> in the log with regards to the crash with a 10/20 debug_mds option). >>>>> We >>>>> tried deleting the 0x1005cd4fe35 object from the object store using >>>>> the >>>>> "rados" command but this did not allow our MDS to successfully replay. 
>>>>> >>>>> From my understanding the journal seems okay as we didn't run out of >>>>> space for example on our metadata pool and "cephfs-journal-tool >>>>> journal >>>>> inspect" doesn't seem to think there is any damage: >>>>> >>>>> root@pebbles-s1 13:58 [~]: cephfs-journal-tool >>>>> --rank=ceph_backup:0 >>>>> journal inspect >>>>> Overall journal integrity: OK >>>>> root@pebbles-s1 14:04 [~]: cephfs-journal-tool >>>>> --rank=ceph_backup:0 >>>>> event get --inode 1101069090357 summary >>>>> Events by type: >>>>> OPEN: 1 >>>>> UPDATE: 3 >>>>> Errors: 0 >>>>> root@pebbles-s1 14:05 [~]: cephfs-journal-tool >>>>> --rank=ceph_backup:0 >>>>> event get --inode 1101069090357 list >>>>> 2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: (openc) >>>>> >>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>>> 2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE: >>>>> (check_inode_max_size) >>>>> >>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>>> >>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>>> 2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN: () >>>>> >>>>> FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>>> 2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: (rename) >>>>> >>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial >>>>> >>>>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc >>>>> >>>>> I was wondering whether anyone had any advice for us on how we should >>>>> proceed forward? We were thinking about manually applying these events >>>>> (via "event apply") where failing that we could erase this problematic >>>>> event with "cephfs-journal-tool --rank=ceph_backup:0 event splice >>>>> --inode 1101069090357". Is this a good idea? We would rather not >>>>> rebuild >>>>> the entire metadata pool if we could avoid it (once was enough for us) >>>>> as this cluster has ~9 PB of data on it. 
>>>>> >>>>> Kindest regards, >>>>> >>>>> Ivan Clayson >>>>> >>>>> -- >>>>> Ivan Clayson >>>>> ----------------- >>>>> Scientific Computing Officer >>>>> Room 2N249 >>>>> Structural Studies >>>>> MRC Laboratory of Molecular Biology >>>>> Francis Crick Ave, Cambridge >>>>> CB2 0QH >>>>> _______________________________________________ >>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>>> >>>> -- >>>> Ivan Clayson >>>> ----------------- >>>> Scientific Computing Officer >>>> Room 2N249 >>>> Structural Studies >>>> MRC Laboratory of Molecular Biology >>>> Francis Crick Ave, Cambridge >>>> CB2 0QH >>>> >>>> -- >>> Ivan Clayson >>> ----------------- >>> Scientific Computing Officer >>> Room 2N249 >>> Structural Studies >>> MRC Laboratory of Molecular Biology >>> Francis Crick Ave, Cambridge >>> CB2 0QH >>> >>> -- >> Ivan Clayson >> ----------------- >> Scientific Computing Officer >> Room 2N249 >> Structural Studies >> MRC Laboratory of Molecular Biology >> Francis Crick Ave, Cambridge >> CB2 0QH >> >> -- > Ivan Clayson > ----------------- > Scientific Computing Officer > Room 2N249 > Structural Studies > MRC Laboratory of Molecular Biology > Francis Crick Ave, Cambridge > CB2 0QH > > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx