Hi Ivan,

This looks similar to the issue [0] that we're already addressing in [1]. Essentially, some out-of-sync event led the client to use inodes that the MDS wasn't aware of/isn't tracking, hence the crash. It would be really helpful if you could provide us with more logs (a sketch of what we usually ask for is inline below, just after your status output).

CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar <vshankar@xxxxxxxxxx> @Patrick Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li <xiubli@xxxxxxxxxx>

[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251

--
Dhairya Parmar
Associate Software Engineer, CephFS
IBM, Inc.

On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

> Hello,
>
> We have been experiencing a serious issue with our CephFS backup cluster
> running quincy (version 17.2.7) on a RHEL8 derivative (Alma8.9, kernel
> 4.18.0-513.9.1), where the MDSes for our filesystem are constantly in a
> "replay" or "replay(laggy)" state and keep crashing.
>
> We have a single-MDS filesystem called "ceph_backup" with 2 standby
> MDSes, along with a 2nd, unused filesystem "ceph_archive" (this holds
> little to no data). We use "ceph_backup" to back up our data, and it is
> the filesystem that is currently broken. The Ceph health outputs
> currently are:
>
> root@pebbles-s1 14:05 [~]: ceph -s
>   cluster:
>     id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             insufficient standby MDS daemons available
>             1319 pgs not deep-scrubbed in time
>             1054 pgs not scrubbed in time
>
>   services:
>     mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
>     mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
>     mds: 2/2 daemons up
>     osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs
>
>   data:
>     volumes: 1/2 healthy, 1 recovering
>     pools:   7 pools, 2177 pgs
>     objects: 3.55G objects, 7.0 PiB
>     usage:   8.9 PiB used, 14 PiB / 23 PiB avail
>     pgs:     83133528/30006841533 objects misplaced (0.277%)
>              2090 active+clean
>              47   active+clean+scrubbing+deep
>              29   active+remapped+backfilling
>              8    active+remapped+backfill_wait
>              2    active+clean+scrubbing
>              1    active+clean+snaptrim
>
>   io:
>     recovery: 1.9 GiB/s, 719 objects/s
>
> root@pebbles-s1 14:09 [~]: ceph fs status
> ceph_backup - 0 clients
> ===========
> RANK      STATE         MDS      ACTIVITY    DNS    INOS   DIRS   CAPS
>  0    replay(laggy)  pebbles-s3                 0      0      0      0
>          POOL            TYPE     USED   AVAIL
>     mds_backup_fs       metadata  1255G  2780G
> ec82_primary_fs_data      data       0   2780G
>        ec82pool           data    8442T  3044T
> ceph_archive - 2 clients
> ============
> RANK  STATE     MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
>  0    active  pebbles-s2  Reqs: 0 /s  13.4k   7105    118      2
>          POOL            TYPE     USED   AVAIL
>     mds_archive_fs      metadata  5184M  2780G
> ec83_primary_fs_data      data       0   2780G
>        ec83pool           data     138T  2767T
> MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>
> root@pebbles-s1 14:09 [~]: ceph health detail | head
> HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not scrubbed in time
> [WRN] FS_DEGRADED: 1 filesystem is degraded
>     fs ceph_backup is degraded
> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
>     have 0; want 1 more
>
> When our cluster first ran after a reboot, Ceph ran through the 2
> standby MDSes, crashing them all, until it reached the final MDS and is
> now stuck in this "replay(laggy)" state.
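To get the extra detail we need, the crash reports from the standbys that already failed plus a higher MDS debug level during the next replay attempt would be ideal. A rough sketch of what we usually ask for (this assumes the crash module is enabled and that very verbose logging is acceptable for a short window; "<crash-id>" is a placeholder):

    # dump the crash reports already collected from the failed MDS daemons
    ceph crash ls
    ceph crash info <crash-id>

    # raise MDS verbosity before the next replay attempt, and revert afterwards
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1

The resulting MDS log (typically /var/log/ceph/ceph-mds.*.log on the node hosting the replaying daemon) is what we'd like to look at.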
> Putting our MDSes into debugging mode, we can see that this MDS crashed
> when replaying the journal for a particular inode (this is the same for
> all the MDSes and they all crash on the same object):
>
> ...
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay for [521,head] had [inode 0x1005ba89481 [...539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ auth fragtree_t(*^2 00*^3 00000*^4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4 01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22T05:41:10.213700+0100 1281276=1281276+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1) old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay dir 0x1005ba89481.011011000*
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ [2,head] auth v=436385 cv=0/0 state=1073741824 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 0x557710444500]
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680]
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680]
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay inotable tablev 3112837 <= table 3112837
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc [] used 0x1005cd4fe35
> 2024-06-24T13:44:55.563+0100 7f8811c40700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)
>
>  ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f8821e814a3]
>  2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
>  3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
>  4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
>  5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
>  6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
>  7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
>  8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
>  9: clone()
>
> I've only included a short section of the crash log (this is the first
> trace in the log relating to the crash, captured with a 10/20 debug_mds
> option). We tried deleting the 0x1005cd4fe35 object from the object
> store using the "rados" command, but this did not allow our MDS to
> successfully replay.
>
> From my understanding, the journal seems okay: we didn't, for example,
> run out of space on our metadata pool, and "cephfs-journal-tool journal
> inspect" doesn't seem to think there is any damage:
>
> root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0 journal inspect
> Overall journal integrity: OK
> root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 summary
> Events by type:
>   OPEN: 1
>   UPDATE: 3
> Errors: 0
> root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 list
> 2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: (openc)
>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
> 2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE: (check_inode_max_size)
>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
> 2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN: ()
>   FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
> 2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: (rename)
>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>   test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
>
> I was wondering whether anyone had any advice on how we should proceed.
> We were thinking about manually applying these events (via "event
> apply") or, failing that, erasing this problematic event with
> "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode
> 1101069090357". Is this a good idea? We would rather not rebuild the
> entire metadata pool if we can avoid it (once was enough for us), as
> this cluster has ~9 PB of data on it.
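Before making any journal edits at all, please export a raw copy of the journal so whatever you do can be rolled back and so the events can be inspected offline. A rough sketch of the order I'd try things in, assuming rank 0 of ceph_backup (treat this as a sketch and check it against the CephFS disaster-recovery documentation before running anything destructive):

    # take a full raw backup of the journal first (can be re-imported later)
    cephfs-journal-tool --rank=ceph_backup:0 journal export backup.bin

    # replay recoverable dentries into the metadata store without truncating anything
    cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary

    # only as a last resort, splice out the problematic events (your command)
    cephfs-journal-tool --rank=ceph_backup:0 event splice --inode 1101069090357

The exported backup.bin can be brought back with "cephfs-journal-tool --rank=ceph_backup:0 journal import backup.bin" if the splice makes things worse, which is far less painful than a full metadata rebuild.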
> Kindest regards,
>
> Ivan Clayson
>
> --
> Ivan Clayson
> -----------------
> Scientific Computing Officer
> Room 2N249
> Structural Studies
> MRC Laboratory of Molecular Biology
> Francis Crick Ave, Cambridge
> CB2 0QH
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx