Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

Dhairya Parmar <dparmar@xxxxxxxxxx> · Tue, 25 Jun 2024 23:48:20 +0530

On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx> wrote:

> Hi Dhairya,
>
> Thank you for your rapid reply. I tried recovering the dentries for the
> file just before the crash I mentioned before and then splicing the
> transactions from the journal which seemed to remove that issue for that
> inode but resulted in the MDS crashing on the next inode in the journal
> when performing replay.
>
The MDS delegates a range of preallocated inodes (in form of a set -
interval_set<inodeno_t> preallocated_inos) to the clients, so it can be one
inode that is untracked or some inodes from the range or in worst case
scenario - ALL, and this is something that even the `cephfs-journal-tool`
would not be able to tell (since we're talking about MDS internals which
aren't exposed to such tools). That is the reason why you see "MDS crashing
on the next inode in the journal when performing replay".

An option could be to expose the inode set to some tool or asok cmd to
identify such inodes ranges, which needs to be discussed. For now, we're
trying to address this in [0], you can follow the discussion there.

[0] https://tracker.ceph.com/issues/66251

> Removing all the transactions involving the directory housing the files
> that seemed to cause these crashes from the journal only caused the MDS to
> fail to even start replay.
>
I've rolled back our journal to our original version when the crash first
> happened and the entire MDS log for the crash can be found here:
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
>
Awesome, this would help us a ton. Apart from this, would it be possible to
send us client logs?

> Please let us know if you would like any other logs file as we can easily
> induce this crash.
>
Since you can easily induce the crash, can you share the reproducer please
i.e. what all action you take in order to hit this?

> Kindest regards,
>
> Ivan
> On 25/06/2024 09:58, Dhairya Parmar wrote:
>
> CAUTION: This email originated from outside of the LMB:
> *.-dparmar@xxxxxxxxxx-.*
> Do not click links or open attachments unless you recognize the sender and
> know the content is safe.
> If you think this is a phishing email, please forward it to
> phishing@xxxxxxxxxxxxxxxxx
>
>
> --
> Hi Ivan,
>
> This looks to be similar to the issue [0] that we're already addressing at
> [1]. So basically there is some out-of-sync event that led the client to
> make use of the inodes that MDS wasn't aware of/isn't tracking and hence
> the crash. It'd be really helpful if you can provide us more logs.
>
> CC @Rishabh Dave <ridave@xxxxxxxxxx> @Venky Shankar <vshankar@xxxxxxxxxx> @Patrick
> Donnelly <pdonnell@xxxxxxxxxx> @Xiubo Li <xiubli@xxxxxxxxxx>
>
> [0] https://tracker.ceph.com/issues/61009
> [1] https://tracker.ceph.com/issues/66251
> --
> *Dhairya Parmar*
>
> Associate Software Engineer, CephFS
>
> <https://www.redhat.com/>IBM, Inc.
>
> On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson <ivan@xxxxxxxxxxxxxxxxx>
> wrote:
>
>> Hello,
>>
>> We have been experiencing a serious issue with our CephFS backup cluster
>> running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel
>> (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem are
>> constantly in a "replay" or "replay(laggy)" state and keep crashing.
>>
>> We have a single MDS filesystem called "ceph_backup" with 2 standby
>> MDSes along with a 2nd unused filesystem "ceph_archive" (this holds
>> little to no data) where we are using our "ceph_backup" filesystem to
>> backup our data and this is the one which is currently broken. The Ceph
>> health outputs currently are:
>>
>>     root@pebbles-s1 14:05 [~]: ceph -s
>>        cluster:
>>          id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>>          health: HEALTH_WARN
>>                  1 filesystem is degraded
>>                  insufficient standby MDS daemons available
>>                  1319 pgs not deep-scrubbed in time
>>                  1054 pgs not scrubbed in time
>>
>>        services:
>>          mon: 4 daemons, quorum
>>     pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
>>          mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4,
>>     pebbles-s3, pebbles-s1
>>          mds: 2/2 daemons up
>>          osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37
>>     remapped pgs
>>
>>        data:
>>          volumes: 1/2 healthy, 1 recovering
>>          pools:   7 pools, 2177 pgs
>>          objects: 3.55G objects, 7.0 PiB
>>          usage:   8.9 PiB used, 14 PiB / 23 PiB avail
>>          pgs:     83133528/30006841533 objects misplaced (0.277%)
>>                   2090 active+clean
>>                   47   active+clean+scrubbing+deep
>>                   29   active+remapped+backfilling
>>                   8    active+remapped+backfill_wait
>>                   2    active+clean+scrubbing
>>                   1    active+clean+snaptrim
>>
>>        io:
>>          recovery: 1.9 GiB/s, 719 objects/s
>>
>>     root@pebbles-s1 14:09 [~]: ceph fs status
>>     ceph_backup - 0 clients
>>     ===========
>>     RANK      STATE         MDS      ACTIVITY   DNS    INOS   DIRS CAPS
>>       0    replay(laggy)  pebbles-s3               0      0 0      0
>>              POOL            TYPE     USED  AVAIL
>>         mds_backup_fs      metadata  1255G  2780G
>>     ec82_primary_fs_data    data       0   2780G
>>            ec82pool          data    8442T  3044T
>>     ceph_archive - 2 clients
>>     ============
>>     RANK  STATE      MDS         ACTIVITY     DNS    INOS   DIRS CAPS
>>       0    active  pebbles-s2  Reqs:    0 /s  13.4k  7105    118 2
>>              POOL            TYPE     USED  AVAIL
>>         mds_archive_fs     metadata  5184M  2780G
>>     ec83_primary_fs_data    data       0   2780G
>>            ec83pool          data     138T  2767T
>>     MDS version: ceph version 17.2.7
>>     (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>>     root@pebbles-s1 14:09 [~]: ceph health detail | head
>>     HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS
>>     daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not
>>     scrubbed in time
>>     [WRN] FS_DEGRADED: 1 filesystem is degraded
>>          fs ceph_backup is degraded
>>     [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons
>>     available
>>          have 0; want 1 more
>>
>> When our cluster first ran after a reboot, Ceph ran through the 2
>> standby MDSes, crashing them all, until it reached the final MDS and is
>> now stuck in this "replay(laggy)" state. Putting our MDSes into
>> debugging mode, we can see that this MDS crashed when replaying the
>> journal for a particular inode (this is the same for all the MDSes and
>> they all crash on the same object):
>>
>>     ...
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>     EMetaBlob.replay for [521,head] had [inode 0x1005ba89481
>>     [...539,head]
>>
>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
>>     auth fragtree_t(*^2 00*^3 00000*^
>>     4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4
>>     01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4
>>     01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3
>>     10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22
>>     T05:41:10.213700+0100 1281276=1281276+0) n(v12
>>     rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1)
>>     old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>     EMetaBlob.replay dir 0x1005ba89481.011011000*
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>     EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000*
>>
>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
>>     [2,head] auth v=436385 cv=0/0 state=107374182
>>     4 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12
>>     rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0)
>>     hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>     EMetaBlob.replay added (full) [dentry
>>
>> #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fracti
>>     ons_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion
>>     lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1
>>     0x557710444500]
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>     EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head]
>>
>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_
>>     005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0
>>     1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680]
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10
>>     mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>     EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head]
>>
>> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_2762
>>     6130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth
>>     v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
>>     cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680]
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>     EMetaBlob.replay inotable tablev 3112837 <= table 3112837
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
>>     EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc
>>     [] used 0x1005cd4fe35
>>     2024-06-24T13:44:55.563+0100 7f8811c40700 -1
>>
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>>     I
>>     n function 'void interval_set<T, C>::erase(T, T,
>>     std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
>>     thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
>>
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h:
>>     568: FAILED ceph_assert(p->first <= start)
>>
>>       ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2)
>>     quincy (stable)
>>       1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>     const*)+0x135) [0x7f8821e814a3]
>>       2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
>>       3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t,
>>     std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
>>       4: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
>>     MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
>>       5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
>>       6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
>>       7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
>>       8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
>>       9: clone()
>>
>> I've only included a short section of the crash (this is the first trace
>> in the log with regards to the crash with a 10/20 debug_mds option). We
>> tried deleting the 0x1005cd4fe35 object from the object store using the
>> "rados" command but this did not allow our MDS to successfully replay.
>>
>>  From my understanding the journal seems okay as we didn't run out of
>> space for example on our metadata pool and "cephfs-journal-tool journal
>> inspect" doesn't seem to think there is any damage:
>>
>>     root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0
>>     journal inspect
>>     Overall journal integrity: OK
>>     root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0
>>     event get --inode 1101069090357 summary
>>     Events by type:
>>        OPEN: 1
>>        UPDATE: 3
>>     Errors: 0
>>     root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0
>>     event get --inode 1101069090357 list
>>     2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE:  (openc)
>>
>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>     2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE:
>>     (check_inode_max_size)
>>
>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>
>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>     2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN:  ()
>>
>> FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>     2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE:  (rename)
>>
>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
>>
>> test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
>>
>> I was wondering whether anyone had any advice for us on how we should
>> proceed forward? We were thinking about manually applying these events
>> (via "event apply") where failing that we could erase this problematic
>> event with "cephfs-journal-tool --rank=ceph_backup:0 event splice
>> --inode 1101069090357". Is this a good idea? We would rather not rebuild
>> the entire metadata pool if we could avoid it (once was enough for us)
>> as this cluster has ~9 PB of data on it.
>>
>> Kindest regards,
>>
>> Ivan Clayson
>>
>> --
>> Ivan Clayson
>> -----------------
>> Scientific Computing Officer
>> Room 2N249
>> Structural Studies
>> MRC Laboratory of Molecular Biology
>> Francis Crick Ave, Cambridge
>> CB2 0QH
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
> --
> Ivan Clayson
> -----------------
> Scientific Computing Officer
> Room 2N249
> Structural Studies
> MRC Laboratory of Molecular Biology
> Francis Crick Ave, Cambridge
> CB2 0QH
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx