I managed to recover my filesystem. Both of the following failed:

    cephfs-journal-tool journal export
    cephfs-journal-tool event recover_dentries summary

But truncating the journal and following some of the instructions in https://people.redhat.com/bhubbard/nature/default/cephfs/disaster-recovery-experts/ helped me get the mds up. Then I scrubbed and repaired the filesystem, and I “believe” I’m back in business.

What is weird, though, is that an assert failed, as shown in the stack dump below. Was that a legitimate assertion that indicates a bigger issue, or was it a false assertion?

Also, I understand that the metadata itself is sitting on the disk, but it looks like a single point of failure. What’s the logic behind having a simple metadata location, but multiple mds servers?

Thanks!

George

On Sep 24, 2024, at 5:55 AM, Eugen Block <eblock@xxxxxx> wrote:

Hi,

I would probably start by inspecting the journal with the cephfs-journal-tool [0]:

    cephfs-journal-tool [--rank=<fs_name>:{mds-rank|all}] journal inspect

And it could be helpful to have the logs prior to the assert.

[0] https://docs.ceph.com/en/latest/cephfs/cephfs-journal-tool/#example-journal-inspect

Quoting "Kyriazis, George" <george.kyriazis@xxxxxxxxx>:

Hello ceph users,

I am in the unfortunate situation of having a status of “1 mds daemon damaged”.
Looking at the logs, I see that the daemon died with an assert as follows:

./src/osdc/Journaler.cc: 1368: FAILED ceph_assert(trim_to > trimming_pos)

 ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x73a83189d7d9]
 2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
 3: (Journaler::_trim()+0x671) [0x57235caa70b1]
 4: (Journaler::_finish_write_head(int, Journaler::Header&, C_OnFinisher*)+0x171) [0x57235caaa8f1]
 5: (Context::complete(int)+0x9) [0x57235c716849]
 6: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
 7: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
 8: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]

     0> 2024-09-23T14:10:26.490-0500 73a822c006c0 -1 *** Caught signal (Aborted) **
 in thread 73a822c006c0 thread_name:MR_Finisher

 ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x73a83105b050]
 2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x73a8310a9e2c]
 3: gsignal()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185) [0x73a83189d834]
 6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
 7: (Journaler::_trim()+0x671) [0x57235caa70b1]
 8: (Journaler::_finish_write_head(int, Journaler::Header&, C_OnFinisher*)+0x171) [0x57235caaa8f1]
 9: (Context::complete(int)+0x9) [0x57235c716849]
 10: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
 11: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
 12: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

As listed above, I am running 18.2.2 on a Proxmox cluster with a hybrid hdd/ssd setup and 2 cephfs filesystems. The mds responsible for the hdd filesystem is the one that died.
Output of ceph -s follows:

root@vis-mgmt:~/bin# ceph -s
  cluster:
    id:     ec2c9542-dc1b-4af6-9f21-0adbcabb9452
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            5 pgs not scrubbed in time
            1 daemons have recently crashed

  services:
    mon: 5 daemons, quorum vis-hsw-01,vis-skx-01,vis-clx-15,vis-clx-04,vis-icx-00 (age 6m)
    mgr: vis-hsw-02(active, since 13d), standbys: vis-skx-02, vis-hsw-04, vis-clx-08, vis-clx-02
    mds: 1/2 daemons up, 5 standby
    osd: 97 osds: 97 up (since 3h), 97 in (since 4d)

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   14 pools, 1961 pgs
    objects: 223.70M objects, 304 TiB
    usage:   805 TiB used, 383 TiB / 1.2 PiB avail
    pgs:     1948 active+clean
             9    active+clean+scrubbing+deep
             4    active+clean+scrubbing

  io:
    client:   86 KiB/s rd, 5.5 MiB/s wr, 64 op/s rd, 26 op/s wr

I tried restarting all the mds daemons, but they are all marked as “standby”. I also tried restarting all the mons and then the mds daemons again, but that didn’t help.

Any help is much appreciated!

Thank you!

George
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
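
For readers landing on this thread later: the recovery path George describes at the top (journal export, recover_dentries, truncating the journal, bringing the mds back, then scrub and repair) might look roughly like the sequence below. This is only a dry-run sketch that prints the commands rather than running them. The filesystem name `myfs`, rank `0`, and the file name `backup.bin` are hypothetical placeholders. Reading "truncating the journal" as `journal reset`, and using `ceph mds repaired` to clear the damaged flag, are assumptions; the thread does not say which exact commands were used. Check the disaster-recovery documentation for your release before running any of them, since a journal reset discards metadata updates and some releases may require an extra confirmation flag.

```shell
#!/bin/sh
# Dry-run sketch: prints a possible recovery sequence, executes nothing.
# The arguments (filesystem name, rank) are hypothetical placeholders.
recovery_plan() {
    fs="$1"
    rank="$2"
    role="${fs}:${rank}"

    # 1. Back up the journal before touching it (this step failed in the thread).
    printf '%s\n' "cephfs-journal-tool --rank=${role} journal export backup.bin"
    # 2. Try to salvage dentries from the journal (also failed in the thread).
    printf '%s\n' "cephfs-journal-tool --rank=${role} event recover_dentries summary"
    # 3. ASSUMPTION: "truncating the journal" means a journal reset.
    printf '%s\n' "cephfs-journal-tool --rank=${role} journal reset"
    # 4. ASSUMPTION: clear the "damaged" flag so a standby can take the rank.
    printf '%s\n' "ceph mds repaired ${role}"
    # 5. Scrub and repair the whole tree once the mds is active again.
    printf '%s\n' "ceph tell mds.${role} scrub start / recursive,repair"
}

recovery_plan myfs 0
```

Printing the plan first makes it easy to review each destructive step (the journal reset in particular) before copying it into a shell on the cluster.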