Hi,

I'm struggling with a Ceph cluster that doesn't want to finish recovery, and I suspect there are multiple issues at the same time. Let's start with the most obvious one: the MDS daemons are crashing, which blocks me from mounting my CephFS.

The cluster was deployed on Debian Buster and later upgraded to Bullseye. There are 4 nodes, each with 64 OSDs of 12 TB. I can't tell which Ceph version it is running, because different services report different versions (another clue about what could be wrong): some 16.2.4, 16.2.5, 16.2.13, even some 17.2.6, and I think I saw some 15.x.x on the OSDs (the commands I have been using to check are in the P.S. below).

I have this log from the MDS service:

> journalctl -xe -u ceph-938952d4-7775-11eb-9f42-bc97e19a216a@xxxxxxxxxxxxxxx-storage-02.oaelhv.service
[...]
 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUIL>
 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f850b1ed59c]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7f850b1ed7b6]
 3: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0x1215) [0x55768a6ad6b5]
 4: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0x105) [0x55768a6ad965]
 5: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0xc39) [0x55768a78fb09]
 6: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x92d) [0x55768a76377d]
 7: (MDSContext::complete(int)+0x56) [0x55768a8c0d46]
 8: (MDSIOContextBase::complete(int)+0xa3) [0x55768a8c1073]
 9: (Filer::C_Probe::finish(int)+0xb5) [0x55768a974af5]
 10: (Context::complete(int)+0xd) [0x55768a5b4b6d]
 11: (Finisher::finisher_thread_entry()+0x1a5) [0x7f850b28c9d5]
 12: /lib64/libpthread.so.0(+0x814a) [0x7f8509f8f14a]
 13: clone()

 debug      0> 2023-09-11T12:55:25.765+0000 7f84fc580700 -1 *** Caught signal (Aborted) **
 in thread 7f84fc580700 thread_name:MR_Finisher

 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7f8509f99b20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f850b1ed5ed]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7f850b1ed7b6]
 6: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0x1215) [0x55768a6ad6b5]
 7: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0x105) [0x55768a6ad965]
 8: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0xc39) [0x55768a78fb09]
 9: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x92d) [0x55768a76377d]
 10: (MDSContext::complete(int)+0x56) [0x55768a8c0d46]
 11: (MDSIOContextBase::complete(int)+0xa3) [0x55768a8c1073]
 12: (Filer::C_Probe::finish(int)+0xb5) [0x55768a974af5]
 13: (Context::complete(int)+0xd) [0x55768a5b4b6d]
 14: (Finisher::finisher_thread_entry()+0x1a5) [0x7f850b28c9d5]
 15: /lib64/libpthread.so.0(+0x814a) [0x7f8509f8f14a]
 16: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
 --- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
 --- pthread ID / name mapping for recent threads ---
  140209120859904 /
  140209129252608 / md_submit
  140209137645312 /
  140209146038016 / MR_Finisher
  140209162823424 / PQ_Finisher
  140209213179648 / ceph-mds
  140209229965056 / safe_timer
  140209246750464 / ms_dispatch
  140209263535872 / io_context_pool
  140209280321280 / admin_socket
  140209288713984 / msgr-worker-2
  140209297106688 / msgr-worker-1
  140209305499392 / msgr-worker-0
  140209549580160 / ceph-mds
  max_recent     10000
  max_new        10000
  log_file /var/lib/ceph/crash/2023-09-11T12:55:25.769044Z_5229ea26-9bea-4c49-960a-4501b227e545/log
 --- end dump of recent events ---

 ceph-938952d4-7775-11eb-9f42-bc97e19a216a@xxxxxxxxxxxxxxx-storage-02.oaelhv.service: Main process exited, code=exited, status=134/n/a

I tried the "Advanced: Metadata Repair Tools" procedure from https://docs.ceph.com/en/nautilus/cephfs/disaster-recovery-experts/. The MDS reports "recovery successful" in the logs, then "has been put in standby", and then it crashes again a few seconds or minutes later.

I don't know what to do next without breaking everything.

Thank you for your help. Let me know if you need any other information.

Have a nice day,
Sasha
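P.S. In case the exact commands matter: the steps I was following from the disaster-recovery-experts page look roughly like this (the file system name "cephfs" and rank 0 are placeholders for my setup, so please read this as a sketch rather than an exact transcript of what I ran):

    # back up the MDS journal before doing anything destructive
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

    # recovery steps from the "disaster-recovery-experts" page
    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
    cephfs-journal-tool --rank=cephfs:0 journal reset

And this is how I have been looking at the version mix and the crash reports, from "cephadm shell" on one of the nodes (<crash-id> is a placeholder):

    ceph versions                 # daemon counts per running version
    ceph orch ps                  # version/image per individual daemon
    ceph fs status                # MDS ranks and their current states
    ceph crash ls                 # recently recorded crashes
    ceph crash info <crash-id>    # full backtrace for one crash

I'm happy to paste the output of any of these if that helps.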