My Ceph cluster's MDS daemons aren't working. They start up, go into a reconnect state followed by rejoin, then they crash and the cycle repeats. I am running containerized Octopus and currently have two file systems. The one that is having problems had to be rebuilt, and it was working well enough that I had copied about 3/4 of the data into a new file system before this issue started. I am happy to post more information from the logs if anyone is interested (the commands I'm using to capture more verbose logs are at the end of this message), but in the interest of keeping this readable I will try to show just the relevant parts.

It looks to me like the replay phase succeeds and the crash happens when it tries to reclaim the sessions. Can this be skipped somehow? What would happen if the sessions weren't reclaimed? What are my next steps to debug this? Thanks in advance!

2022-01-04T14:13:22.212+0000 7f36c175c700  1 mds.0.62859 handle_mds_map i am now mds.0.62859
2022-01-04T14:13:22.212+0000 7f36c175c700  1 mds.0.62859 handle_mds_map state change up:rejoin --> up:clientreplay
2022-01-04T14:13:22.212+0000 7f36c175c700  1 mds.0.62859 recovery_done -- successful recovery!
2022-01-04T14:13:22.222+0000 7f36c175c700  1 mds.0.62859 clientreplay_start
2022-01-04T14:13:22.222+0000 7f36c175c700  1 mds.0.62859 still have 0 requests need to be replayed, 18 sessions need to be reclaimed
2022-01-04T14:13:22.427+0000 7f36bc752700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7f36bc752700 time 2022-01-04T14:13:22.427081+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/mds/CDir.cc: 762: FAILED ceph_assert(dn->get_linkage()->is_null())

 ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f36c9580d10]
 2: (()+0x27af2a) [0x7f36c9580f2a]
 3: (CDir::try_remove_dentries_for_stray()+0x269) [0x56437c970469]
 4: (MDCache::clear_dirty_bits_for_stray(CInode*)+0x1c7) [0x56437c8625e7]
 5: (StrayManager::_eval_stray(CDentry*)+0x3ba) [0x56437c8d123a]
 6: (StrayManager::eval_stray(CDentry*)+0x1f) [0x56437c8d1baf]
 7: (MDCache::scan_stray_dir(dirfrag_t)+0x1dc) [0x56437c83f64c]
 8: (MDCache::populate_mydir()+0x510) [0x56437c83fd30]
 9: (MDCache::open_root()+0xd8) [0x56437c8523f8]
 10: (C_MDS_RetryOpenRoot::finish(int)+0x27) [0x56437c8c9347]
 11: (MDSContext::complete(int)+0x56) [0x56437ca182a6]
 12: (MDSRank::_advance_queues()+0x8c) [0x56437c74f6dc]
 13: (MDSRank::ProgressThread::entry()+0xc5) [0x56437c74fde5]
 14: (()+0x814a) [0x7f36c815d14a]
 15: (clone()+0x43) [0x7f36c6c7df23]

2022-01-04T14:13:22.429+0000 7f36bc752700 -1 *** Caught signal (Aborted) **
 in thread 7f36bc752700 thread_name:mds_rank_progr

 ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
 1: (()+0x12b20) [0x7f36c8167b20]
 2: (gsignal()+0x10f) [0x7f36c6bb87ff]
 3: (abort()+0x127) [0x7f36c6ba2c35]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f36c9580d61]
 5: (()+0x27af2a) [0x7f36c9580f2a]
 6: (CDir::try_remove_dentries_for_stray()+0x269) [0x56437c970469]
 7: (MDCache::clear_dirty_bits_for_stray(CInode*)+0x1c7) [0x56437c8625e7]
 8: (StrayManager::_eval_stray(CDentry*)+0x3ba) [0x56437c8d123a]
 9: (StrayManager::eval_stray(CDentry*)+0x1f) [0x56437c8d1baf]
 10: (MDCache::scan_stray_dir(dirfrag_t)+0x1dc) [0x56437c83f64c]
 11: (MDCache::populate_mydir()+0x510) [0x56437c83fd30]
 12: (MDCache::open_root()+0xd8) [0x56437c8523f8]
 13: (C_MDS_RetryOpenRoot::finish(int)+0x27) [0x56437c8c9347]
 14: (MDSContext::complete(int)+0x56) [0x56437ca182a6]
 15: (MDSRank::_advance_queues()+0x8c) [0x56437c74f6dc]
 16: (MDSRank::ProgressThread::entry()+0xc5) [0x56437c74fde5]
 17: (()+0x814a) [0x7f36c815d14a]
 18: (clone()+0x43) [0x7f36c6c7df23]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
 -9999> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x1000038fd64
 -9998> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x10000389f16
 -9997> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x200000187f5
 -9996> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x200000300e0
 -9995> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x20000018f69
 -9994> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x10000388b01

# ceph fs status
recovery-fs - 21 clients
===========
RANK    STATE               MDS              ACTIVITY    DNS    INOS
 0    reconnect  burnsmds.burns-6.mmxnkw              99.0k   9235
    POOL       TYPE      USED   AVAIL
  recovery   metadata   6021M   5303G
 burns.data    data      644T    787T

burnsfs - 1 clients
=======
RANK  STATE            MDS                ACTIVITY     DNS    INOS
 0    active  burnsmds.burns-2.hzqvvv  Reqs:    0 /s   209k   197k
 1    active  burnsmds.burns-1.btdfhp  Reqs:    0 /s   4420   1838
     POOL        TYPE      USED   AVAIL
 burnsfs.meta  metadata   8033M   5303G
 burnsfs.data    data      286T    787T

STANDBY MDS
burnsmds.burns-3.zliilk

MDS version: ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
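On the "next steps" question, this is roughly what I was planning to run to capture a more detailed log of the next crash. I'm assuming debug_mds 20 / debug_ms 1 are the right levels to turn up and that the crash module has been recording these aborts; please correct me if there is a better place to look.

# Turn up MDS logging before the next restart/crash cycle
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# After the MDS aborts, list the recorded crashes and dump the full backtrace
ceph crash ls
ceph crash info <crash-id>

# Turn logging back down once the crash has been captured
ceph config set mds debug_mds 1/5
ceph config set mds debug_ms 0

Does that sound like the right way to gather more information, or is there something else worth looking at first?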