My Ceph cluster's MDS daemons aren't working. They start up, go into a reconnect state followed by rejoin, then they crash and the cycle repeats. I am running containerized Octopus and currently have two file systems. The one that is having problems had to be rebuilt, and it was working well enough that I had copied about 3/4 of the data into a new file system before this issue started. I am happy to post more information from the logs if anyone is interested (the commands I'm using to capture more verbose logs are at the end of this message), but in the interest of keeping this readable I will try to show just the relevant parts.

It looks to me like the replay phase succeeds and the crash happens when it tries to reclaim the sessions. Can this be skipped somehow? What would happen if the sessions weren't reclaimed? What are my next steps to debug this? Thanks in advance!

2022-01-04T14:13:22.212+0000 7f36c175c700  1 mds.0.62859 handle_mds_map i am now mds.0.62859
2022-01-04T14:13:22.212+0000 7f36c175c700  1 mds.0.62859 handle_mds_map state change up:rejoin --> up:clientreplay
2022-01-04T14:13:22.212+0000 7f36c175c700  1 mds.0.62859 recovery_done -- successful recovery!
2022-01-04T14:13:22.222+0000 7f36c175c700  1 mds.0.62859 clientreplay_start
2022-01-04T14:13:22.222+0000 7f36c175c700  1 mds.0.62859 still have 0 requests need to be replayed, 18 sessions need to be reclaimed
2022-01-04T14:13:22.427+0000 7f36bc752700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7f36bc752700 time 2022-01-04T14:13:22.427081+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/mds/CDir.cc: 762: FAILED ceph_assert(dn->get_linkage()->is_null())

 ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f36c9580d10]
 2: (()+0x27af2a) [0x7f36c9580f2a]
 3: (CDir::try_remove_dentries_for_stray()+0x269) [0x56437c970469]
 4: (MDCache::clear_dirty_bits_for_stray(CInode*)+0x1c7) [0x56437c8625e7]
 5: (StrayManager::_eval_stray(CDentry*)+0x3ba) [0x56437c8d123a]
 6: (StrayManager::eval_stray(CDentry*)+0x1f) [0x56437c8d1baf]
 7: (MDCache::scan_stray_dir(dirfrag_t)+0x1dc) [0x56437c83f64c]
 8: (MDCache::populate_mydir()+0x510) [0x56437c83fd30]
 9: (MDCache::open_root()+0xd8) [0x56437c8523f8]
 10: (C_MDS_RetryOpenRoot::finish(int)+0x27) [0x56437c8c9347]
 11: (MDSContext::complete(int)+0x56) [0x56437ca182a6]
 12: (MDSRank::_advance_queues()+0x8c) [0x56437c74f6dc]
 13: (MDSRank::ProgressThread::entry()+0xc5) [0x56437c74fde5]
 14: (()+0x814a) [0x7f36c815d14a]
 15: (clone()+0x43) [0x7f36c6c7df23]

2022-01-04T14:13:22.429+0000 7f36bc752700 -1 *** Caught signal (Aborted) **
 in thread 7f36bc752700 thread_name:mds_rank_progr

 ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
 1: (()+0x12b20) [0x7f36c8167b20]
 2: (gsignal()+0x10f) [0x7f36c6bb87ff]
 3: (abort()+0x127) [0x7f36c6ba2c35]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f36c9580d61]
 5: (()+0x27af2a) [0x7f36c9580f2a]
 6: (CDir::try_remove_dentries_for_stray()+0x269) [0x56437c970469]
 7: (MDCache::clear_dirty_bits_for_stray(CInode*)+0x1c7) [0x56437c8625e7]
 8: (StrayManager::_eval_stray(CDentry*)+0x3ba) [0x56437c8d123a]
 9: (StrayManager::eval_stray(CDentry*)+0x1f) [0x56437c8d1baf]
 10: (MDCache::scan_stray_dir(dirfrag_t)+0x1dc) [0x56437c83f64c]
 11: (MDCache::populate_mydir()+0x510) [0x56437c83fd30]
 12: (MDCache::open_root()+0xd8) [0x56437c8523f8]
 13: (C_MDS_RetryOpenRoot::finish(int)+0x27) [0x56437c8c9347]
 14: (MDSContext::complete(int)+0x56) [0x56437ca182a6]
 15: (MDSRank::_advance_queues()+0x8c) [0x56437c74f6dc]
 16: (MDSRank::ProgressThread::entry()+0xc5) [0x56437c74fde5]
 17: (()+0x814a) [0x7f36c815d14a]
 18: (clone()+0x43) [0x7f36c6c7df23]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
 -9999> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x1000038fd64
 -9998> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x10000389f16
 -9997> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x200000187f5
 -9996> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x200000300e0
 -9995> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x20000018f69
 -9994> 2022-01-04T14:13:22.327+0000 7f36bc752700  4 mds.0.purge_queue push: pushing inode 0x10000388b01

# ceph fs status
recovery-fs - 21 clients
===========
RANK    STATE               MDS              ACTIVITY    DNS    INOS
 0    reconnect  burnsmds.burns-6.mmxnkw              99.0k   9235
    POOL       TYPE      USED   AVAIL
  recovery   metadata   6021M   5303G
 burns.data    data      644T    787T

burnsfs - 1 clients
=======
RANK  STATE            MDS                ACTIVITY     DNS    INOS
 0    active  burnsmds.burns-2.hzqvvv  Reqs:    0 /s   209k   197k
 1    active  burnsmds.burns-1.btdfhp  Reqs:    0 /s   4420   1838
     POOL        TYPE      USED   AVAIL
 burnsfs.meta  metadata   8033M   5303G
 burnsfs.data    data      286T    787T

STANDBY MDS
burnsmds.burns-3.zliilk

MDS version: ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
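On the "next steps" question, this is roughly what I was planning to run to capture a more detailed log of the next crash. I'm assuming debug_mds 20 / debug_ms 1 are the right levels to turn up and that the crash module has been recording these aborts; please correct me if there is a better place to look.

# Turn up MDS logging before the next restart/crash cycle
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# After the MDS aborts, list the recorded crashes and dump the full backtrace
ceph crash ls
ceph crash info <crash-id>

# Turn logging back down once the crash has been captured
ceph config set mds debug_mds 1/5
ceph config set mds debug_ms 0

Does that sound like the right way to gather more information, or is there something else worth looking at first?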