Hello!

Our CephFS MDS cluster consists of 3 ranks. We had a minor issue with the network that Ceph runs on, and after that CephFS became unavailable:
- ranks 1 and 2 are stuck in 'rejoin'
- rank 0 can't get past the 'resolve' state and keeps getting blacklisted

I checked the logs (with debug_mds 5/5) on the rank 0 MDS server and found the following: it goes through 'replay' fine, then 'resolve' starts and the log gets flooded with messages like

   -18> 2020-03-04 16:59:56.934 7f77445f0700 5 mds.0.log _submit_thread 442443596224~41 : EImportFinish 0x30000412462 failed
   -17> 2020-03-04 16:59:56.950 7f77445f0700 5 mds.0.log _submit_thread 442443596285~41 : EImportFinish 0x3000041246c failed
   -16> 2020-03-04 16:59:56.966 7f77445f0700 5 mds.0.log _submit_thread 442443596346~41 : EImportFinish 0x3000041247b failed
   -15> 2020-03-04 16:59:56.983 7f77445f0700 5 mds.0.log _submit_thread 442443596407~41 : EImportFinish 0x30000412485 failed

Then messages about heartbeat errors start showing up in between:

   -3210> 2020-03-04 16:59:04.079 7f77485f8700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
   -3209> 2020-03-04 16:59:04.079 7f77485f8700 0 mds.beacon.ceph-server11.ibnet Skipping beacon heartbeat to monitors (last acked 8.00204s ago); MDS internal heartbeat is not healthy!

The 'flood' ends with these messages:

   -14> 2020-03-04 16:59:57.001 7f77455f2700 -1 mds.0.journaler.mdlog(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
   -13> 2020-03-04 16:59:57.001 7f77455f2700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown

after which the MDS server gets blacklisted and becomes a 'standby'. And then the same scenario happens to the standby.

Also, in my attempt to recover the fs I managed to make it worse. I executed

   cephfs-table-tool 0 reset session

and now the MDS daemon crashes at 'replay' with the following error:

   -2> 2020-03-04 22:00:54.228 7f4ca6e44700 -1 log_channel(cluster) log [ERR] : error replaying open sessions(1) sessionmap v 7348424 table 0
   -1> 2020-03-04 22:00:54.229 7f4ca6e44700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/SessionMap.cc: In function 'void SessionMap::replay_open_sessions(version_t, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&)' thread 7f4ca6e44700 time 2020-03-04 22:00:54.229427
   /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/SessionMap.cc: 750: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

   ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f4cb7913ac2]
   2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f4cb7913c90]
   3: (()+0x3b618b) [0x55b43860518b]
   4: (EImportStart::replay(MDSRank*)+0x4a8) [0x55b4386805f8]
   5: (MDLog::_replay_thread()+0x8ee) [0x55b43861b3ae]
   6: (MDLog::ReplayThread::entry()+0xd) [0x55b43838fecd]
   7: (()+0x7e25) [0x7f4cb57efe25]
   8: (clone()+0x6d) [0x7f4cb46b234d]

   0> 2020-03-04 22:00:54.230 7f4ca6e44700 -1 *** Caught signal (Aborted) **
   in thread 7f4ca6e44700 thread_name:md_log_replay

About the setup:
- ceph version 14.2.4
- OS: CentOS 7.4, kernel 3.10.0-693.5.2.el7.x86_64

Any help would be greatly appreciated!
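A couple of things I am considering trying, in case anyone can confirm they are sane.

One is raising the beacon grace so the monitors stop failing rank 0 while it is still grinding through 'resolve'. My understanding is that mds_beacon_grace also drives the internal heartbeat_map timeout of 15 seen above, but I am not sure it is the right knob:

   # raise the grace so a busy MDS is not marked laggy and blacklisted mid-resolve (not applied yet)
   ceph config set mds mds_beacon_grace 600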
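The other concerns the assert itself: as far as I can tell from the backtrace, replaying the EImportStart entry only tolerates sessions missing from the (now reset) session table when mds_wipe_sessions is set. Would something like this be a reasonable way to get past 'replay', or will it just lose even more state?

   # supposedly lets replay skip the sessions that are gone after the table reset -- not run yet
   ceph config set mds mds_wipe_sessions true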
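Also, before touching anything else, I plan to export the journals of all three ranks so there is at least a raw backup to fall back on. Something along these lines (assuming the --rank syntax is right for 14.2.4, with <fsname> standing in for our filesystem name):

   # keep a copy of each rank's journal before any further recovery attempts
   cephfs-journal-tool --rank=<fsname>:0 journal export /root/mds.0.journal.bin
   cephfs-journal-tool --rank=<fsname>:1 journal export /root/mds.1.journal.bin
   cephfs-journal-tool --rank=<fsname>:2 journal export /root/mds.2.journal.bin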
Best regards,
Anastasia Belyaeva