Hello!

Our CephFS MDS cluster consists of 3 ranks. We had a minor issue with the network that Ceph runs on, and after that CephFS became unavailable:
- ranks 1 and 2 are stuck in 'rejoin'
- rank 0 can't get past the 'resolve' state and keeps getting blacklisted

I checked the logs (with debug_mds 5/5) on the rank 0 MDS server and found the following: it goes through 'replay' fine, then 'resolve' starts and the log gets flooded with messages like

   -18> 2020-03-04 16:59:56.934 7f77445f0700 5 mds.0.log _submit_thread 442443596224~41 : EImportFinish 0x30000412462 failed
   -17> 2020-03-04 16:59:56.950 7f77445f0700 5 mds.0.log _submit_thread 442443596285~41 : EImportFinish 0x3000041246c failed
   -16> 2020-03-04 16:59:56.966 7f77445f0700 5 mds.0.log _submit_thread 442443596346~41 : EImportFinish 0x3000041247b failed
   -15> 2020-03-04 16:59:56.983 7f77445f0700 5 mds.0.log _submit_thread 442443596407~41 : EImportFinish 0x30000412485 failed

Then messages about heartbeat errors start showing up in between:

   -3210> 2020-03-04 16:59:04.079 7f77485f8700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
   -3209> 2020-03-04 16:59:04.079 7f77485f8700 0 mds.beacon.ceph-server11.ibnet Skipping beacon heartbeat to monitors (last acked 8.00204s ago); MDS internal heartbeat is not healthy!

The 'flood' ends with these messages:

   -14> 2020-03-04 16:59:57.001 7f77455f2700 -1 mds.0.journaler.mdlog(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
   -13> 2020-03-04 16:59:57.001 7f77455f2700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown

after which the MDS server gets blacklisted and becomes a 'standby'. And then the same scenario happens to the standby.

Also, in my attempt to recover the fs I managed to make it worse. I executed

   cephfs-table-tool 0 reset session

and now the MDS daemon crashes at 'replay' with the following error:

   -2> 2020-03-04 22:00:54.228 7f4ca6e44700 -1 log_channel(cluster) log [ERR] : error replaying open sessions(1) sessionmap v 7348424 table 0
   -1> 2020-03-04 22:00:54.229 7f4ca6e44700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/SessionMap.cc: In function 'void SessionMap::replay_open_sessions(version_t, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&)' thread 7f4ca6e44700 time 2020-03-04 22:00:54.229427
   /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/SessionMap.cc: 750: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

   ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f4cb7913ac2]
   2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f4cb7913c90]
   3: (()+0x3b618b) [0x55b43860518b]
   4: (EImportStart::replay(MDSRank*)+0x4a8) [0x55b4386805f8]
   5: (MDLog::_replay_thread()+0x8ee) [0x55b43861b3ae]
   6: (MDLog::ReplayThread::entry()+0xd) [0x55b43838fecd]
   7: (()+0x7e25) [0x7f4cb57efe25]
   8: (clone()+0x6d) [0x7f4cb46b234d]

   0> 2020-03-04 22:00:54.230 7f4ca6e44700 -1 *** Caught signal (Aborted) **
   in thread 7f4ca6e44700 thread_name:md_log_replay

About the setup:
- ceph version 14.2.4
- OS: CentOS 7.4, kernel 3.10.0-693.5.2.el7.x86_64

Any help would be greatly appreciated!
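A couple of things I am considering trying, in case anyone can confirm they are sane.

One is raising the beacon grace so the monitors stop failing rank 0 while it is still grinding through 'resolve'. My understanding is that mds_beacon_grace also drives the internal heartbeat_map timeout of 15 seen above, but I am not sure it is the right knob:

   # raise the grace so a busy MDS is not marked laggy and blacklisted mid-resolve (not applied yet)
   ceph config set mds mds_beacon_grace 600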
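The other concerns the assert itself: as far as I can tell from the backtrace, replaying the EImportStart entry only tolerates sessions missing from the (now reset) session table when mds_wipe_sessions is set. Would something like this be a reasonable way to get past 'replay', or will it just lose even more state?

   # supposedly lets replay skip the sessions that are gone after the table reset -- not run yet
   ceph config set mds mds_wipe_sessions true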
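Also, before touching anything else, I plan to export the journals of all three ranks so there is at least a raw backup to fall back on. Something along these lines (assuming the --rank syntax is right for 14.2.4, with <fsname> standing in for our filesystem name):

   # keep a copy of each rank's journal before any further recovery attempts
   cephfs-journal-tool --rank=<fsname>:0 journal export /root/mds.0.journal.bin
   cephfs-journal-tool --rank=<fsname>:1 journal export /root/mds.1.journal.bin
   cephfs-journal-tool --rank=<fsname>:2 journal export /root/mds.2.journal.bin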
Best regards,
Anastasia Belyaeva