Hi, I recently ran into an issue. I have a CephFS cluster on 14.2.11 with 5 active MDS and 5 standby-replay MDS. The metadata pool is on SSD and the data pool is on SATA. Two of the MDS daemons restart frequently, and the standby-replay MDS that take over get stuck in the replay and resolve states and never become active. What's wrong with my MDS?

*the restart MDS logs:*
-214> 2021-03-15 17:39:00.898 7fafe3456700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
-213> 2021-03-15 17:39:00.898 7fafe3456700 0 mds.beacon.aiceph-08 Skipping beacon heartbeat to monitors (last acked 54.511s ago); MDS internal heartbeat is not healthy!
-212> 2021-03-15 17:39:00.911 7fafdf44e700 5 mds.3.migrator export_finish [dir 0x2000e252701.010100000110* /some/dirs/ [2,head] auth{2=2} v=143307 cv=0/0 dir_auth=2,3 state=1073742850|frozentree|exporting f(v3086 m2021-03-12 04:21:43.383480 1303=1303+0) n(v3128 rc2021-03-12 04:21:43.384480 b119066284 1303=1303+0) hs=0+0,ss=0+0 | ptrwaiter=0 request=0 frozen=1 subtree=1 importing=0 replicated=1 waiter=0 authpin=0 tempexporting=1 0x55df67f36f00]
......
-204> 2021-03-15 17:39:01.398 7fafe3456700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
-203> 2021-03-15 17:39:01.398 7fafe3456700 0 mds.beacon.aiceph-08 Skipping beacon heartbeat to monitors (last acked 55.011s ago); MDS internal heartbeat is not healthy!
-202> 2021-03-15 17:39:01.401 7fafe5c5b700 1 mds.beacon.aiceph-08 MDS connection to Monitors appears to be laggy; 55.014s since last acked beacon
-201> 2021-03-15 17:39:01.401 7fafe5c5b700 5 mds.3.78900 laggy, deferring export_prep(0x2000e252701.110110011001*) v1
-200> 2021-03-15 17:39:01.402 7fafe3c57700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
-199> 2021-03-15 17:39:01.402 7fafe3c57700 1 mds.3.78900 skipping upkeep work because connection to Monitors appears laggy
-198> 2021-03-15 17:39:01.402 7fafe5c5b700 1 mds.3.objecter ms_handle_reset 0x55dead0ba800 session 0x55dd8944e160 osd.24
-197> 2021-03-15 17:39:01.402 7fafe5c5b700 4 mgrc ms_handle_reset ms_handle_reset con 0x55e09c44b400
-196> 2021-03-15 17:39:01.402 7fafe5c5b700 4 mgrc reconnect Terminating session with v2:172.25.64.235:6800/35103
-195> 2021-03-15 17:39:01.402 7fafe8eea700 10 monclient: get_auth_request con 0x55e15190f000 auth_method 0
-194> 2021-03-15 17:39:01.402 7fafe5c5b700 4 mgrc reconnect Starting new session with [v2:172.25.64.235:6800/35103,v1:172.25.64.235:6801/35103]
-193> 2021-03-15 17:39:01.402 7fafe86e9700 10 monclient: get_auth_request con 0x55dead0ba800 auth_method 0
-192> 2021-03-15 17:39:01.467 7fafe1452700 2 mds.3.cache Memory usage: total 60577760, rss 7823000, heap 315644, baseline 315644, 28365 / 2610243 inodes have caps, 33114 caps, 0.0126862 caps per inode
-191> 2021-03-15 17:39:01.467 7fafe5c5b700 4 mds.3.78900 handle_osd_map epoch 12458, 2 new blacklist entries
-190> 2021-03-15 17:39:01.467 7fafe5c5b700 10 monclient: _renew_subs
-189> 2021-03-15 17:39:01.467 7fafe5c5b700 10 monclient: _send_mon_message to mon.aiceph-03 at v2:172.25.72.42:3300/0
-188> 2021-03-15 17:39:01.467 7fafe5c5b700 4 mgrc ms_handle_reset ms_handle_reset con 0x55dead0ba800
-187> 2021-03-15 17:39:01.467 7fafe5c5b700 4 mgrc reconnect Terminating session with v2:172.25.64.235:6800/35103
-186> 2021-03-15 17:39:01.467 7fafe5c5b700 4 mgrc reconnect waiting to retry connect until 2021-03-15 17:39:02.403432
-185> 2021-03-15 17:39:01.467 7fafdf44e700 5 mds.3.migrator export_finish [dir 0x2000e252701.010100001110* /ai_root/NLP/Member/zhangpengshen/mode_repo/phoneme_recognizer/wav2letter_simple/4000h_npy/npy/ [2,head] auth{2=2} v=142958 cv=0/0 dir_auth=2,3 state=1073742850|frozentree|exporting f(v3086 m2021-03-12 04:21:26.970694 1333=1333+0) n(v3128 rc2021-03-12 04:21:26.970694 b121920160 1333=1333+0) hs=0+0,ss=0+0 | ptrwaiter=0 request=0 frozen=1 subtree=1 importing=0 replicated=1 waiter=0 authpin=0 tempexporting=1 0x55df671ebe00]
......
-177> 2021-03-15 17:39:01.898 7fafe3456700 5 mds.beacon.aiceph-08 Sending beacon up:active seq 1364089
-176> 2021-03-15 17:39:01.898 7fafe3456700 10 monclient: _send_mon_message to mon.aiceph-03 at v2:172.25.72.42:3300/0
......
-3> 2021-03-15 17:39:11.669 7fafdf44e700 5 mds.3.migrator export_finish [dir 0x2000e252701.010111101010* /some/dir/ [2,head] auth{2=2} v=143028 cv=143028/143028 dir_auth=2,3 state=1073742850|frozentree|exporting f(v3086 m2021-03-12 04:22:17.528035 1340=1340+0) n(v3128 rc2021-03-12 04:22:17.529035 b125945332 1340=1340+0) hs=0+0,ss=0+0 | ptrwaiter=0 request=0 frozen=1 subtree=1 importing=0 replicated=1 waiter=0 authpin=0 tempexporting=1 0x55e029a94500]
-2> 2021-03-15 17:39:11.728 7fafe5c5b700 1 mds.aiceph-08 Updating MDS map to version 199401 from mon.2
-1> 2021-03-15 17:39:11.728 7fafe5c5b700 1 mds.aiceph-08 Map removed me [mds.aiceph-08{3:977486} state up:active seq 1364038 export targets 0,1,2 addr [v2:172.25.72.47:6960/1073018517,v1:172.25.72.47:6961/1073018517]] from cluster; respawning! See cluster/monitor logs for details.
0> 2021-03-15 17:39:11.728 7fafe5c5b700 1 mds.aiceph-08 respawn!
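
The pattern I read in this log: the MDS internal heartbeat times out ('MDSRank' had timed out after 15), beacons to the monitors are skipped until the last acked one is about 55s old, and then the next MDS map removes rank 3 and the daemon respawns. I assume the "handle_osd_map ... 2 new blacklist entries" line is part of the same removal, but I am not sure. For reference, this is roughly how I have been checking the beacon grace, the in-flight MDS operations and the blacklist while it happens (just a rough sketch; mds.aiceph-08 is simply the daemon from the log above, and the daemon commands are run on the node hosting it):

  # effective beacon grace as seen by the affected daemon (via its admin socket)
  ceph daemon mds.aiceph-08 config get mds_beacon_grace
  # what the MDS is working on while the internal heartbeat is timing out
  ceph daemon mds.aiceph-08 ops
  ceph daemon mds.aiceph-08 dump_historic_ops
  # blacklist entries added around the time the rank is removed
  ceph osd blacklist ls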

*the resolve MDS logs:*
2021-03-15 17:39:12.427 7f6c0b139700 1 mds.aiceph-09 Updating MDS map to version 199401 from mon.0
2021-03-15 17:39:41.092 7f6c0b139700 1 mds.aiceph-09 Updating MDS map to version 199402 from mon.0
2021-03-15 17:39:41.111 7f6c0b139700 1 mds.aiceph-09 Updating MDS map to version 199403 from mon.0
2021-03-15 17:39:41.111 7f6c0b139700 1 mds.2.199403 handle_mds_map i am now mds.2.199403
2021-03-15 17:39:41.111 7f6c0b139700 1 mds.2.199403 handle_mds_map state change up:standby-replay --> up:replay
2021-03-15 17:39:41.715 7f6c0492c700 1 mds.2.199403 standby_replay_restart (final takeover pass)
2021-03-15 17:39:41.715 7f6c0492c700 1 mds.2.199403 opening purge_queue (async)
2021-03-15 17:39:41.715 7f6c0492c700 1 mds.2.199403 opening open_file_table (async)
2021-03-15 17:39:41.716 7f6c0492c700 1 mds.2.199403 Finished replaying journal
2021-03-15 17:39:41.716 7f6c0492c700 1 mds.2.199403 making mds journal writeable
2021-03-15 17:39:42.116 7f6c0b139700 1 mds.aiceph-09 Updating MDS map to version 199404 from mon.0
2021-03-15 17:39:42.116 7f6c0b139700 1 mds.2.199403 handle_mds_map i am now mds.2.199403
2021-03-15 17:39:42.116 7f6c0b139700 1 mds.2.199403 handle_mds_map state change up:replay --> up:resolve
2021-03-15 17:39:42.116 7f6c0b139700 1 mds.2.199403 resolve_start
2021-03-15 17:39:42.116 7f6c0b139700 1 mds.2.199403 reopen_log
2021-03-15 17:40:53.758 7f6c0b139700 1 mds.aiceph-09 Updating MDS map to version 199407 from mon.0
2021-03-15 17:41:09.478 7f6c0b139700 1 mds.aiceph-09 Updating MDS map to version 199408 from mon.0
2021-03-15 17:42:05.671 7f6c0b139700 1 mds.aiceph-09 Updating MDS map to version 199409 from mon.0
2021-03-15 17:42:15.704 7f6c0b139700 1 mds.aiceph-09 Updating MDS map to version 199410 from mon.0
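
On aiceph-09, rank 2 goes up:standby-replay --> up:replay --> up:resolve within about 30 seconds and then just sits in resolve while the map version keeps increasing. As far as I understand, a rank in up:resolve is waiting to exchange resolve information with all the other ranks, so it cannot make progress while another rank is still failed or replaying, but please correct me if that is wrong. This is what I run from another node to watch the rank states while it is stuck (a minimal sketch, nothing specific to this cluster):

  # which rank is in which state, and whether any rank is failed or damaged
  ceph fs status
  ceph fs dump | grep -E 'up:|failed|damaged'
  # any related health warnings
  ceph health detail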

*the replay MDS logs:*
2021-03-15 17:37:55.079 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199400 from mon.0
2021-03-15 17:38:52.763 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199401 from mon.0
2021-03-15 17:38:52.763 7f7a6df1c700 1 mds.3.199401 handle_mds_map i am now mds.3.199401
2021-03-15 17:38:52.763 7f7a6df1c700 1 mds.3.199401 handle_mds_map state change up:standby-replay --> up:replay
2021-03-15 17:39:21.428 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199402 from mon.0
2021-03-15 17:39:21.447 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199403 from mon.0
2021-03-15 17:39:22.452 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199404 from mon.0
2021-03-15 17:40:34.093 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199407 from mon.0
2021-03-15 17:40:49.813 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199408 from mon.0
2021-03-15 17:41:46.006 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199409 from mon.0
2021-03-15 17:41:56.039 7f7a6df1c700 1 mds.aiceph-05 Updating MDS map to version 199410 from mon.0

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx