MDS stuck in replay/resolve states

Hi,

I recently ran into an issue. I have a CephFS cluster on 14.2.11
with 5 active MDS and 5 standby-replay MDS. The metadata pool is on
SSD and the data pool is on SATA. Two of the active MDS restart
frequently, and the standby-replay MDS that take over get stuck in
the replay and resolve states and never become active.
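
For reference, the multi-active layout above corresponds roughly to the
following (a sketch of the standard ceph CLI commands, assuming the
filesystem is named "cephfs"; not a transcript of exactly what I ran):

    ceph fs set cephfs max_mds 5                  # 5 active ranks
    ceph fs set cephfs allow_standby_replay true  # one standby-replay per rank
    ceph fs status                                # ranks sit in up:replay / up:resolve, never up:active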

What's wrong with my MDS?

*logs from the MDS that keeps restarting:*
  -214> 2021-03-15 17:39:00.898 7fafe3456700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
  -213> 2021-03-15 17:39:00.898 7fafe3456700  0 mds.beacon.aiceph-08
Skipping beacon heartbeat to monitors (last acked 54.511s ago); MDS
internal heartbeat is not healthy!
  -212> 2021-03-15 17:39:00.911 7fafdf44e700  5 mds.3.migrator
export_finish [dir 0x2000e252701.010100000110* /some/dirs/ [2,head]
auth{2=2} v=143307 cv=0/0 dir_auth=2,3 state=
1073742850|frozentree|exporting f(v3086 m2021-03-12 04:21:43.383480
1303=1303+0) n(v3128 rc2021-03-12 04:21:43.384480 b119066284 1303=1303+0)
hs=0+0,ss=0+0 | ptrwaiter=0 request=0 frozen=1 subtree=1 importing=0
replicated=1 waiter=0 authpin=0 tempexporting=1 0x55df67f36f00]
......
  -204> 2021-03-15 17:39:01.398 7fafe3456700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
  -203> 2021-03-15 17:39:01.398 7fafe3456700  0 mds.beacon.aiceph-08
Skipping beacon heartbeat to monitors (last acked 55.011s ago); MDS
internal heartbeat is not healthy!
  -202> 2021-03-15 17:39:01.401 7fafe5c5b700  1 mds.beacon.aiceph-08 MDS
connection to Monitors appears to be laggy; 55.014s since last acked beacon
  -201> 2021-03-15 17:39:01.401 7fafe5c5b700  5 mds.3.78900  laggy,
deferring export_prep(0x2000e252701.110110011001*) v1
  -200> 2021-03-15 17:39:01.402 7fafe3c57700  1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
  -199> 2021-03-15 17:39:01.402 7fafe3c57700  1 mds.3.78900 skipping upkeep
work because connection to Monitors appears laggy
  -198> 2021-03-15 17:39:01.402 7fafe5c5b700  1 mds.3.objecter
ms_handle_reset 0x55dead0ba800 session 0x55dd8944e160 osd.24
  -197> 2021-03-15 17:39:01.402 7fafe5c5b700  4 mgrc ms_handle_reset
ms_handle_reset con 0x55e09c44b400
  -196> 2021-03-15 17:39:01.402 7fafe5c5b700  4 mgrc reconnect Terminating
session with v2:172.25.64.235:6800/35103
  -195> 2021-03-15 17:39:01.402 7fafe8eea700 10 monclient: get_auth_request
con 0x55e15190f000 auth_method 0
  -194> 2021-03-15 17:39:01.402 7fafe5c5b700  4 mgrc reconnect Starting new
session with [v2:172.25.64.235:6800/35103,v1:172.25.64.235:6801/35103]
  -193> 2021-03-15 17:39:01.402 7fafe86e9700 10 monclient: get_auth_request
con 0x55dead0ba800 auth_method 0
  -192> 2021-03-15 17:39:01.467 7fafe1452700  2 mds.3.cache Memory usage:
 total 60577760, rss 7823000, heap 315644, baseline 315644, 28365 / 2610243
inodes have caps, 33114 caps, 0.0126862 caps per inode
  -191> 2021-03-15 17:39:01.467 7fafe5c5b700  4 mds.3.78900 handle_osd_map
epoch 12458, 2 new blacklist entries
  -190> 2021-03-15 17:39:01.467 7fafe5c5b700 10 monclient: _renew_subs
  -189> 2021-03-15 17:39:01.467 7fafe5c5b700 10 monclient:
_send_mon_message to mon.aiceph-03 at v2:172.25.72.42:3300/0
  -188> 2021-03-15 17:39:01.467 7fafe5c5b700  4 mgrc ms_handle_reset
ms_handle_reset con 0x55dead0ba800
  -187> 2021-03-15 17:39:01.467 7fafe5c5b700  4 mgrc reconnect Terminating
session with v2:172.25.64.235:6800/35103
  -186> 2021-03-15 17:39:01.467 7fafe5c5b700  4 mgrc reconnect waiting to
retry connect until 2021-03-15 17:39:02.403432
  -185> 2021-03-15 17:39:01.467 7fafdf44e700  5 mds.3.migrator
export_finish [dir 0x2000e252701.010100001110*
/ai_root/NLP/Member/zhangpengshen/mode_repo/phoneme_recognizer/wav2letter_simple/4000h_npy/npy/
[2,head] auth{2=2} v=142958 cv=0/0 dir_auth=2,3 state=
1073742850|frozentree|exporting f(v3086 m2021-03-12 04:21:26.970694
1333=1333+0) n(v3128 rc2021-03-12 04:21:26.970694 b121920160 1333=1333+0)
hs=0+0,ss=0+0 | ptrwaiter=0 request=0 frozen=1 subtree=1 importing=0
replicated=1 waiter=0 authpin=0 tempexporting=1 0x55df671ebe00]
......
  -177> 2021-03-15 17:39:01.898 7fafe3456700  5 mds.beacon.aiceph-08
Sending beacon up:active seq 1364089
  -176> 2021-03-15 17:39:01.898 7fafe3456700 10 monclient:
_send_mon_message to mon.aiceph-03 at v2:172.25.72.42:3300/0
......
    -3> 2021-03-15 17:39:11.669 7fafdf44e700  5 mds.3.migrator
export_finish [dir 0x2000e252701.010111101010* /some/dir/ [2,head]
auth{2=2} v=143028 cv=143028/143028 dir_auth=2,3
state=1073742850|frozentree|exporting f(v3086 m2021-03-12 04:22:17.528035
1340=1340+0) n(v3128 rc2021-03-12 04:22:17.529035 b125945332 1340=1340+0)
hs=0+0,ss=0+0 | ptrwaiter=0 request=0 frozen=1 subtree=1 importing=0
replicated=1 waiter=0 authpin=0 tempexporting=1 0x55e029a94500]
    -2> 2021-03-15 17:39:11.728 7fafe5c5b700  1 mds.aiceph-08 Updating MDS
map to version 199401 from mon.2
    -1> 2021-03-15 17:39:11.728 7fafe5c5b700  1 mds.aiceph-08 Map removed
me [mds.aiceph-08{3:977486} state up:active seq 1364038 export targets
0,1,2 addr [v2:172.25.72.47:6960/1073018517,v1:172.25.72.47:6961/1073018517]]
from cluster; respawning! See cluster/monitor logs for details.
     0> 2021-03-15 17:39:11.728 7fafe5c5b700  1 mds.aiceph-08 respawn!

*logs from the standby-replay MDS stuck in resolve:*
2021-03-15 17:39:12.427 7f6c0b139700  1 mds.aiceph-09 Updating MDS map to
version 199401 from mon.0
2021-03-15 17:39:41.092 7f6c0b139700  1 mds.aiceph-09 Updating MDS map to
version 199402 from mon.0
2021-03-15 17:39:41.111 7f6c0b139700  1 mds.aiceph-09 Updating MDS map to
version 199403 from mon.0
2021-03-15 17:39:41.111 7f6c0b139700  1 mds.2.199403 handle_mds_map i am
now mds.2.199403
2021-03-15 17:39:41.111 7f6c0b139700  1 mds.2.199403 handle_mds_map state
change up:standby-replay --> up:replay
2021-03-15 17:39:41.715 7f6c0492c700  1 mds.2.199403 standby_replay_restart
(final takeover pass)
2021-03-15 17:39:41.715 7f6c0492c700  1 mds.2.199403  opening purge_queue
(async)
2021-03-15 17:39:41.715 7f6c0492c700  1 mds.2.199403  opening
open_file_table (async)
2021-03-15 17:39:41.716 7f6c0492c700  1 mds.2.199403 Finished replaying
journal
2021-03-15 17:39:41.716 7f6c0492c700  1 mds.2.199403 making mds journal
writeable
2021-03-15 17:39:42.116 7f6c0b139700  1 mds.aiceph-09 Updating MDS map to
version 199404 from mon.0
2021-03-15 17:39:42.116 7f6c0b139700  1 mds.2.199403 handle_mds_map i am
now mds.2.199403
2021-03-15 17:39:42.116 7f6c0b139700  1 mds.2.199403 handle_mds_map state
change up:replay --> up:resolve
2021-03-15 17:39:42.116 7f6c0b139700  1 mds.2.199403 resolve_start
2021-03-15 17:39:42.116 7f6c0b139700  1 mds.2.199403 reopen_log
2021-03-15 17:40:53.758 7f6c0b139700  1 mds.aiceph-09 Updating MDS map to
version 199407 from mon.0
2021-03-15 17:41:09.478 7f6c0b139700  1 mds.aiceph-09 Updating MDS map to
version 199408 from mon.0
2021-03-15 17:42:05.671 7f6c0b139700  1 mds.aiceph-09 Updating MDS map to
version 199409 from mon.0
2021-03-15 17:42:15.704 7f6c0b139700  1 mds.aiceph-09 Updating MDS map to
version 199410 from mon.0

*logs from the standby-replay MDS stuck in replay:*
2021-03-15 17:37:55.079 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199400 from mon.0
2021-03-15 17:38:52.763 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199401 from mon.0
2021-03-15 17:38:52.763 7f7a6df1c700  1 mds.3.199401 handle_mds_map i am
now mds.3.199401
2021-03-15 17:38:52.763 7f7a6df1c700  1 mds.3.199401 handle_mds_map state
change up:standby-replay --> up:replay
2021-03-15 17:39:21.428 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199402 from mon.0
2021-03-15 17:39:21.447 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199403 from mon.0
2021-03-15 17:39:22.452 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199404 from mon.0
2021-03-15 17:40:34.093 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199407 from mon.0
2021-03-15 17:40:49.813 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199408 from mon.0
2021-03-15 17:41:46.006 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199409 from mon.0
2021-03-15 17:41:56.039 7f7a6df1c700  1 mds.aiceph-05 Updating MDS map to
version 199410 from mon.0