Hi all,
I had an MDS go down on a 12.2.1 cluster. The standby took over, but I don't know what caused the issue. Scrubs are scheduled to start at 23:00 on this cluster, but this appears to have started a minute before. Can anyone help me diagnose this, please? Here's the relevant bit from the MDS log:
2018-02-23 22:59:02.702284 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:02.702334 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:02.959726 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:06.702354 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:06.702366 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:07.959804 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:10.702421 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:10.702434 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:12.959876 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:14.702522 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:14.702535 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:17.959985 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:18.702645 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:18.702670 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:22.702742 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:22.702754 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:22.960063 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:26.702841 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:26.702854 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:27.960141 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:30.702903 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:30.702915 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:32.960228 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:34.703001 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:34.703014 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:37.960301 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:38.703063 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:38.703075 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:42.703147 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:42.703160 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:42.960414 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:46.703209 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:46.703222 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:47.960487 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:50.703305 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:50.703319 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:52.960569 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:54.703365 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:54.703377 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:57.960642 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:58.703447 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-02-23 22:59:58.703461 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
2018-02-23 22:59:59.717665 7f26e0e13700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2018-02-23 22:59:59.719194 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
2018-02-23 22:59:59.719215 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
2018-02-23 22:59:59.719223 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) _finish_flush got (108) Cannot send after transport endpoint shutdown
2018-02-23 22:59:59.719228 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
2018-02-23 22:59:59.719232 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) handle_write_error: multiple write errors, handler already called
2018-02-23 22:59:59.719240 7f26dd60c700 -1 MDSIOContextBase: blacklisted! Restarting...