Re: MDS crash Luminous

John Spray <jspray@xxxxxxxxxx> · Sun, 25 Feb 2018 20:56:25 +0000

On Sat, Feb 24, 2018 at 10:13 AM, David C <dcsysengineer@xxxxxxxxx> wrote:
> Hi All
>
> I had an MDS go down on a 12.2.1 cluster. The standby took over, but I don't
> know what caused the issue. Scrubs are scheduled to start at 23:00 on this
> cluster, but this appears to have started a minute before.
>
> Can anyone help me diagnose this, please? Here's the relevant bit from
> the MDS log:

The messages about the heartbeat map not being healthy are a sign that
a thread somewhere in the MDS is getting stuck and not letting others
get in to do work.  The daemon responds by skipping its beacons to the
monitors, which in turn blacklist the misbehaving MDS daemon.
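
To see how long the daemon was in that state, something like the
following could be run over the MDS log.  It is only a rough sketch: the
function name and the regular expression are my own, written against the
line format in the quoted excerpt (unwrapped, without the "> " quoting),
and are not part of Ceph.

```python
import re
from datetime import datetime

# Assumed log prefix, based on the quoted excerpt:
# "<date> <time.usec> <thread id> <level> <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) ([0-9a-f]+)\s+(-?\d+) (.*)$"
)


def heartbeat_stall_window(lines):
    """Return (first, last, count) for heartbeat timeout messages, or None."""
    stamps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a log line in the assumed format
        message = m.group(4)
        if "heartbeat_map is_healthy" in message and "had timed out" in message:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)
```

Since the log reports a timeout "after 15", the stuck thread had
probably stopped making progress at least 15 seconds before the first
such message, so that is roughly where to start reading the log.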

You'll have a better shot at working out what got jammed up if "debug
mds" is set to something like 7, or if this is happening predictably
at 22:59:30 you could even attach gdb to the running process and grab
a backtrace of all threads.
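
For reference, the two steps could be scripted roughly as below.  This
is a sketch under assumptions: the helper names are made up, the MDS
name and PID are placeholders, and the commands are only built, not
executed.  The first uses the admin socket, so it has to be run on the
host where the MDS lives; the second attaches gdb in batch mode, which
pauses the daemon while the backtraces are collected.

```python
import shlex


def debug_mds_command(mds_name, level=7):
    """Build the admin-socket command that raises the MDS debug level."""
    return ["ceph", "daemon", f"mds.{mds_name}",
            "config", "set", "debug_mds", str(level)]


def gdb_backtrace_command(pid):
    """Build a gdb batch command that prints a backtrace of every thread."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


if __name__ == "__main__":
    # "mdshostname" is the name from the quoted log; 12345 is a placeholder PID.
    print(shlex.join(debug_mds_command("mdshostname")))
    print(shlex.join(gdb_backtrace_command(12345)))
```

Debug symbols for ceph-mds will most likely need to be installed for
the backtrace to be readable.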

John

> 2018-02-23 22:59:02.702284 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:02.702334 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:02.959726 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:06.702354 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:06.702366 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:07.959804 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:10.702421 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:10.702434 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:12.959876 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:14.702522 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:14.702535 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:17.959985 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:18.702645 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:18.702670 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:22.702742 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:22.702754 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:22.960063 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:26.702841 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:26.702854 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:27.960141 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:30.702903 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:30.702915 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:32.960228 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:34.703001 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:34.703014 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:37.960301 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:38.703063 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:38.703075 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:42.703147 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:42.703160 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:42.960414 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:46.703209 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:46.703222 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:47.960487 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:50.703305 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:50.703319 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:52.960569 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:54.703365 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:54.703377 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:57.960642 7f26e461a700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:58.703447 7f26e0612700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:58.703461 7f26e0612700  1 mds.beacon.mdshostname _send
> skipping beacon, heartbeat map not healthy
> 2018-02-23 22:59:59.717665 7f26e0e13700  1 heartbeat_map reset_timeout
> 'MDSRank' had timed out after 15
> 2018-02-23 22:59:59.719194 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> _finish_write_head got (108) Cannot send after transport endpoint shutdown
> 2018-02-23 22:59:59.719215 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> handle_write_error (108) Cannot send after transport endpoint shutdown
> 2018-02-23 22:59:59.719223 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> _finish_flush got (108) Cannot send after transport endpoint shutdown
> 2018-02-23 22:59:59.719228 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> handle_write_error (108) Cannot send after transport endpoint shutdown
> 2018-02-23 22:59:59.719232 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> handle_write_error: multiple write errors, handler already called
> 2018-02-23 22:59:59.719240 7f26dd60c700 -1 MDSIOContextBase: blacklisted!
> Restarting...
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>