Re: MDS hangs in "heartbeat_map" deadlock

Patrick Donnelly <pdonnell@xxxxxxxxxx> · Mon, 8 Oct 2018 15:00:44 -0700



On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman <stefan@xxxxxx> wrote:
> A couple of hours later we hit the same issue. We restarted with
> debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
> hours later (an hour ago) we hit the same issue. We captured ~4.7 GB of
> logging. I skipped to the end of the log file just before the
> "heartbeat_map" messages start:
>
> 2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146 pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
> 2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
> 2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5021
> 2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5022
> 2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5023
> 2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5024
> 2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2018-10-04 23:24:11.088871 7f415bbee700  1 mds.beacon.mds2 _send skipping beacon, heartbeat map not healthy
>
> As far as I can see, this is just normal behaviour.
>
> The big question is: what is happening when the MDS starts logging the heartbeat_map messages?
> Why does it log "heartbeat_map is_healthy", only to log 0.000004 seconds later that it's not healthy?
>
> Ceph version: 12.2.8 on all nodes (mon, osd, mds)
> mds: one active / one standby-replay
>
> The system was not under any kind of resource pressure: plenty of CPU and
> RAM available. Metrics all look normal up to the moment things go into a
> deadlock (so it seems).
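
With a 4.7 GB log, it may be easier to locate the point where a thread went quiet mechanically than by scrolling. The following is a minimal Python sketch, assuming the log line layout shown in the excerpt above (timestamp, thread id, debug level, message). The function names and the SAMPLE list are illustrative only; the sample lines are copied from the quoted log. It reports, for every thread other than the one printing the timeout, how long that thread had been silent when the first "timed out" message appeared.

```python
import re
from datetime import datetime

# Layout assumed from the quoted excerpt: "<date> <time> <thread> <level> <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) ([0-9a-f]+)\s+(-?\d+) (.*)$"
)

def parse(line):
    """Return (timestamp, thread_id, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return ts, m.group(2), m.group(4)

def silence_before_timeout(lines):
    """Map thread id -> seconds since its last log line, measured at the
    first heartbeat_map timeout. The reporting thread itself is excluded."""
    last_seen = {}
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        ts, thread, msg = parsed
        if "heartbeat_map is_healthy" in msg and "timed out" in msg:
            return {
                t: (ts - seen).total_seconds()
                for t, seen in last_seen.items()
                if t != thread
            }
        last_seen[thread] = ts
    return {}

# Lines taken from the excerpt above (quote prefix removed).
SAMPLE = [
    "2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done",
    "2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5024",
    "2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15",
]

if __name__ == "__main__":
    for thread, seconds in silence_before_timeout(SAMPLE).items():
        print(thread, round(seconds, 3))
```

On the excerpt, thread 7f415ebf4700 had been silent for roughly 18 seconds when the beacon thread reported the timeout. Whether that thread is the one the 'MDSRank' heartbeat refers to is an assumption, not something the log states. For the real file, pass an open file object instead of SAMPLE so the log is streamed line by line.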

Thanks for the detailed notes. It looks like the MDS is stuck
somewhere where it's not even outputting any log messages. If possible,
it'd be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS)
or, if you're comfortable with gdb, a backtrace of any threads that look
suspicious (e.g. not waiting on a futex), along with the output of
`info threads`.
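
For picking out the suspicious threads, here is a minimal Python sketch that filters the text of a gdb `thread apply all bt` dump and lists the threads whose top frames do not look like an idle wait. One way to capture such a dump non-interactively is something like `gdb -p <pid> -batch -ex "info threads" -ex "thread apply all bt"`. The IDLE_MARKERS list, the function names, and the SAMPLE_BT text are all assumptions made for illustration; the sample is not output from a real MDS.

```python
import re

# gdb starts each thread's backtrace with a header such as:
#   Thread 3 (Thread 0x7f... (LWP 1234)):
THREAD_HEADER = re.compile(r"^Thread (\d+) \(")

# Assumed substrings indicating a thread parked in a wait; adjust as needed.
IDLE_MARKERS = ("futex", "pthread_cond_wait", "pthread_cond_timedwait", "epoll_wait")

def split_threads(bt_text):
    """Return {gdb thread number: [frame lines]} from a backtrace dump."""
    threads = {}
    current = None
    for line in bt_text.splitlines():
        m = THREAD_HEADER.match(line)
        if m:
            current = int(m.group(1))
            threads[current] = []
        elif current is not None and line.startswith("#"):
            threads[current].append(line)
    return threads

def suspicious_threads(bt_text, top_frames=3):
    """Return sorted gdb thread numbers whose top frames show no idle marker."""
    result = []
    for number, frames in split_threads(bt_text).items():
        top = " ".join(frames[:top_frames])
        if not any(marker in top for marker in IDLE_MARKERS):
            result.append(number)
    return sorted(result)

# Made-up dump for illustration only; frame names are hypothetical.
SAMPLE_BT = """\
Thread 3 (Thread 0x7f0000000300 (LWP 103)):
#0  0x00007f0000001000 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000401000 in hypothetical_worker_wait ()

Thread 2 (Thread 0x7f0000000200 (LWP 102)):
#0  0x0000000000402000 in hypothetical_busy_loop ()
#1  0x0000000000403000 in hypothetical_dispatch ()

Thread 1 (Thread 0x7f0000000100 (LWP 101)):
#0  0x00007f0000002000 in epoll_wait () from /lib64/libc.so.6
#1  0x0000000000404000 in hypothetical_event_loop ()
"""

if __name__ == "__main__":
    print(suspicious_threads(SAMPLE_BT))
```

This is only a first-pass filter. A thread blocked on a contended mutex also sits in a futex call, so in a real deadlock the threads filtered out here may still deserve a look.
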
-- 
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com