Re: MDS hangs in "heartbeat_map" deadlock

On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman <stefan@xxxxxx> wrote:
Dear list,

Today we hit our first Ceph MDS issue. Out of the blue the active MDS
stopped working:

mon.mon1 [WRN] daemon mds.mds1 is not responding, replacing it as rank 0 with standby
daemon mds.mds2.

Logging of ceph-mds1:

2018-10-04 10:50:08.524745 7fdd516bf700 1 mds.mds1 asok_command: status (starting...)
2018-10-04 10:50:08.524782 7fdd516bf700 1 mds.mds1 asok_command: status (complete)

^^ one of our monitoring health checks performing a "ceph daemon mds.mds1 version", business as usual.

2018-10-04 10:52:36.712525 7fdd51ec0700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-10-04 10:52:36.747577 7fdd4deb8700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-10-04 10:52:36.747584 7fdd4deb8700 1 mds.beacon.mds1 _send skipping beacon, heartbeat map not healthy

^^ the unresponsive mds1 consumes 100% CPU and keeps logging the heartbeat_map messages above.

In the meantime ceph-mds2 has transitioned from "standby-replay" to "active":

mon.mon1 [INF] daemon mds.mds2 is now active in filesystem BITED-153874-cephfs as rank 0

Logging:

[... replays, final replay as standby, reopen log ...]

2018-10-04 10:52:53.268470 7fdb231d9700 1 mds.0.141 reconnect_done
2018-10-04 10:52:53.759844 7fdb231d9700 1 mds.mds2 Updating MDS map to version 143 from mon.3
2018-10-04 10:52:53.759859 7fdb231d9700 1 mds.0.141 handle_mds_map i am now mds.0.141
2018-10-04 10:52:53.759862 7fdb231d9700 1 mds.0.141 handle_mds_map state change up:reconnect --> up:rejoin
2018-10-04 10:52:53.759868 7fdb231d9700 1 mds.0.141 rejoin_start
2018-10-04 10:52:53.759970 7fdb231d9700 1 mds.0.141 rejoin_joint_start
2018-10-04 10:52:53.760970 7fdb1d1cd700 0 mds.0.cache failed to open ino 0x10000cd95e9 err -5/0
2018-10-04 10:52:54.126658 7fdb1d1cd700 1 mds.0.141 rejoin_done
2018-10-04 10:52:54.770457 7fdb231d9700 1 mds.mds2 Updating MDS map to version 144 from mon.3
2018-10-04 10:52:54.770484 7fdb231d9700 1 mds.0.141 handle_mds_map i am now mds.0.141
2018-10-04 10:52:54.770487 7fdb231d9700 1 mds.0.141 handle_mds_map state change up:rejoin --> up:clientreplay
2018-10-04 10:52:54.770494 7fdb231d9700 1 mds.0.141 recovery_done -- successful recovery!
2018-10-04 10:52:54.770617 7fdb231d9700 1 mds.0.141 clientreplay_start
2018-10-04 10:52:54.882995 7fdb1d1cd700 1 mds.0.141 clientreplay_done
2018-10-04 10:52:55.778598 7fdb231d9700 1 mds.mds2 Updating MDS map to version 145 from mon.3
2018-10-04 10:52:55.778622 7fdb231d9700 1 mds.0.141 handle_mds_map i am now mds.0.141
2018-10-04 10:52:55.778628 7fdb231d9700 1 mds.0.141 handle_mds_map state change up:clientreplay --> up:active
2018-10-04 10:52:55.778638 7fdb231d9700 1 mds.0.141 active_start
2018-10-04 10:52:55.805206 7fdb231d9700 1 mds.0.141 cluster recovered.

And then it _also_ starts to log heartbeat_map messages and keeps consuming 100% CPU; these messages just repeat themselves:

2018-10-04 10:53:41.550793 7fdb241db700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-10-04 10:53:42.884018 7fdb201d3700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-10-04 10:53:42.884024 7fdb201d3700 1 mds.beacon.mds2 _send skipping beacon, heartbeat map not healthy

At that point in time there is one active MDS according to ceph, but in reality it's
not functioning correctly (not serving clients at least).

... we stopped both daemons. Restarted one ... recovery ...
works for half a minute ... then starts logging heartbeat_map messages.
Restart again ... works for a little while ... starts logging
heartbeat_map messages again. We restart the mds with debug_mds=20 ....
it keeps working fine. The other mds gets restarted and keeps on
working. We do a couple of failover tests .... works flawlessly
(failover in < 1 second, clients reconnect instantly). 
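
(As an aside: the debug levels can also be raised on a live daemon, without
a restart, via the admin socket. A minimal example, assuming the daemon name
mds.mds1 from above:

  ceph daemon mds.mds1 config set debug_mds 20
  ceph daemon mds.mds1 config set debug_journaler 20

The same "config set" can put the defaults back once enough has been
captured.)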

A couple of hours later we hit the same issue. We restarted with
debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
hours later (an hour ago) we hit the same issue. We captured ~ 4.7 GB of
logging.... I skipped to the end of the log file just before the
"hearbeat_map" messages start:

2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146 pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5021
2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5022
2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5023
2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 5024
2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-10-04 23:24:11.088871 7f415bbee700  1 mds.beacon.mds2 _send skipping beacon, heartbeat map not healthy

As far as I can see, just normal behaviour.

The big question is: what is happening when the MDS starts logging the heartbeat_map messages?
Why does it log "heartbeat_map is_healthy", only to log 0.000004 seconds later that it's not healthy?

Ah, there's a misunderstanding here — the output isn't terribly clear. "is_healthy" is the name of a *function* in the source code. The line

heartbeat_map is_healthy 'MDSRank' had timed out after 15

is telling you that the heartbeat_map's is_healthy function is running, and it finds that "'MDSRank' had timed out after 15 [seconds]". So the thread MDSRank is *not* healthy, it didn't check in for 15 seconds! Therefore the MDS beacon code decides not to send a beacon, because it thinks the MDS might be stuck.
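
Side note: if I'm reading it right, that 15-second window is the
mds_beacon_grace setting (default 15). You can check what the running
daemon actually uses with, for example:

  ceph daemon mds.mds2 config get mds_beacon_grace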

From what you've described here, it's most likely that the MDS is trying to read something out of RADOS which is taking a long time, and which we didn't expect to cause a slowdown. You can check via the admin socket to see if there are outstanding Objecter requests or ops_in_flight to get a clue.
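
For reference, a quick way to look at both on the node running the active
MDS (assuming the daemon is mds.mds2; "ops" dumps the MDS operations
currently in flight, "objecter_requests" dumps outstanding RADOS requests):

  ceph daemon mds.mds2 ops
  ceph daemon mds.mds2 objecter_requests

If the same Objecter requests stay pending across repeated runs, the MDS is
most likely stuck waiting on the OSDs rather than on its own work.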
-Greg
 

Ceph version: 12.2.8 on all nodes (mon, osd, mds)
mds: one active / one standby-replay

The system was not under any kind of resource pressure: plenty of CPU, RAM
available. Metrics all look normal up to the moment things go into a deadlock
(so it seems).

Thanks for any hints / clue.

Gr. Stefan



--
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
