Re: MDS hangs in "heartbeat_map" deadlock

Stefan Kooman <stefan@xxxxxx> · Fri, 5 Oct 2018 22:38:41 +0200

Quoting Gregory Farnum (gfarnum@xxxxxxxxxx):
> 
> Ah, there's a misunderstanding here — the output isn't terribly clear.
> "is_healthy" is the name of a *function* in the source code. The line
> 
> heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 
> is telling you that the heartbeat_map's is_healthy function is running, and
> it finds that "'MDSRank' had timed out after 15 [seconds]". So the thread
> MDSRank is *not* healthy, it didn't check in for 15 seconds! Therefore the
> MDS beacon code decides not to send a beacon, because it thinks the MDS
> might be stuck.

Thanks for the explanation.
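
To restate it for the archive, here is a toy model of the check as I now
understand it. This is only a sketch and not Ceph's actual code: the class,
the method names, and the injectable clock are assumptions made for
illustration. Each worker thread records when it last checked in, and the
health check reports any worker whose last check-in is older than its grace
period. The beacon is skipped when that happens.

```python
import time


class HeartbeatMap:
    """Toy heartbeat map (illustrative only, not Ceph's implementation)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._workers = {}  # name -> (grace_seconds, last_check_in)

    def add_worker(self, name, grace):
        # Register a worker and count registration as its first check-in.
        self._workers[name] = (grace, self._clock())

    def reset_timeout(self, name):
        # Called by the worker itself to signal "I am still making progress".
        grace, _ = self._workers[name]
        self._workers[name] = (grace, self._clock())

    def is_healthy(self):
        # Healthy only if every worker checked in within its grace period.
        now = self._clock()
        healthy = True
        for name, (grace, last) in self._workers.items():
            if now - last > grace:
                print(f"heartbeat_map is_healthy '{name}' had timed out after {grace}")
                healthy = False
        return healthy


def maybe_send_beacon(hb):
    """Return True if a beacon would be sent, False if it is skipped."""
    return hb.is_healthy()
```

So the log line comes from the checker and names the thread that failed to
check in; it does not mean that anything called "is_healthy" is fine.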

> From what you've described here, it's most likely that the MDS is trying to
> read something out of RADOS which is taking a long time, and which we
> didn't expect to cause a slow down. You can check via the admin socket to
> see if there are outstanding Objecter requests or ops_in_flight to get a
> clue.

Hmm, I avoided that because of this issue [1]. Killing the MDS while
debugging why it hangs defeats the purpose ;-).

I might check for "Objecter requests".
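
A rough sketch of how I might script that check, assuming the `ceph` CLI is
available on the MDS host and that `ceph daemon mds.<name> objecter_requests`
returns JSON. The helper names and the MDS name "a" are hypothetical, and the
code deliberately makes no assumption about field names: it just counts the
entries in every list-valued top-level field.

```python
import json
import subprocess


def count_pending(admin_json):
    """Count entries in each list-valued top-level field of the JSON text."""
    data = json.loads(admin_json)
    return {key: len(val) for key, val in data.items() if isinstance(val, list)}


def query_admin_socket(mds_name, command="objecter_requests"):
    """Run an admin socket command for mds.<mds_name> and summarize it.

    Assumes the `ceph` CLI is on PATH and that this runs on the MDS host.
    """
    result = subprocess.run(
        ["ceph", "daemon", f"mds.{mds_name}", command],
        check=True,
        capture_output=True,
        text=True,
    )
    return count_pending(result.stdout)


if __name__ == "__main__":
    # "a" is a placeholder MDS name; substitute the real daemon name.
    print(query_admin_socket("a"))
```

Non-zero counts that persist across a few runs while the MDS is skipping
beacons would point at outstanding RADOS requests, as suggested above.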

Thanks!

Gr. Stefan

[1]: http://tracker.ceph.com/issues/26894

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx