Quoting Gregory Farnum (gfarnum@xxxxxxxxxx):
>
> Ah, there's a misunderstanding here -- the output isn't terribly clear.
> "is_healthy" is the name of a *function* in the source code. The line
>
> heartbeat_map is_healthy 'MDSRank' had timed out after 15
>
> is telling you that the heartbeat_map's is_healthy function is running, and
> it finds that "'MDSRank' had timed out after 15 [seconds]". So the thread
> MDSRank is *not* healthy, it didn't check in for 15 seconds! Therefore the
> MDS beacon code decides not to send a beacon, because it thinks the MDS
> might be stuck.

Thanks for the explanation.

> From what you've described here, it's most likely that the MDS is trying to
> read something out of RADOS which is taking a long time, and which we
> didn't expect to cause a slow down. You can check via the admin socket to
> see if there are outstanding Objecter requests or ops_in_flight to get a
> clue.

Hmm, I avoided that because of this issue [1]. Killing the MDS while
debugging why it's hanging is defeating the purpose ;-).

I might check for "Objecter requests".

Thanks!

Gr. Stefan

[1]: http://tracker.ceph.com/issues/26894

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
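
For anyone reading this thread later, the behaviour described in the quoted explanation can be modelled roughly as below. This is only a toy sketch in Python, not Ceph's actual code: the class `HeartbeatMap`, the methods `reset_timeout` and `is_healthy`, and the function `maybe_send_beacon` are hypothetical names chosen for illustration, and the 15 second grace is taken from the log line in the thread. Time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMap:
    """Toy model: workers check in, is_healthy() looks for stale check-ins."""

    def __init__(self, grace_seconds):
        self.grace = grace_seconds
        self.last_checkin = {}  # worker name -> time of last check-in

    def reset_timeout(self, name, now):
        # A worker thread calls this whenever it makes progress.
        self.last_checkin[name] = now

    def is_healthy(self, now):
        # Report every worker that has not checked in within the grace period.
        healthy = True
        for name, seen in self.last_checkin.items():
            if now - seen > self.grace:
                print(f"heartbeat_map is_healthy '{name}' "
                      f"had timed out after {self.grace}")
                healthy = False
        return healthy


def maybe_send_beacon(hb, now):
    # Assumption: the beacon is skipped whenever any worker looks stuck.
    if not hb.is_healthy(now):
        return False  # no beacon sent
    return True       # beacon sent


hb = HeartbeatMap(grace_seconds=15)
hb.reset_timeout("MDSRank", now=0.0)
sent_early = maybe_send_beacon(hb, now=10.0)  # within grace
sent_late = maybe_send_beacon(hb, now=16.0)   # MDSRank is stale
```

The point of the sketch is that the log line is emitted by the health check itself, so seeing "is_healthy" in the log means the check ran and found a stale worker, not that the worker is healthy.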