On Wed, Oct 17, 2018 at 5:01 AM Ricardo Dias <rdias@xxxxxxxx> wrote:
> From the above history, the strange thing that I see is that the
> EventManager didn't call the handle_write on the messenger connection
> since 19:57:45 until the connection is stopped by the MonClient.
> This is the cause for the keepalives and mdsbeacon messages to not be
> sent to the monitor.
> But I don't quite understand why this happens. Maybe the EventManager is
> too busy handling other events?

It's possible. It may also be relevant that the MDS is running in
valgrind. I'm going to see if we have another instance of this in
testing without valgrind.

> Also, let's imagine that the EventManager is thrashing and really takes
> that long to issue a handle_write in the connection. Shouldn't the MDS
> be aware that the MonClient might restart the connection to the monitor
> due to not receiving keepalives ack, and take care of that situation?

The MonClient keepalives predate the MDS's MonClient restarts. We just
haven't gotten around to getting rid of that:
http://tracker.ceph.com/issues/36493

> In the meantime, I'm going to look at the code of the EventManager
> (which wasn't changed by the messenger refactorings) to understand why
> the above situation happened.

Thanks Ricardo for the analysis and for looking into this further.

--
Patrick Donnelly