On Fri, 29 Jan 2010 18:41:10 +0000
Gordan Bobic <gordan@xxxxxxxxxx> wrote:

> I'm seeing things like this in the logs, coupled with things locking up
> for a while until the timeout is complete:
>
> [2010-01-29 18:29:01] E
> [client-protocol.c:415:client_ping_timer_expired] home2: Server
> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
> [2010-01-29 18:29:01] E
> [client-protocol.c:415:client_ping_timer_expired] home2: Server
> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
>
> The thing is, I know for a fact that there is no network outage of any
> sort. All the machines are on a local gigabit ethernet, and there is no
> connectivity loss observed anywhere else. ssh sessions going to the
> machines that are supposedly "not responding" remain alive and well,
> with no lag.

What you're seeing here is exactly what made us increase the ping-timeout
to 120. To us it is obvious that the keepalive strategy does not cope with
even minimal packet loss. You can see packet loss on _every_ network (read
the docs of your switch carefully). Our impression is that the implemented
strategy does not account for the fact that a lost ping packet is not proof
of a disconnected server, only a hint to take a closer look.

-- 
Regards,
Stephan
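
To illustrate the point that a lost ping is a hint rather than proof, here is a minimal Python sketch of a keepalive check that only gives up after several consecutive unanswered pings. It is not how GlusterFS implements its ping timer; KeepaliveMonitor, max_missed and final_state are made-up names, and the threshold of 3 is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveMonitor:
    """Hypothetical monitor: declares a peer dead only after several
    consecutive unanswered pings, not after the first one."""
    max_missed: int = 3   # assumed threshold, not a GlusterFS option
    missed: int = 0       # consecutive pings without a reply so far

    def record(self, replied: bool) -> str:
        """Feed one ping outcome; return 'alive', 'suspect' or 'dead'."""
        if replied:
            self.missed = 0      # any reply clears earlier suspicion
            return "alive"
        self.missed += 1
        if self.missed >= self.max_missed:
            return "dead"        # sustained silence: disconnect now
        return "suspect"         # a hint only: probe again first


def final_state(outcomes, max_missed=3):
    """Replay a sequence of ping outcomes and return the last state."""
    monitor = KeepaliveMonitor(max_missed=max_missed)
    state = "alive"
    for replied in outcomes:
        state = monitor.record(replied)
    return state
```

With this scheme, a single dropped packet between two answered pings leaves the peer "alive", and only an unbroken run of max_missed failures leads to a disconnect. Raising ping-timeout to 120 buys similar tolerance, but every real outage then also takes longer to detect.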