Re: Spurious disconnections / connectivity loss

Stephan von Krawczynski <skraw@xxxxxxxxxx> · Sat, 30 Jan 2010 12:08:29 +0100

On Fri, 29 Jan 2010 18:41:10 +0000
Gordan Bobic <gordan@xxxxxxxxxx> wrote:

> I'm seeing things like this in the logs, coupled with things locking up 
> for a while until the timeout is complete:
> 
> [2010-01-29 18:29:01] E 
> [client-protocol.c:415:client_ping_timer_expired] home2: Server 
> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
> [2010-01-29 18:29:01] E 
> [client-protocol.c:415:client_ping_timer_expired] home2: Server 
> 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
> 
> The thing is, I know for a fact that there is no network outage of any 
> sort. All the machines are on a local gigabit ethernet, and there is no 
> connectivity loss observed anywhere else. ssh sessions going to the 
> machines that are supposedly "not responding" remain alive and well, 
> with no lag.

What you're seeing here is exactly what made us increase the ping-timeout to
120 seconds.
To us it is obvious that the keep-alive strategy does not cope with even
minimal packet loss. You can see packet loss on _every_ network (read the docs
of your switch carefully). Our impression is that the implemented strategy does
not take into account that a lost ping packet is no proof of a disconnected
server, only a hint that a closer look is needed.
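
To illustrate the "hint, not proof" idea, here is a minimal sketch in Python.
It is not GlusterFS code: the PingMonitor class, its method names and the
threshold of three consecutive misses are all made up for illustration. The
point is only that a single lost ping triggers a retry, and the server is
declared gone only after several misses in a row.

```python
from dataclasses import dataclass


@dataclass
class PingMonitor:
    """Hypothetical keep-alive tracker; an illustration, not GlusterFS code."""

    max_missed: int = 3  # assumed threshold of consecutive lost pings
    missed: int = 0      # consecutive pings that went unanswered so far

    def on_reply(self) -> None:
        # Any reply proves the server is alive, so forget earlier misses.
        self.missed = 0

    def on_timeout(self) -> str:
        # A single lost ping is only a hint: retry before giving up.
        self.missed += 1
        if self.missed >= self.max_missed:
            return "disconnect"
        return "retry"
```

With something like this, one dropped packet costs a retry instead of a full
disconnect, and the timeout for a single ping can stay short.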

-- 
Regards,
Stephan