I'm seeing things like this in the logs, coupled with operations
locking up until the timeout expires:
[2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired] home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
The thing is, I know for a fact that there is no network outage of any
sort. All the machines are on a local gigabit ethernet, and there is no
connectivity loss observed anywhere else. ssh sessions going to the
machines that are supposedly "not responding" remain alive and well,
with no lag.
The NICs in all the servers are a mix of Marvell (using the Marvell
sk98lin driver) and Realtek (using the Realtek r8168 driver) - none of
which have exhibited any other observable problems in use.
In 42 seconds, TCP would have retransmitted if the packets had really
been lost, so I'm not convinced it's packet loss (glfs uses TCP,
right?). If it's not packet loss, then that implies that the glfs
daemons get stuck somewhere and either miss or ignore the packets in
question.
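To illustrate the failure mode I have in mind (a rough sketch of my own
in Python, not actual gluster code - the port, message, and 5 second
timeout are made up): an application-level ping timer like the one in
client-protocol.c expires when the peer process is blocked and never
services the request, even though TCP itself has delivered and
acknowledged every byte:

import socket
import threading
import time

PING_TIMEOUT = 5  # seconds; the timer in my logs is 42

def stuck_server(port):
    # Accept the connection, read the ping, then block. The kernel
    # ACKs the ping at the TCP layer, but no reply is ever sent -
    # analogous to a daemon wedged under load.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.recv(16)
    time.sleep(PING_TIMEOUT + 5)
    conn.close()

threading.Thread(target=stuck_server, args=(9999,), daemon=True).start()
time.sleep(0.2)  # let the server start listening

cli = socket.create_connection(("127.0.0.1", 9999))
cli.settimeout(PING_TIMEOUT)
cli.sendall(b"PING")
try:
    cli.recv(16)  # wait for the application-level reply
except socket.timeout:
    # No retransmits were needed and the connection is still
    # ESTABLISHED; only the application is silent.
    print("ping timer expired: server has not responded, disconnecting")

The connection stays ESTABLISHED throughout - which matches the picture
I'm seeing, where ssh to the same box works fine while glfs declares it
dead.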
It smells like a bug, and it's not a new one, either - I have observed
this in 2.0.x, too. It typically happens under heavy load (e.g.
resyncing a volume to an empty server, or doing "ls -laR" on a volume
to make sure it's up to date on all servers). In such cases, the
network bandwidth used is nowhere near what the network can handle, nor
are the CPUs in the servers anywhere near maxed out - most of the time
is spent waiting on latency (network round trips and context switches).
So I don't think it's a load (CPU or network) issue.
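One crude check I can run on my side in the meantime (again just a
sketch of my own, reading the kernel's global TCP counters, nothing
gluster-specific): sample RetransSegs from /proc/net/snmp across one of
these lockups. If the counter barely moves while the ping timer
expires, the pings were not being lost on the wire:

import time

def retrans_segs():
    # /proc/net/snmp (Linux) has a "Tcp:" header line followed by a
    # "Tcp:" value line; the columns line up, so index into both.
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("RetransSegs")])

before = retrans_segs()
time.sleep(60)  # span one of the 42-second lockup windows
after = retrans_segs()
print("TCP segments retransmitted in window:", after - before)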
Is there a way to help debug this further?
Gordan