I'm seeing things like this in the logs, coupled with operations
locking up until the timeout expires:
[2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired] home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds, disconnecting.
The thing is, I know for a fact that there is no network outage of any
sort. All the machines are on a local gigabit ethernet, and there is no
connectivity loss observed anywhere else. ssh sessions going to the
machines that are supposedly "not responding" remain alive and well,
with no lag.
The NICs in all the servers are a mix of Marvell (using the Marvell
sk98lin driver) and Realtek (using the Realtek r8168 driver) - none of
which have exhibited any other observable problems in use.
In 42 seconds, TCP would have retransmitted if the packets had really
been lost, so I'm not convinced it's packet loss (glfs uses TCP,
right?). If it's not packet loss, then that implies that the glfs
daemons get stuck somewhere and either miss or ignore the packets in
question.
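To illustrate the failure mode I have in mind (a rough sketch of my own
in Python, not actual gluster code - the port, message, and 5 second
timeout are made up): an application-level ping timer like the one in
client-protocol.c expires when the peer process is blocked and never
services the request, even though TCP itself has delivered and
acknowledged every byte:

import socket
import threading
import time

PING_TIMEOUT = 5  # seconds; the timer in my logs is 42

def stuck_server(port):
    # Accept the connection, read the ping, then block. The kernel
    # ACKs the ping at the TCP layer, but no reply is ever sent -
    # analogous to a daemon wedged under load.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.recv(16)
    time.sleep(PING_TIMEOUT + 5)
    conn.close()

threading.Thread(target=stuck_server, args=(9999,), daemon=True).start()
time.sleep(0.2)  # let the server start listening

cli = socket.create_connection(("127.0.0.1", 9999))
cli.settimeout(PING_TIMEOUT)
cli.sendall(b"PING")
try:
    cli.recv(16)  # wait for the application-level reply
except socket.timeout:
    # No retransmits were needed and the connection is still
    # ESTABLISHED; only the application is silent.
    print("ping timer expired: server has not responded, disconnecting")

The connection stays ESTABLISHED throughout - which matches the picture
I'm seeing, where ssh to the same box works fine while glfs declares it
dead.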
It smells like a bug, and it's not a new one, either - I have observed
this in 2.0.x, too. It typically happens under heavy load (e.g.
resyncing a volume to an empty server, or doing "ls -laR" on a volume
to make sure it's up to date on all servers). In such cases, the
network bandwidth used is nowhere near what the network can handle, nor
are the CPUs in the servers anywhere near maxed out - most of the time
is spent waiting on latency (network round trips and context switches).
So I don't think it's a load (CPU or network) issue.
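One crude check I can run on my side in the meantime (again just a
sketch of my own, reading the kernel's global TCP counters, nothing
gluster-specific): sample RetransSegs from /proc/net/snmp across one of
these lockups. If the counter barely moves while the ping timer
expires, the pings were not being lost on the wire:

import time

def retrans_segs():
    # /proc/net/snmp (Linux) has a "Tcp:" header line followed by a
    # "Tcp:" value line; the columns line up, so index into both.
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("RetransSegs")])

before = retrans_segs()
time.sleep(60)  # span one of the 42-second lockup windows
after = retrans_segs()
print("TCP segments retransmitted in window:", after - before)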
Is there a way to help debug this further?
Gordan