Re: Spurious disconnections / connectivity loss

Gordan,
  can you post the complete client log from the time of mount?

Avati
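
For reference, assuming the client is mounted directly with the glusterfs
command line (as is typical for a 2.x/3.0-era setup), a DEBUG-level log from
the time of mount can usually be captured by remounting along these lines; the
volfile path, log path and mount point here are placeholders, and exact option
names can vary slightly between releases:

    glusterfs --volfile=/etc/glusterfs/client.vol \
              --log-level=DEBUG \
              --log-file=/var/log/glusterfs/home2-client.log \
              /mnt/home2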

On Sat, Jan 30, 2010 at 12:11 AM, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
> I'm seeing things like this in the logs, coupled with things locking up for
> a while until the timeout is complete:
>
> [2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired]
> home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds,
> disconnecting.
> [2010-01-29 18:29:01] E [client-protocol.c:415:client_ping_timer_expired]
> home2: Server 10.2.0.10:6997 has not responded in the last 42 seconds,
> disconnecting.
>
> The thing is, I know for a fact that there is no network outage of any sort.
> All the machines are on a local gigabit ethernet, and there is no
> connectivity loss observed anywhere else. SSH sessions to the machines
> that are supposedly "not responding" remain alive and well, with no lag.
>
> The NICs in all the servers are a mix of Marvell (using the Marvell sk98lin
> driver) and Realtek (using the Realtek r8168 driver) - none of which have
> exhibited any other observable problems in use.
>
> In 42 seconds, TCP would have re-transmitted if the packets had really been
> lost, so I'm not convinced it's packet loss (glfs uses TCP, right?).
> If it's not packet loss, then that implies that glfs daemons get stuck
> somewhere and either miss or ignore the packets in question. It smells like
> a bug, and it's not a new one, either - I have observed this in 2.0.x, too.
> It typically happens under heavy load (e.g. resyncing a volume to an empty
> server, or doing "ls -laR" on a volume to make sure it's up to date on all
> servers). In such cases, the network bandwidth used is nowhere near what the
> network can handle, nor are the CPUs in the servers anywhere near being
> maxed out - most of the time is spent waiting for the latencies (ping and
> context switches) to catch up. So I don't think it's a load (CPU or network)
> issue.
>
> Is there a way to help debug this further?
>
> Gordan
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
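
For context, the "42 seconds" in the log excerpt above is the client-side
ping-timeout, which in the 2.x/3.0 volfile model is an option on the
protocol/client translator. A minimal sketch of the relevant stanza follows
(the volume name and server address match the log, but the remote subvolume
name is an assumption, and a real volfile will carry more options):

    volume home2
      type protocol/client
      option transport-type tcp
      option remote-host 10.2.0.10
      option remote-subvolume brick      # assumed name of the server-side export
      option ping-timeout 42             # default: seconds of silence before disconnecting
    end-volume

Raising ping-timeout only hides the stall rather than fixing it, but changing
it is a quick way to confirm whether the disconnections track this timer.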



