I still get spurious disconnects with 3.4.0alpha3. While testing, I also note that this patch has not been pulled up to the 3.4 branch, although it fixes a problem I encountered on alpha2:
http://review.gluster.com/#/c/4588/

Here is the first occurrence of a spurious disconnect on the client side (I added debug messages):

[2013-04-17 21:07:47.198612] E [socket.c:487:__socket_rwv] 0-gfs33-client-2: EOF on socket (errno = 0, opcount = 1, opvector[0].iov_len = 4)
[2013-04-17 21:07:47.198824] W [socket.c:515:__socket_rwv] 0-gfs33-client-2: readv failed (No message available)
[2013-04-17 21:07:47.198947] W [socket.c:1963:__socket_proto_state_machine] 0-gfs33-client-2: reading from socket failed. Error (No message available), peer (192.0.2.103:49153)
[2013-04-17 21:07:47.199000] I [client.c:2097:client_rpc_notify] 0-gfs33-client-2: disconnected
[2013-04-17 21:07:47.266289] W [client-rpc-fops.c:1640:client3_3_entrylk_cbk] 0-gfs33-client-2: remote operation failed: Socket is not connected

In socket.c, EOF is declared because ret is 0. ret may come from iov_load() or from readv(); I have not yet determined which one is the culprit.

On the brick side, I get this:

[2013-04-17 21:07:47.208168] E [event-poll.c:346:event_dispatch_poll_handler] 0-poll: index not found for fd=8 (idx_hint=5)

A tcpdump running at the same time on the brick side reports a TCP RST at 22:07:47.208163. Since, as I recall, glusterfs does not log in local time, I believe this corresponds to 21:07:47.208163 in the glusterfs logs. There is also a small clock skew between client (offset -0.000732) and brick (offset -0.006740), which means the brick is 6008 µs behind the client.

As I understand it, that means the TCP reset happens after the ret = 0 in socket.c:487. I therefore strongly suspect iov_load().

Opinions? Any hint?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@xxxxxxxxxx