Client hang on HA config when restoring a server

Kevan Benson <kbenson@xxxxxxxxxxxxxxx> · Wed, 8 Aug 2007 12:32:54 -0700

Same config from previous post:
System configs at http://glusterfs.pastebin.com/m52564c56
Server A: 172.16.1.81
Server B: 172.16.1.82
Client A: 172.16.1.85
Client B: 172.16.1.86

I have noticed in this config, though, that sometimes on restoring a
failed server, one of the clients will still not be able to list the mount
(reports error "ls: /mnt/glusterfs/: Transport endpoint is not
connected") until I restart the client on that system.  In this case,
the system goes from both clients working with one failed server to
one client NOT working with both servers up.  The client fails at the
point that the active server reconnects to the inactive server.

The last thing the client log shows is this:
2007-08-08 12:25:24 D [client-protocol.c:4218:client_protocol_reconnect] share: breaking reconnect chain

And the server log (on the only server showing activity when the request
is made) shows this for every ls request:
2007-08-08 12:27:11 E [unify.c:337:unify_lookup] share: : Argument not right

As stated above, restarting the client fixes this problem.  The other
client in this setup has no problem, but it's random (to me at least)
which client has the problem.
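
For anyone trying to catch this state from a script rather than by hand,
here is a minimal sketch in Python.  It assumes the hang surfaces to
userspace as errno ENOTCONN, which is the errno behind the "Transport
endpoint is not connected" message from ls.  The function name
mount_is_connected and its lister parameter are made up for this
illustration and are not part of GlusterFS.  It only detects the
condition; restarting the client is still a manual step.

```python
import errno
import os


def mount_is_connected(path, lister=os.listdir):
    """Return False if listing `path` fails with ENOTCONN, True otherwise.

    `lister` is a hypothetical hook so the check can be exercised
    without a real broken mount; it defaults to os.listdir.
    """
    try:
        lister(path)
    except OSError as exc:
        # ENOTCONN is what ls reports as
        # "Transport endpoint is not connected".
        if exc.errno == errno.ENOTCONN:
            return False
        # Any other failure (missing directory, permissions) is a
        # different problem, so let it propagate.
        raise
    return True


# Example use on a client (mount point taken from the error above):
#     if not mount_is_connected("/mnt/glusterfs/"):
#         print("mount is stale, client needs a restart")
```

Other errors are re-raised on purpose so that a missing mount point is
not mistaken for this particular failure.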

-- 
- Kevan Benson
- A-1 Networks