Replication not working on server hang

skraw at ithnet.com (Stephan von Krawczynski) · Fri, 28 Aug 2009 13:32:48 +0200

> [...]
> Glusterfs log only shows lines like this ones:
> 
> [2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing 
> out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
> [2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing 
> out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
> 
> Once server2 has been rebooted all gluster fs become available
> again on all clients and the hanged df and ls processes terminate,
> but difficult to understand why a replicated share that must survive
> to failure on one server does not.

You are suffering from the problem we talked about a few days ago on the list.
If the local fs on one server somehow deadlocks, glusterfs is currently
unable to cope with the situation and just _waits_ for things to happen.
This needlessly deadlocks your clients, too.
Your experience backs my criticism of how these situations are handled.
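
The quoted log lines show how long that waiting lasts. As a rough
sketch (the regex and the helper name bail_delays are only my own
illustration, not anything shipped with glusterfs), this compares the
"frame sent" time with the time of the call_bail message:

```python
import re
from datetime import datetime

# The two log lines quoted above, with the mail line-wrapping undone.
LOG = """\
[2009-08-28 09:19:28] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:49:18. frame-timeout = 1800
[2009-08-28 09:23:38] E [client-protocol.c:292:call_bail] data2: bailing out frame LOOKUP(32) frame sent = 2009-08-28 08:53:28. frame-timeout = 1800
"""

# Assumed pattern, derived only from the two lines above.
PATTERN = re.compile(
    r"\[(?P<bailed>[\d-]+ [\d:]+)\].*?call_bail.*?"
    r"frame sent = (?P<sent>[\d-]+ [\d:]+)\. frame-timeout = (?P<timeout>\d+)"
)
FMT = "%Y-%m-%d %H:%M:%S"


def bail_delays(text):
    """Return (seconds between send and bail, logged frame-timeout) per line."""
    delays = []
    for m in PATTERN.finditer(text):
        sent = datetime.strptime(m["sent"], FMT)
        bailed = datetime.strptime(m["bailed"], FMT)
        delays.append((int((bailed - sent).total_seconds()), int(m["timeout"])))
    return delays


for waited, timeout in bail_delays(LOG):
    print(f"waited {waited}s before bailing (frame-timeout = {timeout}s)")
```

For both lines the LOOKUP was bailed about half an hour after it was
sent, which matches the logged frame-timeout of 1800 seconds.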

-- 
Regards,
Stephan