Thanks to everyone for their replies...

On Tue, Feb 11, 2014 at 2:37 AM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
> The 42 second hang is most likely the ping timeout of the client translator.

Indeed, I think it is...

>
> What most likely happened was that the brick on annex3 was being used
> for the read when you pulled its plug. When you pulled the plug, the
> connection between the client and annex3 isn't gracefully terminated
> and the client translator still sees the connection as alive. Because
> of this the next fop is also sent to annex3, but it will time out as
> annex3 is dead. After the timeout happens, the connection is marked as
> dead, and the associated client xlator is marked as down. Since afr
> now knows annex3 is dead, it sends the next fop to annex4, which is
> still alive.

I think this sounds right... My thought was that maybe Gluster could do better somehow. For example, once the timeout counter passes some short threshold (say 1 sec), the client could immediately start looking for a different brick to continue from. This way a routine failover wouldn't interrupt activity for 42 seconds. Maybe this is a feature that could be part of the new style replication?

>
> These kinds of unclean connection terminations are only handled by
> request/ping timeouts currently. You could set the ping timeout values
> to be lower, to reduce the detection time.

The reason I don't want to set this value significantly lower is that in the case of a _real_ disaster, or a high load condition, I want to have the 42 seconds to give things a chance to recover without having to kill the "in process" client mount. So it makes sense to keep it like this.

>
> ~kaushal

Cheers,
James
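
To make the timing difference concrete, here is a toy Python model of the two behaviours discussed above. It is only a sketch and not Gluster code: the `Brick` class, both `read_*` functions, the `suspect` flag and the 1 second `SOFT_TIMEOUT` are hypothetical names and values made up for illustration. The only number taken from the thread is the 42 second ping timeout (as far as I know, this is the default of the `network.ping-timeout` volume option).

```python
from dataclasses import dataclass

PING_TIMEOUT = 42.0  # seconds; the hang observed in this thread
SOFT_TIMEOUT = 1.0   # seconds; hypothetical "try another brick" threshold


@dataclass
class Brick:
    name: str
    alive: bool = True         # is the server actually answering?
    marked_down: bool = False  # has the client declared the connection dead?
    suspect: bool = False      # hypothetical: slow, but not yet declared dead


def read_current(bricks):
    """Model of the behaviour described by Kaushal.

    A dead brick is only detected after the full ping timeout, after which
    it is marked down and the read moves on to the next brick.
    Returns (name of the brick that served the read, seconds stalled).
    """
    stalled = 0.0
    for brick in bricks:
        if brick.marked_down:
            continue
        if brick.alive:
            return brick.name, stalled
        stalled += PING_TIMEOUT
        brick.marked_down = True
    raise IOError("no bricks available")


def read_with_soft_timeout(bricks):
    """Model of the proposed behaviour.

    After SOFT_TIMEOUT the read is retried on another brick. The slow brick
    is only flagged as suspect, not marked down, so it would still get the
    full ping timeout to recover. Suspect bricks are tried last.
    """
    stalled = 0.0
    # sorted() is stable and False sorts before True: healthy bricks first.
    for brick in sorted(bricks, key=lambda b: b.suspect):
        if brick.marked_down:
            continue
        if brick.alive:
            return brick.name, stalled
        stalled += SOFT_TIMEOUT
        brick.suspect = True
    raise IOError("no bricks available")
```

With annex3 unplugged and annex4 healthy, the first model stalls the first read for 42 seconds before annex4 serves it, and later reads go straight to annex4. The second model stalls for only 1 second and leaves annex3 unmarked, which is the trade-off I was after: a fast routine failover without giving up the long grace period for a real disaster or high load condition. How afr would stay consistent if the suspect brick came back mid-operation is not modelled here.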