Issue detecting dead peer

"Kemp, Joseph A. (JKEMP)" <JKEMP@xxxxxxxxx> · Wed, 5 Feb 2014 18:28:18 +0000

I am running some tests using two KVM hosts, each with a CentOS 6.5 instance running Gluster 3.4.2.
The Gluster instances act as both server and client, mounting the Gluster volume they also serve.
During my tests there is no file access occurring on the Gluster volume.

I am seeing an issue when I forcibly disconnect node1 from the network.
Node2 can take several minutes to detect that node1 is disconnected.
During this time, running “gluster peer status” on node2 shows node1 as connected.
The first run of “gluster volume status” takes two minutes to time out and then returns with no output.
Subsequent runs of “gluster volume status” return quickly with “Another transaction is in progress. Please try again after sometime.”
Eventually “gluster peer status” shows node1 as disconnected.
At that point “gluster volume status” starts to return quickly.
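
For anyone wanting to put a number on the delay described above, here is a minimal sketch that polls “gluster peer status” on node2 and times how long it takes for a peer to be reported as disconnected. The helper names (run_peer_status, peer_states, wait_for_disconnect) are hypothetical, and the parsing assumes the usual "Hostname:" / "State: ... (Connected)" layout of the peer status output, which may differ between versions.

```python
import subprocess
import time


def run_peer_status():
    """Return the text printed by `gluster peer status` (needs the gluster CLI)."""
    return subprocess.check_output(["gluster", "peer", "status"], text=True)


def peer_states(status_text):
    """Map each peer hostname to True if its State line says (Connected).

    Assumption: every peer is listed as a "Hostname:" line followed later
    by a "State:" line ending in "(Connected)" or "(Disconnected)".
    """
    states = {}
    host = None
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("Hostname:"):
            host = line.split(":", 1)[1].strip()
        elif line.startswith("State:") and host is not None:
            states[host] = "(Connected)" in line
            host = None
    return states


def wait_for_disconnect(peer, interval=1.0, timeout=600.0, run=run_peer_status):
    """Poll until `peer` is reported as not connected.

    Returns the elapsed seconds, or None if `timeout` seconds pass first.
    `run` is injectable so the loop can be exercised without a cluster.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if peer_states(run()).get(peer) is False:
            return time.monotonic() - start
        time.sleep(interval)
    return None
```

Starting wait_for_disconnect("node1") on node2 just before running “service network stop” on node1 should give a rough figure for the detection delay. The result is only as precise as the polling interval.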

This behavior is only seen when I do a “service network stop” on node1 to simulate a node failure.
If I do a “service glusterd stop” on node1 to cleanly shut down Gluster, node2 sees node1 as disconnected immediately.
The volume status commands return immediately.

What is the mechanism for a node to detect that a peer has failed?
The delay I am seeing would be worrisome in a production environment.

Thanks,
-Joe

System Administration
ARINC Direct
410-266-4028

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users