These logs show different results. The results you reported and pasted earlier included "[2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 0-firewall-scripts-replicate-0: no subvolumes up", which would produce the "Transport endpoint not connected" error you reported at first. These results look normal and should have produced the behavior I described.

42 is The Answer to Life, The Universe, and Everything.

Re-establishing FDs and locks is an expensive operation. The ping-timeout is long because it should not happen: if there is temporary network congestion, you would (normally) rather have your volume stay up and pause briefly than have to re-establish everything. Typically, unless you expect your servers to crash often, leaving ping-timeout at the default is best. YMMV, and it's configurable in case you know what you're doing and why.

On 07/13/2013 04:58 PM, Greg Scott wrote:
>
> Log files sent privately to Joe. If others from the community want to
> look at them, I'm OK with posting them here. I don't think they have
> anything confidential. Now that I know about that 42 second timeout,
> the behavior makes more sense. Why 42? What's special about 42?
> Is there a way I can adjust that down for my application to, say,
> 1 or 2 seconds?
>
> -Greg
>
> *From:* Joe Julian [mailto:joe at julianfamily.org]
> *Sent:* Saturday, July 13, 2013 4:28 PM
> *To:* Greg Scott; 'gluster-users at gluster.org'
> *Subject:* Re: One node goes offline, the other node
> can't see the replicated volume anymore
>
> Huh.. this was in my sent folder... let's try again.
>
> There's something missing from this picture. The logs show that the
> client is connecting to both servers, but it only shows the
> disconnection from one and claims that it's not connected to any
> bricks after that.
>
> Here's the data I'd like you to generate:
>
> unmount the clients
> gluster volume set firewall-scripts diagnostics.client-log-level DEBUG
> gluster volume set firewall-scripts diagnostics.brick-log-level DEBUG
> systemctl stop glusterd.service
> truncate the client, glusterd, and server logs
> systemctl start glusterd
> mount /firewall-scripts
> Do your iptables disconnect
> telnet $this_host_ip 24007   # report whether or not it establishes
> a connection
> ls /firewall-scripts
> wait 42 seconds
> ls /firewall-scripts
> Remove the iptables rule
> ls /firewall-scripts
> tar up the logs and email them to me.
>
> You can reset the log-level afterward:
>
> gluster volume reset firewall-scripts diagnostics.client-log-level
> gluster volume reset firewall-scripts diagnostics.brick-log-level
>
> Lastly, do you have a loopback interface (lo) on 127.0.0.1, and is
> localhost defined in /etc/hosts?
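
A footnote on the "can I adjust that down" question above: the 42-second window is the network.ping-timeout volume option. A minimal sketch of changing it, assuming the firewall-scripts volume name from this thread and a 10-second value picked purely for illustration (as noted above, lowering it trades tolerance of brief network hiccups for faster failover):

    # Show current options; ping-timeout only appears under
    # "Options Reconfigured" once it differs from the 42s default.
    gluster volume info firewall-scripts

    # Lower the timeout to 10 seconds (illustrative value only).
    gluster volume set firewall-scripts network.ping-timeout 10

    # Put it back to the default later if the trade-off isn't worth it.
    gluster volume reset firewall-scripts network.ping-timeout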
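
For anyone following along, here is the quoted diagnostic sequence gathered into one shell sketch. Several pieces are assumptions, not part of the original instructions: logs are assumed to live under /var/log/glusterfs, PEER_IP and this_host_ip are placeholder addresses you would fill in, and the DROP rule is just one way to do the "iptables disconnect" mentioned above.

    #!/bin/sh
    # Placeholders -- substitute your own addresses.
    PEER_IP=192.168.1.2
    this_host_ip=192.168.1.1

    umount /firewall-scripts                     # unmount the client
    gluster volume set firewall-scripts diagnostics.client-log-level DEBUG
    gluster volume set firewall-scripts diagnostics.brick-log-level DEBUG
    systemctl stop glusterd.service

    # Assumed log locations for the client, glusterd, and brick logs.
    truncate -s 0 /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log

    systemctl start glusterd.service
    mount /firewall-scripts

    iptables -I INPUT -s "$PEER_IP" -j DROP      # simulate the disconnect
    telnet "$this_host_ip" 24007                 # note whether it connects
    ls /firewall-scripts
    sleep 42                                     # wait out the ping-timeout
    ls /firewall-scripts
    iptables -D INPUT -s "$PEER_IP" -j DROP      # remove the rule again
    ls /firewall-scripts

    tar czf /tmp/gluster-debug-logs.tar.gz /var/log/glusterfs

    # Reset the log levels when done.
    gluster volume reset firewall-scripts diagnostics.client-log-level
    gluster volume reset firewall-scripts diagnostics.brick-log-level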