One node goes offline, the other node can't see the replicated volume anymore

joe at julianfamily.org (Joe Julian) · Sat, 13 Jul 2013 09:22:32 -0700

No, they're equal peers. Each client connects to both servers after retrieving the configuration from the server specified in the mount command.

When a server shuts down, the TCP connection is properly closed and the clients continue to operate with the remaining servers. In a replicated volume that means without any missing data.

When the TCP connection is not closed, the client will keep trying to reach the missing server for network.ping-timeout seconds (42 by default). The filesystem appears frozen during that timeout. Once it has timed out, the client should continue as above.
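
To make the two cases concrete, here is a minimal sketch in Python. It is a toy model of the behaviour described above, not Gluster code: the names ServerEvent, client_freeze_seconds and remaining_servers are made up for illustration, and the only value taken from this thread is the 42-second network.ping-timeout.

```python
from dataclasses import dataclass

# Value of network.ping-timeout quoted in this thread, in seconds.
PING_TIMEOUT_SECONDS = 42


@dataclass
class ServerEvent:
    """A server disappearing from the client's point of view (hypothetical type)."""
    name: str             # illustrative server name, e.g. "fw1"
    clean_shutdown: bool  # True if the TCP connection was closed properly


def client_freeze_seconds(event: ServerEvent,
                          ping_timeout: int = PING_TIMEOUT_SECONDS) -> int:
    """Return how long the mount appears frozen after one server goes away.

    A clean shutdown closes the TCP connection, so the client carries on
    immediately. Otherwise the client waits out the ping timeout first.
    """
    return 0 if event.clean_shutdown else ping_timeout


def remaining_servers(servers, lost):
    """Return the servers the client can still use, in their original order."""
    return [s for s in servers if s not in lost]


if __name__ == "__main__":
    servers = ["fw1", "fw2"]
    for event in (ServerEvent("fw1", True), ServerEvent("fw1", False)):
        print(event.name,
              "clean" if event.clean_shutdown else "unclean",
              "-> frozen for", client_freeze_seconds(event), "s,",
              "still serving from", remaining_servers(servers, {event.name}))
```

In this model, losing one server of a two-server replica always leaves one server to work with. An empty result from remaining_servers would correspond to a client that has lost every server, which is the abnormal case.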

Your logs, however, say that the client has lost its connection with ALL the servers. What I've seen of your logs so far doesn't show both disconnects; I've only seen the last one. If you'll follow my instructions, I can get a clearer picture of what's going wrong.

This is one of the reasons I hate mailing lists and do most of my support via IRC. On IRC there aren't these hours- or days-long delays between messages. We're generally able to solve the worst problems in a few hours, so I feel I am making a difference.

Anyway, follow my complete instructions and I'll help you further. I'm sure we can figure this out.

Greg Scott <GregScott at infrasupport.com> wrote:

>I was out all day yesterday - is there anything I can do to fix this
>problem or is this just pretty much how Gluster works?
>
>- Greg
>
>
>-----Original Message-----
>From: gluster-users-bounces at gluster.org
>[mailto:gluster-users-bounces at gluster.org] On Behalf Of Greg Scott
>Sent: Thursday, July 11, 2013 6:44 PM
>To: 'Joe Julian'
>Cc: 'gluster-users at gluster.org'
>Subject: Re: One node goes offline, the other node
>can't see the replicated volume anymore
>
>So back to the problem at hand - I think what's going on is, both nodes
>fw1 and fw2 try to satisfy reads from fw1 first.  That's why fw2 can't
>find the /firewall-scripts file system when it becomes isolated from
>fw1, and why fw1 always seems to be able to find it.  What makes fw1 so
>important?  Near as I can tell, because fw1 is the first in the list
>and I used node fw1 to set up my Gluster volume.  
>
>So after putting up with me for page after page of text digging into
>the problem details, is there anything we can do to tell Gluster to
>satisfy reads locally, especially when the other brick is offline?  
>
>Thanks
>
>- Greg