Re: One node goes offline, the other node loses its connection to its local Gluster volume

Joe Julian <joe@xxxxxxxxxxxxxxxx> · Thu, 06 Mar 2014 14:18:13 -0800



    On 02/22/2014 05:44 PM, Greg Scott wrote:

    
      I have 2 nodes named fw1 and fw2. When I
        ifdown the NIC I’m using for Gluster on either node, that node
        cannot see its Gluster volume, but the other node can see it
        after a timeout. As soon as I ifup that NIC, everyone can see
        everything again.
       
      Is this expected behavior?  When that
        interconnect drops, I want both nodes to see their own local
        copy and then sync everything back up when the interconnect
        connects again. 
      
    
    If a client loses communication with a server over an open TCP
    connection, there is a timeout period (network.ping-timeout, 42
    seconds by default) during which the client waits for communication
    to resume. It waits because dropping and re-establishing hundreds to
    potentially tens of thousands of file descriptors and locks is a very
    expensive process that disrupts the entire environment.
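    To make the grace period concrete, here is a toy Python model of the
    behavior described above. ReplicaLink, client_can_proceed and the
    "up/waiting/disconnected" states are hypothetical names for
    illustration, not Gluster internals; the only real value is the
    42-second default of network.ping-timeout. In this simplification,
    operations stay blocked while any replica is inside its grace period
    and resume on the surviving replicas once it expires.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_PING_TIMEOUT = 42.0  # seconds; default of Gluster's network.ping-timeout


@dataclass
class ReplicaLink:
    """Toy model of one client-to-server connection (hypothetical, not Gluster code)."""

    server: str
    silent_since: Optional[float] = None  # time the server stopped answering, if ever

    def state(self, now: float, ping_timeout: float = DEFAULT_PING_TIMEOUT) -> str:
        if self.silent_since is None:
            return "up"
        if now - self.silent_since < ping_timeout:
            # Grace period: keep fds and locks open and wait for the server to return.
            return "waiting"
        # Grace period expired: the connection and its fds/locks are torn down.
        return "disconnected"


def client_can_proceed(links: List[ReplicaLink], now: float) -> bool:
    """In this model, operations block while any link is in its grace period,
    then resume on whichever links are still up."""
    states = [link.state(now) for link in links]
    return "waiting" not in states and "up" in states


# fw2 goes silent at t=100 s.
links = [ReplicaLink("fw1"), ReplicaLink("fw2", silent_since=100.0)]
print(client_can_proceed(links, 120.0))  # False: still inside the 42 s window
print(client_can_proceed(links, 142.0))  # True: fw2 dropped, fw1 carries on
```

    In this model a client linked to fw1 and fw2 hangs for the first 42
    seconds after fw2 goes quiet, then carries on with fw1 alone, which is
    the "other node can see it after a timeout" symptom.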

    
    With the test process you're describing, the clients are connected
    to both servers' IP addresses (hopefully resolved via hostnames) on
    the same network. When you take a NIC down, that address is no
    longer available: not only can the remote client not connect to it,
    but your local client cannot either, because the address no longer
    exists.

    
    In your real-life scenario, a failed interconnect would not remove
    either machine's IP address, so after the ping-timeout, operations
    would resume in a split-brain configuration, with each side working
    on its own copy. As long as no file is changed on both servers, the
    self-heal will do exactly what you expect when the connection is
    reestablished.

    
    However, what you're counting on is the most common cause of
    split-brain: each client, connected to only one server,
    independently modifies the same file. When the connection is
    reestablished, the self-heal runs and that file is marked as
    split-brain, making it inaccessible from the client mount until an
    administrator resolves it.
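    The difference between a clean heal and split-brain comes down to
    whether one copy can be trusted as the source. The sketch below is a
    deliberately simplified Python model, loosely inspired by AFR's
    pending-operation changelog; the Replica class, its blames_peer
    counter and the self_heal function are hypothetical and do not
    reflect Gluster's on-disk format or heal code. A copy that took writes
    its peer missed accuses the peer: if only one side accuses, it becomes
    the heal source, and if both do, there is no safe source.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Simplified stand-in for one brick's copy of a single file (hypothetical).

    blames_peer counts writes this copy accepted while its peer was unreachable,
    loosely modelled on AFR's pending changelog but not its real format.
    """

    name: str
    blames_peer: int = 0


def write_during_partition(copy: Replica) -> None:
    # The write lands only on this side, so this copy records that its peer missed it.
    copy.blames_peer += 1


def self_heal(a: Replica, b: Replica) -> str:
    if a.blames_peer and b.blames_peer:
        # Each copy says the other is stale: there is no safe source to heal from.
        return "split-brain"
    if a.blames_peer:
        return f"heal {a.name} -> {b.name}"
    if b.blames_peer:
        return f"heal {b.name} -> {a.name}"
    return "in sync"


fw1, fw2 = Replica("fw1"), Replica("fw2")
write_during_partition(fw1)
print(self_heal(fw1, fw2))  # heal fw1 -> fw2
# Had fw2 also written the same file before reconnecting:
write_during_partition(fw2)
print(self_heal(fw1, fw2))  # split-brain
```

    Because the decision is made per file, files changed on only one side
    during the outage still heal normally; only a file written on both
    sides ends up flagged.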

    
    You can avoid split-brain with a couple of quorum techniques; the
    one that would seem to satisfy your requirements leaves your volume
    read-only for the duration of the outage.
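    The options are not named above, but Gluster's client-side quorum is
    configured with cluster.quorum-type (for example auto or fixed) and
    cluster.quorum-count. The Python function below is a hypothetical
    sketch of that decision, assuming it mirrors the documented rules, and
    is not Gluster code: with auto, writes need a strict majority of the
    replica set's bricks, or exactly half provided the first brick is
    among them; with a fixed count, at least that many bricks must be
    reachable. A client that fails the check sees the volume as read-only.

```python
from typing import List, Optional


def writes_allowed(bricks_up: List[bool], quorum_count: Optional[int] = None) -> bool:
    """Hypothetical sketch of a client-quorum check for one replica set.

    bricks_up[i] is True if this client can reach brick i, in volume brick order.
    quorum_count=None models quorum-type 'auto'; an integer models 'fixed'.
    """
    if not bricks_up:
        return False  # no bricks at all: nothing to write to
    up = sum(bricks_up)
    if quorum_count is not None:
        # 'fixed': at least quorum_count bricks must be reachable.
        return up >= quorum_count
    total = len(bricks_up)
    if 2 * up > total:
        return True  # strict majority reachable
    if 2 * up == total:
        return bricks_up[0]  # exactly half: allowed only if the first brick is up
    return False  # minority: this client sees the volume as read-only


# Two-brick replica during an interconnect outage: each client reaches only its own brick.
print(writes_allowed([True, False]))                  # True  (fw1 side keeps writing)
print(writes_allowed([False, True]))                  # False (fw2 side goes read-only)
print(writes_allowed([False, True], quorum_count=2))  # False (fixed count 2: both read-only)
```

    On a two-brick replica like fw1/fw2, auto lets only the side holding
    the first brick keep writing, while a fixed count of 2 leaves both
    sides read-only for the whole outage; either way, the two copies
    cannot diverge.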

  
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users