Re: HA replica

Joe Julian <joe@xxxxxxxxxxxxxxxx> · Wed, 17 Feb 2016 20:03:19 -0800



    On 02/12/2016 12:08 PM, Mike Stump wrote:

    
      Ok. I’m a new user, I want to make an array with 10 machines. I
      want to be able to suffer the loss of any one machine. I
      don’t mind wasting 50% of the disk space to do this. I don’t want
      to suffer split brain. I want the array to support both read and
      write access to data.
      How do I achieve that?
    
    
    What is your acceptable annual downtime (typically outlined in an
    SLA or OLA)? That's a bit of information you should have when you're
    engineering a system.

    
    Split-brain happens when your replication has been partitioned and
    writes have occurred in such a way that no valid copy can be
    discerned. For the sake of example, we're going to use a very simple
    file entitled "file.txt" with the contents of "The quick brown fox
    jumped over the lazy yellow dog." It exists on a replicated volume
    with no protection on a network where a server and client are in the
    west wing, and the replica server and another client are in the east
    wing. Somewhere in the middle, someone pulls the plug on the router.
    The west client can see the west server and the east client can see
    the east server.

    
    The west client updates file.txt changing the word "brown" to "red".
    The east client updates the same file.txt and changes the word
    "brown" to "white".

    
    The router recovers and the two servers try to synchronize any files
    that were changed. They both had changes to file.txt. Which one was
    right?

    
    There's no way to determine that from the information given. That's
    split-brain.
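
    The scenario above can be sketched in a few lines of Python. This is
    only an illustrative model, not how Gluster actually tracks or
    compares changes; the function name classify and the state labels are
    made up for the example.

```python
# Illustrative model only: not Gluster's real change-tracking logic.
ORIGINAL = "The quick brown fox jumped over the lazy yellow dog."


def classify(original, west, east):
    """Classify two replicas of one file after the partition heals."""
    west_changed = west != original
    east_changed = east != original
    if west == east:
        return "in-sync"
    if west_changed and east_changed:
        # Both sides accepted writes; neither copy is authoritative.
        return "split-brain"
    # Only one side changed, so its copy can safely replace the other.
    return "heal-from-west" if west_changed else "heal-from-east"


# Both clients edit the same file while the router is down.
west_copy = ORIGINAL.replace("brown", "red")
east_copy = ORIGINAL.replace("brown", "white")

print(classify(ORIGINAL, west_copy, east_copy))
```

    If only the west client had written, the same check would report
    that the east copy can simply be healed from the west.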

    
    How can you combat split brain? 

    
    One solution is quorum: have enough replicas that comparisons can be
    made. If two servers are in the west and only one is in the east, and
    they have the ability to determine quorum, the east server will not
    allow writes during the network split. It can tell that writing is
    not safe: if all three servers later voted on which change was
    right, the two in the west would win and the east's changes would be
    lost. The two in the west see that one server is lost, but they still
    have quorum. They allow the data to remain available, knowing that
    the out-of-quorum server is safe from changes.
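
    As a rough sketch of that decision, assuming quorum simply means "I
    can see a strict majority of the replicas, counting myself" (the
    helper has_quorum is hypothetical, and Gluster's real quorum options
    are more configurable than this):

```python
def has_quorum(visible, total):
    """True when a server sees a strict majority of replicas, itself included."""
    return 2 * visible > total


TOTAL_REPLICAS = 3

# During the split each west server sees itself plus its west peer;
# the east server sees only itself.
sides = {"west": 2, "east": 1}

for side, visible in sides.items():
    verdict = "writes allowed" if has_quorum(visible, TOTAL_REPLICAS) else "writes refused"
    print(f"{side}: sees {visible} of {TOTAL_REPLICAS}, {verdict}")
```

    With an even split (say 2 of 4) neither side would pass this check,
    which is why an odd number of voters is the usual choice.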

    
    Gluster has the ability to have a minimal quorum participant called
    an arbiter. Let's make the west client an arbiter. The net split
    happens. Only two data replicas exist, one in the west and the other
    in the east. The arbiter can see the west server but not the east.
    The east server can see neither the west server nor the arbiter. The
    east loses quorum, but the west, seeing the arbiter, still has quorum
    and remains available with the safe understanding that the east
    server, not having quorum, will not accept writes.

    
    So with your 10 servers you could have a "replica 3 arbiter 1"
    volume, with one of the three replicas in each set being an arbiter.
    The arbiter brick would only use space for file names and metadata,
    but no actual data. If I were doing it, I would probably do it like
    so:

    
        gluster volume create myvol replica 3 arbiter 1 \
            server1:/brick1 server2:/brick1 server3:/arbiter \
            server3:/brick1 server4:/brick1 server5:/arbiter etc.

    
    Notice how there's both a data directory (/brick1) and an arbiter
    directory (/arbiter) on servers 3, 5, 7... which allows the data
    "waste" that you're asking for while mostly allowing the
    availability you seek. I say mostly because if your network
    partitions, something's got to give or you will lose data. There's
    absolutely no way for disconnected systems to coordinate binary
    changes to each other with today's technology. 

    
    Perhaps, one day, we will have quantum tunneling networks with
    superimposed particles able to teleport data without the need of
    networks, but that's not today. When that is available, I
    expect rainbows and unicorns to be available as well.

  
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users