Re: Avoid split-brains on replica-2

On 05/11/2015 12:46 AM, Christopher Pereira wrote:
Hi,

Using a replica-2 gluster volume with oVirt is currently not supported and causes split-brains, especially due to sanlock.

Replica-1 works fine (since split-brains are impossible here) and replica-3 *should* work, or at least reduce the probability of suffering split-brains (depending on race conditions?).

Replica 3 definitely reduces the probability of split-brains while providing better availability than replica 2. The sanlock bug (https://bugzilla.redhat.com/show_bug.cgi?id=1066996) was fixed quite some time back.
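For context, a replica 3 volume for VM storage is created in the usual way; a minimal sketch, where host names, brick paths, and the volume name are placeholders:

   # gluster volume create vmstore replica 3 \
         host1:/bricks/vmstore host2:/bricks/vmstore host3:/bricks/vmstore
   # gluster volume set vmstore group virt
   # gluster volume start vmstore

The 'group virt' step applies the virt option profile shipped with glusterfs (if present on the system), which sets options commonly recommended for VM image workloads.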


Besides, geo-replication allows replicating a replica-1 volume in order to achieve results similar to replica-2. But since geo-rep uses rsync, I guess it's less optimal than using "replica-n", where I guess blocks are marked as dirty to be replicated. Does geo-rep do the same? How do replica-n and geo-rep compare in a continuous replication scenario? How safe is it to use replica-n or geo-rep for VM images? Will the replicated VM images be mostly consistent compared to a bare-metal sudden power-off?

My guess is that replica-n is safer than geo-rep since it replicates writes synchronously in real time, while geo-rep seems to do an initial scan using rsync, but I'm not sure how it continues replicating after that initial sync.
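For reference, a geo-replication session between a master volume and a remote slave volume is set up roughly like this (volume and host names are placeholders):

   # gluster system:: execute gsec_create
   # gluster volume geo-replication mastervol slavehost::slavevol create push-pem
   # gluster volume geo-replication mastervol slavehost::slavevol start
   # gluster volume geo-replication mastervol slavehost::slavevol status

As of glusterfs 3.5, geo-rep detects changes after the initial crawl via the brick changelogs, with rsync used only to transfer the changed files, so ongoing replication is incremental rather than a repeated full scan.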

Anyway, I would like to ask, discuss or propose the following idea:
- Have an option to tell gluster to write to only one brick (split-brains would be impossible), which would then replicate to the other bricks.
- A local brick (if one exists) should be selected as the "write authority brick".


If that one brick goes down, data previously written cannot be served until the 'sync' to the other bricks has completed. Also, new writes would not be possible until the brick comes back up, no?
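As an aside, client-side quorum can already be used to fail writes rather than risk inconsistency when bricks are down; a minimal sketch (the volume name is a placeholder):

   # gluster volume set myvol cluster.quorum-type fixed
   # gluster volume set myvol cluster.quorum-count 2

With cluster.quorum-type set to 'auto' on a replica 2 volume, writes are allowed only when both bricks are up, or when the first brick alone is up, which trades availability for consistency in much the same way as the 'authority brick' idea above.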

This would increase global write performance, which is currently constrained to the slowest node because writes are replicated synchronously to all other replicas (=> writes are not scalable for replica volumes).


There is an optimization that is on the cards where we write to all bricks of the replica synchronously but we return success to the application as soon as we get success from any one of the bricks. (Currently we wait for replies from all bricks before returning success/failure to the upper translator).



Basically, the idea here is to have an option that avoids split-brains by selecting an authority brick and avoiding synchronous writes. The same goal could be achieved by forcing gluster to resolve *all* split-brains by choosing the authority brick as the winner (?).

Do we currently have an option for doing something like this?


There is an arbiter feature for replica 3 volumes (https://github.com/gluster/glusterfs/blob/master/doc/features/afr-arbiter-volumes.md) being released in glusterfs 3.7 which prevents files from going into split-brain; you could try that out. If a write could cause a split-brain, it is failed with ENOTCONN to the application.
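For reference, the linked doc's syntax for creating an arbiter volume in 3.7 looks like this (host names, brick paths, and the volume name are placeholders):

   # gluster volume create arbvol replica 3 arbiter 1 \
         host1:/bricks/arbvol host2:/bricks/arbvol host3:/bricks/arbvol

The third (arbiter) brick stores only file names and metadata, not data, so it provides split-brain protection at far less than full replica 3 storage cost.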

Regards,
Ravi

Benefits for gluster:
- replica-n won't cause split-brains
- scalability (write performance won't be limited to the slowest node)

Best regards,
Christopher Pereira

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel




