So I have a testbed composed of a simple 2+1 (replica 3 with ARB) volume:
gluster1, gluster2 and gluster-arb (with shards).
My testing involves some libvirt VMs running continuous write fops on a
localhost fuse mount on gluster1.
Works great when all the pieces are there. Once I figured out the shard
tuning, I was really happy with the speed, even with the older kit I was
using for the testbed. Sharding is a huge win.
So for Failure testing I found the following:
If you take down the ARB, the VMs continue to run perfectly and when the
ARB returns it catches up.
However, if you take down Gluster2 (with the ARB still being up) you
often (but not always) get a write lock on one or more of the VMs, until
Gluster2 recovers and heals.
Per the Docs, this Write Lock is evidently EXPECTED behavior with an
Arbiter to avoid a Split-Brain.
As I understand it, if the Arb thinks that it knows about (and agrees
with) data that exists on Gluster2 (now down) that should be written to
Gluster1, it will write lock the volume because the ARB itself doesn't
have that data and going forward is problematic until Gluster2's data
is back in the cluster and can bring the volume back into proper sync.
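To make my reading of the docs concrete, here is a minimal Python sketch of
the per-file rule as I understand it: with two of the three bricks up, a
write proceeds unless the arbiter is one of the two and it blames the other
up brick (i.e. it has pending heal markers against it). The function name,
the brick labels and the `arbiter_blames` set are my own illustrative
assumptions; this is a simplified model, not Gluster's actual AFR code.

```python
def write_allowed(up_bricks, arbiter="gluster-arb", arbiter_blames=frozenset()):
    """Model whether a write to ONE file may proceed on a 2+1 arbiter set.

    up_bricks:      names of the bricks currently reachable.
    arbiter_blames: data bricks the arbiter has marked as stale for this
                    file (hypothetical stand-in for its pending-heal markers).
    """
    up = set(up_bricks)
    if len(up) == 3:
        return True          # everything is up, nothing to arbitrate
    if len(up) < 2:
        return False         # client quorum (2 of 3) is lost
    if arbiter in up:
        # Exactly two bricks up and one of them is the arbiter: the only
        # data copy left must not be the one the arbiter blames.
        other = (up - {arbiter}).pop()
        return other not in arbiter_blames
    return True              # both data bricks up, arbiter down


# Gluster2 down, ARB up, and ARB knows Gluster1 missed a write:
print(write_allowed({"gluster1", "gluster-arb"}, arbiter_blames={"gluster1"}))
```

If this model is right, it would explain why the lock shows up "often but
not always": it depends on whether the ARB happens to blame Gluster1 for
that particular file (shard) at the moment Gluster2 goes away.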
OK, that is the reality of using a Rep2 + ARB versus a true Rep3
environment. You get Split-Brain protection but not much increase in HA
over old school Replica 2.
So I have some questions:
a) In the event that gluster2 had died and we have entered this write
lock phase, how does one go forward if the Gluster2 outage can't be
immediately (or remotely) resolved?
At that point I have some hung VMs and annoyed users.
The current quorum settings are:
# gluster volume get VOL all | grep 'quorum'
cluster.quorum-type             auto
cluster.quorum-count            2
cluster.server-quorum-type      server
cluster.server-quorum-ratio     0
cluster.quorum-reads            no
Do I simply kill the quorum so that the VMs will continue where they
left off?
gluster volume set VOL cluster.server-quorum-type none
gluster volume set VOL cluster.quorum-type none
If I do so, should I also kill the ARB (before or after), or leave it up?
Or should I switch to quorum-type fixed with a quorum count of 1?
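For reference, this is how I understand the three cluster.quorum-type
modes, as a rough Python sketch. It only counts bricks and deliberately
ignores the arbiter's extra per-file rule; the function and parameter names
are mine, not anything from the Gluster source.

```python
def client_quorum_met(quorum_type, up_count, replica_count=3,
                      quorum_count=2, first_brick_up=True):
    """Brick-count model of cluster.quorum-type (illustrative only)."""
    if quorum_type == "none":
        # No quorum enforcement: any reachable brick is enough.
        return up_count >= 1
    if quorum_type == "fixed":
        # Writes need at least cluster.quorum-count bricks up.
        return up_count >= quorum_count
    if quorum_type == "auto":
        # Writes need more than half of the bricks up; with exactly half
        # up (even replica counts), the first brick must be among them.
        if 2 * up_count > replica_count:
            return True
        return 2 * up_count == replica_count and first_brick_up
    raise ValueError(f"unknown quorum type: {quorum_type!r}")


# Only gluster1 left standing:
for mode, count in (("auto", 2), ("fixed", 1), ("none", 2)):
    print(mode, count, client_quorum_met(mode, up_count=1, quorum_count=count))
```

In this model both "fixed" with a count of 1 and "none" keep writes going
on a single brick, which is exactly the split-brain exposure I am trying
to weigh against the hung VMs.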
b) If I WANT to take down Gluster2 for maintenance, how do I prevent the
quorum write-lock from occurring?
I suppose I could fiddle with the quorum settings as above, but I'd like
to be able to PAUSE/FLUSH/FSYNC the Volume before taking down Gluster2,
then unpause and let the volume continue with Gluster1 and the ARB
providing some sort of protection and to help when Gluster2 is returned
to the cluster.
c) Does any of the above behaviour change when I switch to GFAPI?
Sincerely
-bill