Re: Quorum in distributed-replicate volume

On Mon, Feb 26, 2018 at 6:14 PM, Dave Sherohman <dave@xxxxxxxxxxxxx> wrote:
On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > "In a replica 2 volume... If we set the client-quorum option to
> > auto, then the first brick must always be up, irrespective of the
> > status of the second brick. If only the second brick is up, the
> > subvolume becomes read-only."
> >
> By default, client-quorum is "none" in a replica 2 volume.

I'm not sure where I saw the directions saying to set it, but I do have
"cluster.quorum-type: auto" in my volume configuration.  (And I think
that's client quorum, but feel free to correct me if I've misunderstood
the docs.)
If it is "auto" then I think it is reconfigured. In replica 2 it will be "none".

> It applies to all replica 2 volumes, whether the volume has just 2 bricks or more.
> The total brick count in the volume doesn't matter for quorum; what matters
> is the number of bricks which are up in the particular replica subvol.

Thanks for confirming that.

> If I understood your configuration correctly it should look something like
> this:
> (Please correct me if I am wrong)
> replica-1:  bricks 1 & 2
> replica-2: bricks 3 & 4
> replica-3: bricks 5 & 6

Yes, that's correct.

> Since quorum is per replica subvol, if it is set to auto, then the first
> brick of that subvol must be up to perform the fop (file operation).
>
> In replica 2 volumes you can end up in split-brains.

How would that happen if bricks which are not in (cluster-wide) quorum
refuse to accept writes?  I'm not seeing the reason for using individual
subvolume quorums instead of full-volume quorum.
Split-brains happen within a replica pair.
I will try to explain how you can end up in split-brain even with cluster-wide quorum:
Let's say you have a 6-brick (replica 2) volume, and at least a quorum number of bricks is always up and running.
Bricks 1 & 2 are part of replica subvol-1
Bricks 3 & 4 are part of replica subvol-2
Bricks 5 & 6 are part of replica subvol-3

- Brick 1 goes down and a write comes in on a file which is part of replica subvol-1
- Quorum is met, since 5 out of 6 bricks are running
- Brick 2 records that brick 1 is bad (it has pending writes for it)
- Brick 2 goes down and brick 1 comes up. Heal did not happen
- A write comes in on the same file; quorum is met, and now brick 1 records that brick 2 is bad
- When both bricks 1 & 2 are up again, each blames the other - *split-brain*
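
If you ever end up in that state, the affected files show up in the heal info output:

    gluster volume heal <volname> info split-brain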

> It would be great if you could consider configuring an arbiter or
> replica 3 volume.

I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
bricks as arbiters with minimal effect on capacity.  What would be the
sequence of commands needed to:

1) Move all data off of bricks 1 & 2
2) Remove that replica from the cluster
3) Re-add those two bricks as arbiters
 
(And did I miss any additional steps?)

Unfortunately, I've been running a few months already with the current
configuration and there are several virtual machines running off the
existing volume, so I'll need to reconfigure it online if possible.
Without knowing the volume configuration, it is difficult to suggest a configuration change,
and since it is a live system you could end up with data unavailability or data loss.
Can you give the output of "gluster volume info <volname>"
and tell us which brick is of what size?
Note: The arbiter bricks need not be as big as the data bricks, since they store only file names and metadata.
[1] gives information about how you can provision the arbiter brick.

[1] http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#arbiter-bricks-sizing
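
Just to illustrate the general shape of the procedure (the brick paths here are hypothetical, and the exact syntax should be verified against your gluster version before touching the live volume), it would look something like:

    # 1) Migrate the data off the replica pair and remove it from the volume
    gluster volume remove-brick <volname> server1:/brick1 server2:/brick2 start
    gluster volume remove-brick <volname> server1:/brick1 server2:/brick2 status   # wait for "completed"
    gluster volume remove-brick <volname> server1:/brick1 server2:/brick2 commit

    # 2) Re-add the freed bricks as arbiters, one per remaining replica subvol
    gluster volume add-brick <volname> replica 3 arbiter 1 server1:/arbiter1 server2:/arbiter2

But please share the volume info first.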

Regards,
Karthik

--
Dave Sherohman

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users
