Re: [Release-8] Thin-Arbiter: Unique-ID requirement

Amar Tumballi <amar@xxxxxxxxx> · Tue, 4 Feb 2020 23:07:35 +0530

On Tue, Jan 14, 2020 at 2:37 PM Atin Mukherjee <atin.mukherjee83@xxxxxxxxx> wrote:
From a design perspective, option 2 is the better choice. However, I'd like to see a design for how the cluster id will be generated and maintained (peer addition/deletion scenarios, node replacement, etc.).

Thanks for the feedback, Atin.

On Tue, Jan 14, 2020 at 1:42 PM Amar Tumballi <amar@xxxxxxxxx> wrote:
Hello,
As we gear up for Release-8 and its planning, I wanted to bring up one of my favorite topics: 'Thin-Arbiter' (or Tie-Breaker, Metro-Cluster, etc.).

Thin-arbiter was released in v7.0 itself, and it works great when there is just 1 gluster cluster. Here I am talking about a situation that involves multiple gluster clusters, and easier management of thin-arbiter nodes. (Ref: https://github.com/gluster/glusterfs/issues/763)

I am working toward hosting a thin-arbiter node service (free of cost) that any gluster deployment can connect to, saving the cost of the additional replica that is required today to avoid split-brain. The storage and process needs of a tie-breaker are so small that one machine could easily handle all gluster deployments to date. When I looked at the code with this goal in mind, I found that the current implementation doesn't support it, mainly because it uses 'volumename' in the file it creates. That is fine for 1 cluster, since we don't allow duplicate volume names within a single cluster, and it is OK for multiple clusters only as long as their volume names don't collide.
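To make the collision concrete, here is a rough sketch in Python (not glusterfs code). The `SharedThinArbiter` class, the `name_from_volname` helper and the `<volname>-ta` name format are all made up for illustration; the only thing taken from the description above is that the file name is derived from the volume name alone.

```python
class SharedThinArbiter:
    """Toy stand-in for one thin-arbiter node shared by many clusters."""

    def __init__(self):
        # file name -> cluster that created it first
        self.owners = {}

    def create(self, cluster, file_name):
        # Returns True if this cluster owns the file, False if another
        # cluster already created a file with the same name.
        owner = self.owners.setdefault(file_name, cluster)
        return owner == cluster


def name_from_volname(volname):
    # Hypothetical format: nothing in the name identifies the cluster.
    return f"{volname}-ta"


arbiter = SharedThinArbiter()
first = arbiter.create("cluster-A", name_from_volname("data"))
second = arbiter.create("cluster-B", name_from_volname("data"))  # same volume name
third = arbiter.create("cluster-B", name_from_volname("logs"))   # different name
print(first, second, third)
```

The second call is the problematic one: two unrelated clusters end up pointing at the same file on the shared node.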

To resolve this properly and make it a truly global service, I currently see 2 options.

1. Add a 'volume-id' option to the afr volume itself, so that each instance picks up the volume-id and uses it in the thin-arbiter name. A variant of this is submitted for review - https://review.gluster.org/23723 - but because it takes the volume-id from io-stats, that particular patch fails in brick-mux and shd-mux scenarios. A proper enhancement of the patch is to provide the 'volume-id' option in AFR itself, so that glusterd (while generating volfiles) sends the proper vol-id to each instance.

Pros: Minimal code changes to the above patch.
Cons: One more option to AFR (not exposed to users).

2. Add a cluster-id to glusterd, and pass it to all processes. Replicate can then use it in the thin-arbiter file. This too will solve the issue.

Pros: A cluster-id is good to have in any distributed system, especially when there are deployments of 3 nodes each in different clusters. Identifying bricks and services as part of a cluster becomes easier.

Cons: More code changes, and they are in the glusterd component.

On another note, option 1 above is purely for the Thin-Arbiter feature, whereas option 2 would also be useful in debugging and in other solutions that involve multiple clusters.

Let me know what you all think about this. It would be good to discuss this in next week's meeting and take it to completion.

After some more code reading, and thinking about possible solutions, I found that there is another, simpler way to resolve this for multiple clusters.

Currently, the thin-arbiter file name for a replica-set is taken from the 3rd entry (i.e., index=2) of the 'pending-xattr' key in the volume file. If we make that entry unique (say, volume-id + index-of-replica-set), the problem is solved. This needs minimal code change in glusterfs (actually, no change in the filesystem part, only in glusterd-volgen.c).
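As a rough sketch of the idea in Python (this is not the actual glusterd-volgen.c change): the function name `pending_xattr_entries`, the `<volname>-client-N` form of the first two entries and the `<volume-id>-ta-<index>` form of the third are assumptions made for illustration. Only the rule itself, third entry = volume-id + index-of-replica-set, comes from the description above.

```python
import uuid
from typing import List


def pending_xattr_entries(volname: str, volume_id: uuid.UUID, replica_set: int) -> List[str]:
    """Build the three pending-xattr entries for one replica-set (illustrative)."""
    base = replica_set * 2  # assumption: two data bricks per replica-set
    return [
        f"{volname}-client-{base}",
        f"{volname}-client-{base + 1}",
        # The 3rd entry (index=2) is what names the thin-arbiter file.
        # Using the volume-id instead of the volume name keeps it unique
        # across clusters.
        f"{volume_id}-ta-{replica_set}",
    ]


# Two clusters with the same volume name but different volume-ids.
vol_a = uuid.UUID("11111111-1111-1111-1111-111111111111")
vol_b = uuid.UUID("22222222-2222-2222-2222-222222222222")

ta_a = pending_xattr_entries("data", vol_a, 0)[2]
ta_b = pending_xattr_entries("data", vol_b, 0)[2]
ta_a_second_set = pending_xattr_entries("data", vol_a, 1)[2]
print(ta_a, ta_b, ta_a_second_set)
```

The same volume name in two clusters now gives different third entries, and the replica-set index keeps replica-sets within one volume apart.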

I tried this approach while adding the replica2 option to the kadalu.io project. The tests are running fine, and the goal was met.

<snip>
 I am working with a goal of hosting a thin-arbiter node service (free of cost), for which any gluster deployment can connect, and save their cost of an additional replica, which is required today to not get into split-brain situation. 
</snip>

I am happy to say that this goal is achieved. We now have `tie-breaker.kadalu.io:/mnt`, an instance in the cloud, for anyone who wants to use a thin-arbiter. If you are not keen to deploy your own instance, you can use this one as your thin-arbiter. Note that if you are using glusterfs releases, you may want to wait for patch https://review.gluster.org/24096 to make it into a release (probably 7.3/7.4) before using this in production. Until then, volume files generated by glusterd volgen still use the volume name itself in pending-xattr, so file collisions are possible.

Regards,
Amar
---
https://kadalu.io
Storage made easy for k8s

_______________________________________________

Community Meeting Calendar:

APAC Schedule -

Every 2nd and 4th Tuesday at 11:30 AM IST

Bridge: https://bluejeans.com/441850968

NA/EMEA Schedule -

Every 1st and 3rd Tuesday at 01:00 PM EDT

Bridge: https://bluejeans.com/441850968

Gluster-devel mailing list

Gluster-devel@xxxxxxxxxxx

https://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
https://kadalu.io
Container Storage made easy!
