Re: Single Point of failure in geo Replication

Aravinda Vishwanathapura Krishna Murthy <avishwan@xxxxxxxxxx> · Thu, 17 Oct 2019 21:03:43 +0530

On Thu, Oct 17, 2019 at 11:44 AM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Thank you for your response.
We have tried the use case you mentioned.
Case 1: Primary node is permanently down (hardware failure)
In this case, the geo-replication session cannot be stopped; the stop command fails with a message like "start the primary node and then stop" (or similar).
Now I cannot delete the session because I cannot stop it.

Please try "stop force" and let us know if that works.
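
For reference, a forced stop would look roughly like the sketch below. The master volume name and slave volume name are placeholders (assumptions, since the thread does not give them); Node4 is the slave node from the status output quoted further down. The helper only prints the command so it can be reviewed before it is run on a master node.

```shell
# Placeholders (assumptions): substitute your own volume names.
MASTER_VOL="mastervol"
SLAVE_HOST="Node4"
SLAVE_VOL="slavevol"

# Hypothetical helper: prints the forced-stop command instead of running it.
georep_stop_force_cmd() {
    printf 'gluster volume geo-replication %s %s::%s stop force\n' "$1" "$2" "$3"
}

georep_stop_force_cmd "$MASTER_VOL" "$SLAVE_HOST" "$SLAVE_VOL"
```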

On Thu, Oct 17, 2019 at 8:32 AM Aravinda Vishwanathapura Krishna Murthy <avishwan@xxxxxxxxxx> wrote:

On Wed, Oct 16, 2019 at 11:08 PM deepu srinivasan <sdeepugd@xxxxxxxxx> wrote:
Hi Users,
Is there a single point of failure in geo-replication for Gluster?
My case:
I use 3 nodes in both the master and slave volumes.
Master volume : Node1,Node2,Node3
Slave Volume : Node4,Node5,Node6
I tried to recreate the scenario to test a single point of failure.

Geo-Replication Status:

Master Node    Slave Node    Status
Node1          Node4         Active
Node2          Node4         Passive
Node3          Node4         Passive

Step 1: Stopped the glusterd daemon on Node4.
Result: The status output showed only two nodes, as below.

Master Node    Slave Node    Status
Node2          Node4         Passive
Node3          Node4         Passive

Will the geo-replication session go down if the primary slave node is down?

Hi Deepu,

Geo-replication depends on a primary slave node to get information about the other nodes that are part of the Slave Volume.

Once the workers are started, they no longer depend on the primary slave node, and they will not fail if the primary goes down. But if any other node goes down, the affected worker will try to connect to a different node. To find one, it runs the volume status command on the slave side using the following command.

```
ssh -i <georep-pem> <primary-node> gluster volume status <slavevol>
```

While the primary node is down, the above command will fail and the worker will not get the list of Slave nodes it can connect to.

This is only a temporary failure until the primary node comes back online. If the primary node is permanently down, run the Geo-rep delete and Geo-rep create commands again with the new primary node. (Note: Geo-rep delete and create will remember the last sync time, and syncing resumes from there once the session starts.)
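
The delete-and-recreate sequence could be sketched as below. This is a dry run that only prints the commands; the volume names and the choice of Node5 as the new primary are assumptions for illustration. The `push-pem` and `force` options on create are shown as one common way to set up the session, so please check them against the documentation for your Gluster version. If your version's delete command offers a `reset-sync-time` option, leave it out so that the last sync time is kept.

```shell
# Hypothetical helper: prints the commands for moving a session to a new
# primary slave node. Arguments are placeholders chosen by the caller.
recreate_georep_cmds() {
    master_vol=$1
    old_primary=$2
    new_primary=$3
    slave_vol=$4
    # Stop and delete the session that points at the dead primary node.
    echo "gluster volume geo-replication ${master_vol} ${old_primary}::${slave_vol} stop force"
    echo "gluster volume geo-replication ${master_vol} ${old_primary}::${slave_vol} delete"
    # Create and start the session again with the new primary node.
    echo "gluster volume geo-replication ${master_vol} ${new_primary}::${slave_vol} create push-pem force"
    echo "gluster volume geo-replication ${master_vol} ${new_primary}::${slave_vol} start"
}

# Example with assumed names: Node4 is the failed primary, Node5 the new one.
recreate_georep_cmds mastervol Node4 Node5 slavevol
```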

I will evaluate the possibility of caching a list of Slave nodes so that one of them can be used as a backup primary node in case of failures. I will open a GitHub issue for the same.
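
To illustrate the idea only (this is not how Geo-replication is implemented today, and every name here is hypothetical), a fallback over a cached node list could look like the sketch below: try each cached Slave node in turn and use the first one that answers the volume status command.

```shell
# Hypothetical sketch: return the first node for which the probe succeeds.
# $1 is the name of a probe command or function; the rest are cached nodes.
first_reachable_node() {
    probe=$1
    shift
    for node in "$@"; do
        if "$probe" "$node" >/dev/null 2>&1; then
            echo "$node"
            return 0
        fi
    done
    return 1   # no cached node answered
}

# Probe modelled on the ssh command quoted earlier in this mail.
# GEOREP_PEM and SLAVE_VOL are placeholders that the caller must set.
probe_slave() {
    ssh -i "$GEOREP_PEM" "$1" gluster volume status "$SLAVE_VOL"
}

# Usage (not run here): first_reachable_node probe_slave Node4 Node5 Node6
```

Passing the probe as an argument keeps the loop independent of ssh, so the selection logic can be tried out with a stub before pointing it at real nodes.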

Thanks for reporting the issue.

-- 
regards
Aravinda VK

________

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users