Hi All,

The logic discussed in the previous mail thread is not feasible. So, to
solve the Active/Passive switching in geo-replication, the following new
idea has been thought of.

1. Have a shared storage: a glusterfs management volume specific to
   geo-replication.

2. Use an fcntl lock on a file stored on the above shared volume. There
   will be one file per replica set.

Each worker tries to lock the file on the shared storage; whichever worker
wins the lock will be ACTIVE. This solves the problem, but there is an
issue when the shared storage goes down (if it is a replica volume, when
all replicas go down). In that case, the lock state is lost.

But if we use sanlock, as oVirt does, I think the above problem of the
lock state being lost could be solved?
https://fedorahosted.org/sanlock/

If anybody has used sanlock, is it a good option in this respect?
Please share your thoughts and suggestions on this.

Thanks and Regards,
Kotresh H R

----- Original Message -----
> From: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx>
> To: gluster-devel@xxxxxxxxxxx
> Sent: Monday, December 22, 2014 10:53:34 AM
> Subject: A new approach to solve Geo-replication ACTIVE/PASSIVE switching in distributed replicate setup!
>
> Hi All,
>
> Current Design and its limitations:
>
> Geo-replication syncs changes across geography using changelogs
> captured by the changelog translator. The changelog translator sits on
> the server side, just above the posix translator. Hence, in a
> distributed replicated setup, both bricks of a replica pair collect
> changelogs w.r.t. their own brick. Geo-replication syncs the changes
> using only one brick of the replica pair at a time, calling it "ACTIVE"
> and the other, non-syncing brick "PASSIVE".
>
> Consider the example below of a distributed replicated setup, where
> NODE-1 has brick b1 and its replica brick b1r is on NODE-2.
>
>         NODE-1         NODE-2
>           b1            b1r
>
> At the beginning, geo-replication chooses to sync changes from
> NODE-1:b1, and NODE-2:b1r will be "PASSIVE".
> The logic depends on the virtual getxattr
> 'trusted.glusterfs.node-uuid', which always returns the first up
> subvolume, i.e., NODE-1.
> When NODE-1 goes down, the above xattr returns NODE-2, and that is made
> 'ACTIVE'.
> But when NODE-1 comes back, the above xattr returns NODE-1 again and it
> is made 'ACTIVE' again. So for a brief interval, if NODE-2 has not
> finished processing the changelog, both NODE-2 and NODE-1 will be
> ACTIVE, causing the rename race described in the bug below.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1140183
>
>
> SOLUTION:
> Don't make NODE-2 'PASSIVE' when NODE-1 comes back; keep it 'ACTIVE'
> until NODE-2 goes down.
>
>
> APPROACH TO SOLVE WHICH I CAN THINK OF:
>
> Have a distributed store holding a file that records the bricks which
> are active. When a NODE goes down, the file is updated with its replica
> bricks, making sure that, at any point in time, the file lists all the
> bricks to be made active. A geo-replication worker process is made
> 'ACTIVE' only if its brick is in the file.
>
> Implementation can be done in two ways:
>
> 1. Have a distributed store for the above implementation. This needs to
>    be thought through, as a distributed store is not in place in
>    glusterd yet.
>
> 2. The other solution is to store the list in a file similar to the
>    existing glusterd global configuration file
>    (/var/lib/glusterd/options). When this file is updated, its version
>    number is incremented. When a node which has gone down comes back
>    up, it gets this file from its peers if its version number is less
>    than that of the peers.
>
>
> I did a POC with the second approach, storing the list of active bricks
> as 'NodeUUID:brickpath' in the options file itself. It seems to work
> fine, except for a bug in glusterd where the daemons are spawned before
> the node gets the 'options' file from the other node during handshake.
>
> CHANGES IN GLUSTERD:
> When a node goes down, all the other nodes are notified through
> glusterd_peer_rpc_notify, where it needs to find the replicas of the
> node which went down and update the global file.
>
> PROBLEMS/LIMITATIONS WITH THIS APPROACH:
> 1. If glusterd is killed and the node is still up, this makes the other
>    replica 'ACTIVE'. So both replica bricks will be syncing at this
>    point in time, which is not expected.
>
> 2. If a single brick process is killed, its replica brick is not made
>    'ACTIVE'.
>
>
> Glusterd/AFR folks,
>
> 1. Do you see a better approach than the above to solve this issue?
> 2. Is this approach feasible? If yes, how can I handle the problems
>    mentioned above?
> 3. Is this approach feasible from a scalability point of view, since
>    the complete list of active brick paths is stored and read by
>    gsyncd?
> 4. Does this approach fit into three-way replication and erasure
>    coding?
>
>
>
> Thanks and Regards,
> Kotresh H R
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel
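
To make the fcntl-lock proposal at the top of this mail a bit more concrete, here is a minimal sketch in Python of how a worker could decide between ACTIVE and PASSIVE. It is only an illustration under assumptions: the function name try_become_active and the lock file path are hypothetical, and it does not address the case where the shared storage itself goes down.

```python
import errno
import fcntl
import os


def try_become_active(lock_path):
    """Try to take an exclusive, non-blocking fcntl lock on lock_path.

    Returns the open file descriptor if the lock was acquired (the caller
    is ACTIVE and must keep the descriptor open for as long as it wants to
    stay ACTIVE), or None if another process already holds the lock (the
    caller stays PASSIVE). The name and return convention are hypothetical.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # lockf() with LOCK_NB fails immediately instead of blocking
        # when a conflicting lock is held by another process.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        if e.errno in (errno.EACCES, errno.EAGAIN):
            return None  # somebody else is ACTIVE for this replica set
        raise  # any other error is unexpected; let the caller see it
    return fd
```

Each worker would call this with the per-replica-set file on the shared volume mount and retry periodically while it gets None. Note that fcntl locks are owned by the process: the lock is dropped when the holder exits or closes any descriptor it has open on that file, so the ACTIVE worker has to keep the returned descriptor open.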