Hi All,

CURRENT DESIGN AND ITS LIMITATIONS:

Geo-replication syncs changes across geographies using changelogs captured by the changelog translator. The changelog translator sits on the server side, just above the posix translator. Hence, in a distributed replicated setup, both bricks of a replica pair collect changelogs for their respective bricks. Geo-replication syncs changes using only one brick of the replica pair at a time, calling it "ACTIVE" and the other, non-syncing brick "PASSIVE".

Consider the following distributed replicated setup, where brick b1 is on NODE-1 and its replica b1r is on NODE-2:

    NODE-1    NODE-2
      b1       b1r

At the beginning, geo-replication chooses to sync changes from NODE-1:b1, and NODE-2:b1r is "PASSIVE". The logic depends on the virtual getxattr 'trusted.glusterfs.node-uuid', which always returns the first up subvolume, i.e., NODE-1. When NODE-1 goes down, the above xattr returns NODE-2, which is then made 'ACTIVE'. But when NODE-1 comes back up, the xattr returns NODE-1 again and NODE-1 is made 'ACTIVE' again. So for a brief interval, if NODE-2 has not finished processing its changelog, both NODE-2 and NODE-1 are ACTIVE, causing a rename race as described here:

https://bugzilla.redhat.com/show_bug.cgi?id=1140183

SOLUTION:

Do not make NODE-2 'PASSIVE' when NODE-1 comes back up; keep NODE-2 'ACTIVE' until NODE-2 itself goes down.

APPROACH TO SOLVE THIS WHICH I CAN THINK OF:

Maintain a distributed store (a file) that captures the bricks which are currently active. When a node goes down, the file is updated with its replica bricks, making sure that at any point in time the file holds all the bricks that should be active. A geo-replication worker process is made 'ACTIVE' only if its brick is in the file.

The implementation can be done in two ways:

1. Have a proper distributed store for the above. This needs more thought, as a distributed store is not in place in glusterd yet.

2. Store the list in a file similar to the existing glusterd global configuration file (/var/lib/glusterd/options).
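To make the second option concrete, here is a minimal Python sketch of its moving parts, under illustrative assumptions: a worker goes ACTIVE only if its 'NodeUUID:brickpath' entry is present in the shared options-style file, the file carries a version that is bumped on every update, and a node-down event swaps in the replica brick. All names below (parse_active_bricks, worker_state, on_peer_down, reconcile, the 'active-bricks' key) are hypothetical, not actual glusterd/gsyncd APIs.

```python
def parse_active_bricks(options_text):
    """Parse the comma-separated 'active-bricks' value (entries of the
    form NodeUUID:brickpath) out of an options-style key=value file.
    The 'active-bricks' key is an assumed format, not a real one."""
    for line in options_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "active-bricks":
            return set(e.strip() for e in value.split(",") if e.strip())
    return set()

def worker_state(node_uuid, brick_path, options_text):
    """A geo-rep worker is 'ACTIVE' only if its brick is listed."""
    entry = "%s:%s" % (node_uuid, brick_path)
    return "ACTIVE" if entry in parse_active_bricks(options_text) else "PASSIVE"

def on_peer_down(store, down_entry, replica_entry):
    """When the node owning down_entry goes down, activate its replica
    brick and bump the version so peers pick up the newer copy.
    'store' is a dict: {'version': int, 'active': set of entries}."""
    if down_entry in store["active"]:
        store["active"].discard(down_entry)
        store["active"].add(replica_entry)
        store["version"] += 1
    return store

def reconcile(local, peer):
    """On peer handshake, keep whichever copy has the higher version."""
    return peer if peer["version"] > local["version"] else local
```

Note that in this scheme the returning node never re-enters the active list on its own: it only rejoins when its replica goes down and on_peer_down swaps it back in, which is exactly the "stay PASSIVE until the other brick fails" behaviour proposed above.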
Whenever this file is updated, its version number is incremented. When a node that went down comes back up, it gets this file from its peers during handshake if its version number is less than that of the peers.

I did a POC with the second approach, storing the list of active bricks as 'NodeUUID:brickpath' entries in the options file itself. It seems to work fine, except for a bug in glusterd where the daemons get spawned before the node receives the 'options' file from the other node during handshake.

CHANGES IN GLUSTERD:

When a node goes down, all the other nodes are notified through glusterd_peer_rpc_notify, where glusterd needs to find the replicas of the node that went down and update the global file.

PROBLEMS/LIMITATIONS WITH THIS APPROACH:

1. If glusterd is killed while the node is still up, the other replica is made 'ACTIVE'. So both replica bricks will be syncing at that point, which is not expected.

2. If a single brick process is killed, its replica brick is not made 'ACTIVE'.

Glusterd/AFR folks,

1. Do you see a better approach than the above to solve this issue?

2. Is this approach feasible? If yes, how can I handle the problems mentioned above?

3. Is this approach feasible from a scalability point of view, since the complete list of active brick paths is stored and read by gsyncd?

4. Does this approach fit into three-way replication and erasure coding?

Thanks and Regards,
Kotresh H R
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel