Hi All,

CURRENT DESIGN AND ITS LIMITATIONS:

Geo-replication syncs changes across geographies using changelogs captured by the changelog translator. The changelog translator sits on the server side, just above the posix translator. Hence, in a distributed replicated setup, both bricks of a replica pair collect changelogs for their respective bricks. Geo-replication syncs changes using only one brick of the replica pair at a time, calling it "ACTIVE" and the other, non-syncing brick "PASSIVE".

Consider the following distributed replicated setup, where brick b1 is on NODE-1 and its replica b1r is on NODE-2:

    NODE-1    NODE-2
      b1       b1r

At the beginning, geo-replication chooses to sync changes from NODE-1:b1, and NODE-2:b1r is "PASSIVE". The logic depends on the virtual getxattr 'trusted.glusterfs.node-uuid', which always returns the first up subvolume, i.e., NODE-1. When NODE-1 goes down, the above xattr returns NODE-2, which is then made 'ACTIVE'. But when NODE-1 comes back up, the xattr returns NODE-1 again and NODE-1 is made 'ACTIVE' again. So for a brief interval, if NODE-2 has not finished processing its changelog, both NODE-2 and NODE-1 are ACTIVE, causing a rename race as described here:

https://bugzilla.redhat.com/show_bug.cgi?id=1140183

SOLUTION:

Do not make NODE-2 'PASSIVE' when NODE-1 comes back up; keep NODE-2 'ACTIVE' until NODE-2 itself goes down.

APPROACH TO SOLVE THIS WHICH I CAN THINK OF:

Maintain a distributed store (a file) that captures the bricks which are currently active. When a node goes down, the file is updated with its replica bricks, making sure that at any point in time the file holds all the bricks that should be active. A geo-replication worker process is made 'ACTIVE' only if its brick is in the file.

The implementation can be done in two ways:

1. Have a proper distributed store for the above. This needs more thought, as a distributed store is not in place in glusterd yet.

2. Store the list in a file similar to the existing glusterd global configuration file (/var/lib/glusterd/options).
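To make the second option concrete, here is a minimal Python sketch of its moving parts, under illustrative assumptions: a worker goes ACTIVE only if its 'NodeUUID:brickpath' entry is present in the shared options-style file, the file carries a version that is bumped on every update, and a node-down event swaps in the replica brick. All names below (parse_active_bricks, worker_state, on_peer_down, reconcile, the 'active-bricks' key) are hypothetical, not actual glusterd/gsyncd APIs.

```python
def parse_active_bricks(options_text):
    """Parse the comma-separated 'active-bricks' value (entries of the
    form NodeUUID:brickpath) out of an options-style key=value file.
    The 'active-bricks' key is an assumed format, not a real one."""
    for line in options_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "active-bricks":
            return set(e.strip() for e in value.split(",") if e.strip())
    return set()

def worker_state(node_uuid, brick_path, options_text):
    """A geo-rep worker is 'ACTIVE' only if its brick is listed."""
    entry = "%s:%s" % (node_uuid, brick_path)
    return "ACTIVE" if entry in parse_active_bricks(options_text) else "PASSIVE"

def on_peer_down(store, down_entry, replica_entry):
    """When the node owning down_entry goes down, activate its replica
    brick and bump the version so peers pick up the newer copy.
    'store' is a dict: {'version': int, 'active': set of entries}."""
    if down_entry in store["active"]:
        store["active"].discard(down_entry)
        store["active"].add(replica_entry)
        store["version"] += 1
    return store

def reconcile(local, peer):
    """On peer handshake, keep whichever copy has the higher version."""
    return peer if peer["version"] > local["version"] else local
```

Note that in this scheme the returning node never re-enters the active list on its own: it only rejoins when its replica goes down and on_peer_down swaps it back in, which is exactly the "stay PASSIVE until the other brick fails" behaviour proposed above.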
Whenever this file is updated, its version number is incremented. When a node that went down comes back up, it gets this file from its peers during handshake if its version number is less than that of the peers.

I did a POC with the second approach, storing the list of active bricks as 'NodeUUID:brickpath' entries in the options file itself. It seems to work fine, except for a bug in glusterd where the daemons get spawned before the node receives the 'options' file from the other node during handshake.

CHANGES IN GLUSTERD:

When a node goes down, all the other nodes are notified through glusterd_peer_rpc_notify, where glusterd needs to find the replicas of the node that went down and update the global file.

PROBLEMS/LIMITATIONS WITH THIS APPROACH:

1. If glusterd is killed while the node is still up, the other replica is made 'ACTIVE'. So both replica bricks will be syncing at that point, which is not expected.

2. If a single brick process is killed, its replica brick is not made 'ACTIVE'.

Glusterd/AFR folks,

1. Do you see a better approach than the above to solve this issue?

2. Is this approach feasible? If yes, how can I handle the problems mentioned above?

3. Is this approach feasible from a scalability point of view, since the complete list of active brick paths is stored and read by gsyncd?

4. Does this approach fit into three-way replication and erasure coding?

Thanks and Regards,
Kotresh H R
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel