Re: geo-rep regression because of node-uuid change

Karthik Subrahmanya <ksubrahm@xxxxxxxxxx> · Tue, 20 Jun 2017 16:27:17 +0530

On Tue, Jun 20, 2017 at 4:12 PM, Aravinda <avishwan@xxxxxxxxxx> wrote:
I think the following format can be easily adopted by all components.

UUIDs within a subvolume are separated by spaces, and subvolumes are separated by commas.

For example, if node1 and node2 form a replica pair with UUIDs U1 and U2, and node3 and node4 form a replica pair with UUIDs U3 and U4, then node-uuid can return "U1 U2,U3 U4".

Geo-rep can split by "," and then split by space and take the first UUID of each subvolume.

DHT can split the value by space or comma and get the list of unique UUIDs.
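
As a rough sketch of how both consumers could read the proposed value, assuming the xattr arrives as a plain string (the function names below are hypothetical and are not taken from the Geo-rep or DHT code):

```python
def parse_node_uuid(value):
    """Split 'U1 U2,U3 U4' into one list of UUIDs per subvolume."""
    return [group.split() for group in value.split(",") if group.strip()]


def georep_first_uuids(value):
    """Geo-rep view: the first UUID of every subvolume."""
    return [uuids[0] for uuids in parse_node_uuid(value) if uuids]


def dht_unique_uuids(value):
    """DHT view: every distinct UUID, in first-seen order."""
    seen = []
    for node_uuid in value.replace(",", " ").split():
        if node_uuid not in seen:
            seen.append(node_uuid)
    return seen


value = "U1 U2,U3 U4"
print(georep_first_uuids(value))  # ['U1', 'U3']
print(dht_unique_uuids(value))    # ['U1', 'U2', 'U3', 'U4']
```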

Another question is about the behavior when a node is down: the existing node-uuid xattr does not return that node's UUID.
After the change [1], if a node is down we send all zeros as the uuid for that node, in the list of node uuids.

[1] https://review.gluster.org/#/c/17084/
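
A small sketch of how a consumer might cope with this, assuming the placeholder is the canonical all-zero UUID string and the list is space separated (neither detail is confirmed in this thread, and the helper name is hypothetical):

```python
import uuid

# Assumption: a down node is reported as the canonical all-zero UUID string.
NULL_UUID = str(uuid.UUID(int=0))


def up_node_uuids(value):
    """Return the UUIDs from a space-separated list, skipping down nodes."""
    return [u for u in value.split() if u != NULL_UUID]


# Made-up example UUIDs: the second node is down.
sample = "11111111-1111-1111-1111-111111111111 " + NULL_UUID
print(up_node_uuids(sample))  # ['11111111-1111-1111-1111-111111111111']
```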

Regards,
Karthik 
 What is the behavior with the proposed xattr?

Let me know your thoughts.

regards

Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan@xxxxxxxxxx> wrote:

    regards

    Aravinda VK

    On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

        Hi Pranith,

        adding gluster-devel, Kotresh and Aravinda,

        On 20/06/17 09:45, Pranith Kumar Karampuri wrote:

            On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

                On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

                    The way geo-replication works is:

                    On each machine, it does a getxattr of node-uuid and checks whether its own uuid is present in the list. If it is present, the worker considers itself active; otherwise it is considered passive. With this change we are giving all uuids instead of only that of the first-up subvolume, so all machines think they are ACTIVE, which is apparently bad. So that is the reason. Even I felt bad that we are doing this change.

                And what about changing the content of node-uuid to include some sort of hierarchy?

                for example:

                a single brick:

                NODE(<guid>)

                AFR/EC:

                AFR[2](NODE(<guid>), NODE(<guid>))

                EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

                DHT:

                DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

                This gives a lot of information that can be used to take the appropriate decisions.
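
To illustrate how an upper layer could consume such a hierarchical value, here is a minimal recursive parser. It is only a sketch under assumptions: the grammar is inferred from the examples above (an upper-case translator name, optional parameters in square brackets, children in parentheses, NODE as the leaf), and the function names and dictionary layout are hypothetical.

```python
import re

# KIND, optional [params], then an opening parenthesis.
HEAD = re.compile(r"\s*([A-Z]+)(?:\[([^\]]*)\])?\(")
# Separator between children, or the parenthesis that closes the list.
SEP = re.compile(r"\s*([,)])")


def _parse(text, pos):
    match = HEAD.match(text, pos)
    if match is None:
        raise ValueError("expected KIND[...]( at offset %d" % pos)
    kind, params = match.group(1), match.group(2)
    pos = match.end()
    if kind == "NODE":
        end = text.index(")", pos)  # a leaf holds a single guid
        return {"kind": kind, "guid": text[pos:end].strip()}, end + 1
    children = []
    while True:
        child, pos = _parse(text, pos)
        children.append(child)
        sep = SEP.match(text, pos)
        if sep is None:
            raise ValueError("expected ',' or ')' at offset %d" % pos)
        pos = sep.end()
        if sep.group(1) == ")":
            return {"kind": kind, "params": params, "children": children}, pos


def parse(text):
    """Parse a whole value and reject trailing garbage."""
    tree, pos = _parse(text, 0)
    if text[pos:].strip():
        raise ValueError("unexpected trailing text")
    return tree


def guids(tree):
    """All guids below a node, left to right."""
    if tree["kind"] == "NODE":
        return [tree["guid"]]
    return [g for child in tree["children"] for g in guids(child)]


tree = parse("DHT[2](AFR[2](NODE(u1), NODE(u2)), AFR[2](NODE(u3), NODE(u4)))")
print(guids(tree))                                   # ['u1', 'u2', 'u3', 'u4']
print([guids(sub)[0] for sub in tree["children"]])   # ['u1', 'u3']
```

With the tree available, geo-rep could pick one guid per AFR/EC subvolume while rebalance could use the full list.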

            I guess that is not backward compatible. Shall I CC gluster-devel and Kotresh/Aravinda?

        Is the change we did backward compatible? If we only require the first field to be a GUID to support backward compatibility, we can use something like this:

    No. But the necessary change can be made to the Geo-rep code as well if the format is changed, since all of these are built and shipped together.

    Geo-rep uses node-uuid as follows:

    list = listxattr(node-uuid)
    active_node_uuids = list.split(SPACE)
    active_node_flag = True if self.node_id exists in active_node_uuids else False
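
The pseudocode above can be turned into a runnable sketch. It assumes the xattr value has already been read into a string, and the old and new sample values are invented for a two-node replica purely to show why every worker turns Active after the change:

```python
def is_active(node_uuid_value, my_uuid):
    """Return True when this node's UUID appears in the xattr value."""
    active_node_uuids = node_uuid_value.split()
    return my_uuid in active_node_uuids


# Illustrative values for one replica pair (U1, U2):
old_value = "U1"     # before the change: only one node is reported
new_value = "U1 U2"  # after the change: every node is reported

for value in (old_value, new_value):
    print({node: is_active(value, node) for node in ("U1", "U2")})
# {'U1': True, 'U2': False}
# {'U1': True, 'U2': True}
```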

How was this case solved?

Suppose we have three servers and 2 bricks in each server. A replicated volume is created using the following command:

gluster volume create test replica 2 server1:/brick1 server2:/brick1 server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2

In this case we have three replica-sets:

* server1:/brick1 server2:/brick1

* server2:/brick2 server3:/brick1

* server3:/brick2 server1:/brick2

The old AFR implementation of node-uuid always returned the uuid of the node hosting the first brick, so in this case we get the uuids of all three nodes, because each of them hosts the first brick of a replica-set.
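
The grouping can be checked with a short sketch. It assumes bricks are paired in the order given on the command line, and it takes the third pair to be server3:/brick2 and server1:/brick2 so that each server contributes two distinct bricks (the helper names are hypothetical):

```python
def replica_sets(bricks, replica):
    """Group bricks into replica sets in command-line order."""
    if len(bricks) % replica:
        raise ValueError("brick count must be a multiple of the replica count")
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


def first_brick_nodes(bricks, replica):
    """Host of the first brick of every replica set."""
    return [group[0].split(":", 1)[0] for group in replica_sets(bricks, replica)]


bricks = [
    "server1:/brick1", "server2:/brick1",
    "server2:/brick2", "server3:/brick1",
    "server3:/brick2", "server1:/brick2",
]
print(first_brick_nodes(bricks, 2))  # ['server1', 'server2', 'server3']
```

Every server shows up as the first brick of some replica set, which is why all three nodes are reported.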

Does this mean that with this configuration all nodes are active? Is this a problem? Is there any other check to avoid this situation if it's not good?

Yes, all Geo-rep workers will become Active and participate in syncing. Since the changelogs contain the same information on replica bricks, this will lead to duplicate syncing and wasted network bandwidth.

Node-uuid based Active worker selection has been the default configuration in Geo-rep until now. Geo-rep also has Meta Volume based synchronization for the Active worker using lock files. (This can be opted into through the Geo-rep configuration; with this config, node-uuid will not be used.)

Kotresh proposed a solution to configure which worker becomes Active. This will give the admin more control over choosing Active workers, and it will become the default configuration from 3.12.

https://github.com/gluster/glusterfs/issues/244

-- 

Aravinda

Xavi

        Bricks:

        <guid>

        AFR/EC:

        <guid>(<guid>, <guid>)

        DHT:

        <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))

        In this case, AFR and EC would return the same <guid> they returned before the patch, but between '(' and ')' they put the full list of guids of all nodes. The first <guid> can be used by geo-replication. The list after the first <guid> can be used for rebalance.
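
A possible way to read this backward compatible format, sketched under the assumption that guids contain no whitespace, commas or parentheses (the function names are hypothetical):

```python
import re

# A guid token, optionally followed by '(' when it heads a nested list.
TOKEN = re.compile(r"([^\s(),]+)\s*(\()?")


def split_guids(value):
    """Return (heads, leaves): guids that open a list, and plain guids."""
    heads, leaves = [], []
    for match in TOKEN.finditer(value):
        (heads if match.group(2) else leaves).append(match.group(1))
    return heads, leaves


def georep_guid(value):
    """The leading guid, i.e. what was returned before the patch."""
    match = TOKEN.search(value)
    return match.group(1) if match else None


def rebalance_guids(value):
    """Every distinct node guid listed inside the parentheses."""
    _, leaves = split_guids(value)
    return list(dict.fromkeys(leaves))


afr = "U1(U1, U2)"
dht = "U1(U1(U1, U2), U3(U3, U4))"
print(georep_guid(afr), rebalance_guids(afr))  # U1 ['U1', 'U2']
print(georep_guid(dht), rebalance_guids(dht))  # U1 ['U1', 'U2', 'U3', 'U4']
```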

        Not sure if there's any user of node-uuid above DHT.

        Xavi

                Xavi

                    On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

                        Hi Pranith,

                        On 20/06/17 07:53, Pranith Kumar Karampuri wrote:

                            hi Xavi,

                            We all made the mistake of not sending a mail about changing the behavior of the node-uuid xattr so that rebalance can use multiple nodes for doing rebalance. Because of this, on geo-rep all the workers are becoming active instead of one per EC/AFR subvolume. So we are frantically trying to restore the functionality of node-uuid and introduce a new xattr for the new behavior. Sunil will be sending out a patch for this.

                        Wouldn't it be better to change geo-rep behavior to use the new data? I think it's better as it is now, since it gives more information to upper layers so that they can take more accurate decisions.

                        Xavi

                            --

                            Pranith

                    --

                    Pranith

            --

            Pranith

-- 

Pranith
