On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,
On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
Xavi,
Now that the change has been reverted, we can resume this
discussion and decide on the exact format that considers tier, dht,
afr, and ec. The people working on geo-rep/dht/afr/ec had an internal
discussion and we all agreed that this proposal would be a good way
forward. I think once we agree on the format, decide on the initial
encoding/decoding functions for the xattr, and merge this change, we
can send patches on afr/ec/dht and geo-rep to take it to closure.
Could you propose the new format you have in mind that considers all of
the xlators?
My idea was to create a new xattr not bound to any particular function, but one that could give enough information to be used in many places.
We currently have another attribute, glusterfs.pathinfo, that returns hierarchical information about the location of a file. Maybe we can extend it to unify all these attributes into a single feature that could serve multiple purposes.
Since we have time to discuss it, I would like to design it with more information than we have already talked about.
First of all, the amount of information this attribute can contain is quite big if we expect to have volumes with thousands of bricks. Even in the simplest case of returning only a UUID per brick, we can easily go beyond the 64KB limit of an extended attribute (a UUID in canonical text form takes 36 bytes plus a separator, so a 64KB value tops out at roughly 1,770 bricks).
Consider also, for example, what shard should return when pathinfo is requested for a file. It should probably return a list of shards, each one with all its associated pathinfo. We are talking about big amounts of data here.
I think this kind of information doesn't fit very well in an extended attribute. Another thing to consider is that the requester most probably needs only a fragment of the data, so we would be generating big amounts of data only for it to be parsed and reduced later, discarding most of it.
What do you think about using a very special virtual file to manage all this information? It could be read using normal read fops, so it could handle big amounts of data easily. Also, by accessing only some parts of the file we could go directly where we want, avoiding reading all the remaining data.
A very basic idea could be this:
Each xlator would have a reserved area of the file. We can reserve up to 4GB per xlator (32 bits); the remaining 32 bits of the offset would indicate which xlator we want to access.
At offset 0 we would have generic information about the volume. One of the things this information should include is a basic hierarchy of the whole volume and the offset of each xlator's area.
After reading this, the user would seek to the desired offset and read the information related to the xlator it is interested in.
All the information should be stored in an easily extensible format that stays compatible even when new information is added in the future (for example, by doing special mappings of the 32-bit offsets reserved for each xlator). For example, we could reserve the first megabyte of each xlator's area for a mapping of attributes to their respective offsets.
I think that using a binary format would simplify all this a lot.
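A minimal sketch of how the 64-bit offset could be composed and decomposed (illustrative Python only; the constant and helper names are made up, not an existing GlusterFS API):

    # The high 32 bits select an xlator's area; the low 32 bits address
    # data within that xlator's 4GB area.
    XLATOR_AREA_BITS = 32

    def xlator_offset(xlator_id, local_offset):
        # Compose the 64-bit file offset for a position inside an xlator's area.
        assert 0 <= local_offset < (1 << XLATOR_AREA_BITS)
        return (xlator_id << XLATOR_AREA_BITS) | local_offset

    def split_offset(offset):
        # Recover (xlator_id, local_offset) from a 64-bit file offset.
        return offset >> XLATOR_AREA_BITS, offset & ((1 << XLATOR_AREA_BITS) - 1)

    # A reader would first read the generic volume information at offset 0,
    # then seek into the area of the xlator it cares about, e.g. xlator 3:
    # f.seek(xlator_offset(3, 0)); attribute_map = f.read(1 << 20)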
Do you think this is a path worth exploring, or should I stop wasting time here?
I think this just became a very big feature :-). Shall we just live with it the way it is now?
Xavi
On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya <ksubrahm@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
That's ok. I'm currently unable to write a patch for this on ec.
Sunil is working on this patch.
~Karthik
If no one can do it, I can try to do it in 6 - 7 hours...
Xavi
On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
I'm ok with reverting the node-uuid content to the previous
format and creating a new xattr for the new format.
Currently, only rebalance will use it.
The only thing to consider is what can happen if we have a
half-upgraded cluster where some clients have this change
and some don't. Can rebalance work in this situation? If
so, could there be any issue?
I think there shouldn't be any problem, because this is an
in-memory xattr, so layers below afr/ec will only see the
node-uuid xattr.
This also gives us a chance to do whatever we want with
this xattr in the future without any backward compatibility
problems.
You can check
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
for how Karthik implemented this in AFR (it got merged
accidentally yesterday, but it looks like this is what we are
settling on).
Xavi
On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
On 20 June 2017 at 20:38, Aravinda <avishwan@xxxxxxxxxx> wrote:
On 06/20/2017 06:02 PM, Pranith Kumar Karampuri
wrote:
Xavi, Aravinda and I had a discussion on
#gluster-dev and we agreed to go with the format
Aravinda suggested for now; in future we want
some more changes for dht to detect which
subvolume went down and came back up, and at that time
we will revisit the solution suggested by Xavi.
Susanth is doing the dht changes.
Aravinda is doing the geo-rep changes.
Done. Geo-rep patch sent for review:
https://review.gluster.org/17582
The proposed changes to the node-uuid behaviour
(while good) are going to break tiering. Tiering
changes will take a little more time to be coded and
tested.
As this is a regression for 3.11 and a blocker for
3.11.1, I suggest we go back to the original
node-uuid behaviour for now so as to unblock the
release, and target the proposed changes for the next
3.11 releases.
Let me see if I understand the changes correctly. We are
restoring the behavior of node-uuid xattr and adding a
new xattr for parallel rebalance for both afr and ec,
correct? Otherwise that is one more regression. If yes,
we will also wait for Xavi's inputs. Jeff accidentally
merged the afr patch yesterday which does these changes.
If everyone is in agreement, we will leave it as is and
add similar changes in ec as well. If we are not in
agreement, then we will let the discussion progress :-)
Regards,
Nithya
--
Aravinda
Thanks to all of you guys for the discussions!
On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Aravinda,
On 20/06/17 12:42, Aravinda wrote:
I think the following format can be easily
adopted by all components.
UUIDs of a subvolume are separated by
spaces, and subvolumes are separated by commas.
For example, if node1 and node2 are replicas
with UUIDs U1 and U2 respectively, and
node3 and node4 are replicas with UUIDs U3 and
U4 respectively, node-uuid can return "U1 U2,U3 U4".
While this is ok for the current implementation,
I think it can be insufficient if there
are more layers of xlators that need to
indicate some sort of grouping. Some
representation that can express hierarchy
would be better, for example: "(U1 U2) (U3
U4)" (we can use spaces or commas as
separators).
Geo-rep can split by ",", then split by space,
and take the first UUID of each group.
DHT can split the value by space or
comma and get the list of unique UUIDs.
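For illustration, a minimal Python sketch of those two parsing rules (function names are made up for the example):

    def georep_first_uuids(value):
        # One UUID per replica set: the first of each space-separated group.
        return [group.split()[0] for group in value.split(",") if group.strip()]

    def dht_unique_uuids(value):
        # All unique UUIDs, ignoring the grouping.
        unique = []
        for uuid in value.replace(",", " ").split():
            if uuid not in unique:
                unique.append(uuid)
        return unique

    value = "U1 U2,U3 U4"
    print(georep_first_uuids(value))  # ['U1', 'U3']
    print(dht_unique_uuids(value))    # ['U1', 'U2', 'U3', 'U4']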
This doesn't solve the problem I described
in the previous email. Some more logic will
be needed to prevent more than one node of
each replica-set from being active. If we
had some explicit hierarchy information in
the node-uuid value, more decisions could be
taken.
An initial proposal I made was this:

    DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))

This is harder to parse, but gives a lot of
information: DHT with 2 subvolumes, where each
subvolume is an AFR with replica 2 and no
arbiters. It's also easily extensible with
any new xlator that changes the layout.
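To show that the format is mechanically parseable, here is a small recursive-descent parser (an illustrative Python sketch, not existing code):

    import re

    TOKENS = re.compile(r'[A-Za-z0-9_-]+|[\[\](),]')

    def parse(tokens, pos=0):
        # Parse NAME, an optional [param, ...] list and optional (children).
        name = tokens[pos]
        pos += 1
        params, children = [], []
        if pos < len(tokens) and tokens[pos] == '[':
            pos += 1
            while tokens[pos] != ']':
                if tokens[pos] != ',':
                    params.append(int(tokens[pos]))
                pos += 1
            pos += 1
        if pos < len(tokens) and tokens[pos] == '(':
            pos += 1
            while tokens[pos] != ')':
                if tokens[pos] == ',':
                    pos += 1
                    continue
                child, pos = parse(tokens, pos)
                children.append(child)
            pos += 1
        return (name, params, children), pos

    value = "DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))"
    tree, _ = parse(TOKENS.findall(value))
    # tree is a nested tuple: ('DHT', [2], [('AFR', [2, 0], ...), ('AFR', [2, 0], ...)])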
However, maybe this is not the moment to do
this; we could probably implement it
in a new xattr with a better name.
Xavi
Another question is about the behavior
when a node is down: the existing
node-uuid xattr will not return that
node's UUID. What is the
behavior of the proposed xattr?
Let me know your thoughts.
regards
Aravinda VK
On 06/20/2017 03:06 PM, Aravinda wrote:
Hi Xavi,
On 06/20/2017 02:51 PM, Xavier
Hernandez wrote:
Hi Aravinda,
On 20/06/17 11:05, Pranith Kumar
Karampuri wrote:
Adding more people to get a
consensus about this.
On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan@xxxxxxxxxx> wrote:
regards
Aravinda VK
On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
Hi Pranith,
adding gluster-devel, Kotresh and Aravinda,
On 20/06/17 09:45,
Pranith Kumar Karampuri wrote:
On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
The way geo-replication works is: on each
machine, it does a getxattr of node-uuid and
checks if its own uuid is present in the list.
If it is present then it will consider itself
active, otherwise it will be considered passive.
With this change we are giving all uuids instead
of those of the first-up subvolume, so all
machines think they are ACTIVE, which is bad
apparently. So that is the reason. Even I felt
bad that we are doing this change.
And what about changing the content
of node-uuid to include some sort of
hierarchy? For example:

a single brick:

    NODE(<guid>)

AFR/EC:

    AFR[2](NODE(<guid>), NODE(<guid>))
    EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:

    DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can
be used to take the appropriate decisions.
I guess that is not backward compatible.
Shall I CC gluster-devel and Kotresh/Aravinda?
Is the change we did backward compatible? If we
only require the first field to be a GUID to
support backward compatibility, we can use
something like this:
No. But the necessary change can be made to
the geo-rep code as well if the format is
changed, since all these are built/shipped
together.
Geo-rep uses node-uuid as follows:

    value = listxattr(node-uuid)
    active_node_uuids = value.split(SPACE)
    active_node_flag = True if self.node_id in active_node_uuids else False
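A rough runnable equivalent of that pseudocode could look like this (illustrative sketch only; it assumes the virtual xattr is exposed as trusted.glusterfs.node-uuid on a glusterfs mount, and the function name is made up):

    import os

    def is_active_worker(mount_path, own_node_uuid):
        # A worker is Active only if its node's UUID appears in the
        # space-separated value of the node-uuid virtual xattr.
        value = os.getxattr(mount_path, "trusted.glusterfs.node-uuid")
        active_node_uuids = value.decode().strip("\x00").split(" ")
        return own_node_uuid in active_node_uuids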
How was this case solved? Suppose we have three
servers with 2 bricks in each server. A
replicated volume is created using the
following command:

    gluster volume create test replica 2 server1:/brick1 server2:/brick1 server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2

In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

The old AFR implementation of node-uuid always
returned the uuid of the node of the first
brick, so in this case we would get the uuids of
all three nodes because each of them hosts the
first brick of some replica-set.
Does this mean that with this configuration all
nodes are active? Is this a problem? Is there
any other check to avoid this situation if
it's not good?
Yes, all geo-rep workers will become
Active and participate in syncing.
Since changelogs will have the same
information in replica bricks, this
will lead to duplicate syncing and
wasted network bandwidth.
Node-uuid based Active worker selection has
been the default configuration in geo-rep
till now. Geo-rep also has Meta Volume based
synchronization for Active workers using lock
files (this can be opted into via geo-rep
configuration; with this config node-uuid will
not be used).
Kotresh proposed a solution to
configure which worker becomes
Active. This will give the admin more
control to choose Active workers, and it
will become the default configuration
from 3.12:
https://github.com/gluster/glusterfs/issues/244
--
Aravinda
Xavi
Bricks:

    <guid>

AFR/EC:

    <guid>(<guid>, <guid>)

DHT:

    <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
In this case, AFR and EC would return the same
<guid> they returned before the patch, but
between '(' and ')' they would put the full
list of guids of all nodes. The first <guid>
can be used by geo-replication. The list after
the first <guid> can be used for rebalance.
Not sure if there's any user of node-uuid above DHT.
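A sketch (illustrative only; names are made up) of how a consumer could split that backward-compatible layout for the flat AFR/EC case:

    def split_node_uuid(value):
        # Text before the first '(' is what old consumers (geo-rep) read;
        # the list inside the parentheses is the full set for rebalance.
        head, sep, rest = value.partition("(")
        others = [u.strip() for u in rest.rstrip(")").split(",")] if sep else []
        return head.strip(), others

    first, all_uuids = split_node_uuid("U1(U1, U2)")
    print(first)      # U1
    print(all_uuids)  # ['U1', 'U2']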
Xavi
On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,
On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
hi Xavi,
We all made the mistake of not
sending a note about changing the
behavior of the node-uuid xattr so that
rebalance can use multiple nodes
for doing rebalance. Because of this,
on geo-rep all the workers
are becoming active instead of one per
EC/AFR subvolume. So we are
frantically trying to restore the functionality
of node-uuid and introduce a new
xattr for the new behavior. Sunil will
be sending out a patch for this.
Wouldn't it be better to change the geo-rep
behavior to use the new data? I think it's
better as it is now, since it gives more
information to upper layers so that they
can take more accurate decisions.
Xavi
--
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel