Re: geo-rep regression because of node-uuid change

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Tue, 4 Jul 2017 10:56:26 +0200

Hi Pranith,

On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
Xavi,
      Now that the change has been reverted, we can resume this
discussion and decide on the exact format that considers, tier, dht,
afr, ec. People working geo-rep/dht/afr/ec had an internal discussion
and we all agreed that this proposal would be a good way forward. I
think once we agree on the format and decide on the initial
encoding/decoding functions of the xattr and this change is merged, we
can send patches on afr/ec/dht and geo-rep to take it to closure.

Could you propose the new format you have in mind that considers all of
the xlators?

My idea was to create a new xattr not bound to any particular function 
but which could give enough information to be used in many places.

Currently we have another attribute called glusterfs.pathinfo that 
returns hierarchical information about the location of a file. Maybe we 
can extend this to unify all these attributes into a single feature that 
could be used for multiple purposes.

Since we have time to discuss it, I would like to design it with more 
information than we already talked.

First of all, the amount of information that this attribute can contain 
is quite big if we expect to have volumes with thousands of bricks. Even 
in the most simple case of returning only an UUID, we can easily go 
beyond the limit of 64KB.

Consider also, for example, what shard should return when pathinfo is 
requested for a file. Probably it should return a list of shards, each 
one with all its associated pathinfo. We are talking about big amounts 
of data here.

I think this kind of information doesn't fit very well in an extended 
attribute. Another think to consider is that most probably the requester 
of the data only needs a fragment of it, so we are generating big 
amounts of data only to be parsed and reduced later, dismissing most of it.

What do you think about using a very special virtual file to manage all 
this information ? it could be easily read using normal read fops, so it 
could manage big amounts of data easily. Also, accessing only to some 
parts of the file we could go directly where we want, avoiding the read 
of all remaining data.

A very basic idea could be this:

Each xlator would have a reserved area of the file. We can reserve up to 
4GB per xlator (32 bits). The remaining 32 bits of the offset would 
indicate the xlator we want to access.

At offset 0 we have generic information about the volume. One of the the 
things that this information should include is a basic hierarchy of the 
whole volume and the offset for each xlator.

After reading this, the user will seek to the desired offset and read 
the information related to the xlator it is interested in.

All the information should be stored in a format easily extensible that 
will be kept compatible even if new information is added in the future 
(for example doing special mappings of the 32 bits offsets reserved for 
the xlator).

For example we can reserve the first megabyte of the xlator area to have 
a mapping of attributes with its respective offset.

I think that using a binary format would simplify all this a lot.

Do you think this is a way to explore or should I stop wasting time here ?

Xavi

On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
<ksubrahm@xxxxxxxxxx <mailto:ksubrahm@xxxxxxxxxx>> wrote:

    On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
    <xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>> wrote:

        That's ok. I'm currently unable to write a patch for this on ec.

    Sunil is working on this patch.

    ~Karthik

        If no one can do it, I can try to do it in 6 - 7 hours...

        Xavi

        On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri
        <pkarampu@xxxxxxxxxx <mailto:pkarampu@xxxxxxxxxx>> wrote:

        On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
        <xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>> wrote:

            I'm ok with reverting node-uuid content to the previous
            format and create a new xattr for the new format.
            Currently, only rebalance will use it.

            Only thing to consider is what can happen if we have a
            half upgraded cluster where some clients have this change
            and some not. Can rebalance work in this situation ? if
            so, could there be any issue ?

        I think there shouldn't be any problem, because this is
        in-memory xattr so layers below afr/ec will only see node-uuid
        xattr.
        This also gives us a chance to do whatever we want to do in
        future with this xattr without any problems about backward
        compatibility.

        You can check
        https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
        <https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507>
        for how karthik implemented this in AFR (this got merged
        accidentally yesterday, but looks like this is what we are
        settling on)

            Xavi

            On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar
            Karampuri <pkarampu@xxxxxxxxxx
            <mailto:pkarampu@xxxxxxxxxx>> wrote:

            On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran
            <nbalacha@xxxxxxxxxx <mailto:nbalacha@xxxxxxxxxx>> wrote:

                On 20 June 2017 at 20:38, Aravinda
                <avishwan@xxxxxxxxxx <mailto:avishwan@xxxxxxxxxx>> wrote:

                    On 06/20/2017 06:02 PM, Pranith Kumar Karampuri
                    wrote:
                    Xavi, Aravinda and I had a discussion on
                    #gluster-dev and we agreed to go with the format
                    Aravinda suggested for now and in future we
                    wanted some more changes for dht to detect which
                    subvolume went down came back up, at that time
                    we will revisit the solution suggested by Xavi.

                    Susanth is doing the dht changes
                    Aravinda is doing geo-rep changes
                    Done. Geo-rep patch sent for review
                    https://review.gluster.org/17582
                    <https://review.gluster.org/17582>

                The proposed changes to the node-uuid behaviour
                (while good) are going to break tiering . Tiering
                changes will take a little more time to be coded and
                tested.

                As this is a regression for 3.11 and a blocker for
                3.11.1, I suggest we go back to the original
                node-uuid behaviour for now so as to unblock the
                release and target the proposed changes for the next
                3.11 releases.

            Let me see if I understand the changes correctly. We are
            restoring the behavior of node-uuid xattr and adding a
            new xattr for parallel rebalance for both afr and ec,
            correct? Otherwise that is one more regression. If yes,
            we will also wait for Xavi's inputs. Jeff accidentally
            merged the afr patch yesterday which does these changes.
            If everyone is in agreement, we will leave it as is and
            add similar changes in ec as well. If we are not in
            agreement, then we will let the discussion progress :-)

                Regards,
                Nithya

                    --
                    Aravinda

                    Thanks to all of you guys for the discussions!

                    On Tue, Jun 20, 2017 at 5:05 PM, Xavier
                    Hernandez <xhernandez@xxxxxxxxxx
                    <mailto:xhernandez@xxxxxxxxxx>> wrote:

                        Hi Aravinda,

                        On 20/06/17 12:42, Aravinda wrote:

                            I think following format can be easily
                            adopted by all components

                            UUIDs of a subvolume are seperated by
                            space and subvolumes are separated
                            by comma

                            For example, node1 and node2 are replica
                            with U1 and U2 UUIDs
                            respectively and
                            node3 and node4 are replica with U3 and
                            U4 UUIDs respectively

                            node-uuid can return "U1 U2,U3 U4"

                        While this is ok for current implementation,
                        I think this can be insufficient if there
                        are more layers of xlators that require to
                        indicate some sort of grouping. Some
                        representation that can represent hierarchy
                        would be better. For example: "(U1 U2) (U3
                        U4)" (we can use spaces or comma as a
                        separator).

                            Geo-rep can split by "," and then split
                            by space and take first UUID
                            DHT can split the value by space or
                            comma and get unique UUIDs list

                        This doesn't solve the problem I described
                        in the previous email. Some more logic will
                        need to be added to avoid more than one node
                        from each replica-set to be active. If we
                        have some explicit hierarchy information in
                        the node-uuid value, more decisions can be
                        taken.

                        An initial proposal I made was this:

                        DHT[2](AFR[2,0](NODE(U1), NODE(U2)),
                        AFR[2,0](NODE(U1), NODE(U2)))

                        This is harder to parse, but gives a lot of
                        information: DHT with 2 subvolumes, each
                        subvolume is an AFR with replica 2 and no
                        arbiters. It's also easily extensible with
                        any new xlator that changes the layout.

                        However maybe this is not the moment to do
                        this, and probably we could implement this
                        in a new xattr with a better name.

                        Xavi

                            Another question is about the behavior
                            when a node is down, existing
                            node-uuid xattr will not return that
                            UUID if a node is down. What is the
                            behavior with the proposed xattr?

                            Let me know your thoughts.

                            regards
                            Aravinda VK

                            On 06/20/2017 03:06 PM, Aravinda wrote:

                                Hi Xavi,

                                On 06/20/2017 02:51 PM, Xavier
                                Hernandez wrote:

                                    Hi Aravinda,

                                    On 20/06/17 11:05, Pranith Kumar
                                    Karampuri wrote:

                                        Adding more people to get a
                                        consensus about this.

                                        On Tue, Jun 20, 2017 at 1:49
                                        PM, Aravinda
                                        <avishwan@xxxxxxxxxx
                                        <mailto:avishwan@xxxxxxxxxx>
                                        <mailto:avishwan@xxxxxxxxxx
                                        <mailto:avishwan@xxxxxxxxxx>>>
                                        wrote:

                                            regards
                                            Aravinda VK

                                            On 06/20/2017 01:26 PM,
                                        Xavier Hernandez wrote:

                                                Hi Pranith,

                                                adding
                                        gluster-devel, Kotresh and
                                        Aravinda,

                                                On 20/06/17 09:45,
                                        Pranith Kumar Karampuri wrote:

                                                    On Tue, Jun 20,
                                        2017 at 1:12 PM, Xavier
                                        Hernandez

                                        <xhernandez@xxxxxxxxxx
                                        <mailto:xhernandez@xxxxxxxxxx>
                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>>

                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>

                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>>>>
                                        wrote:

                                                        On 20/06/17
                                        09:31, Pranith Kumar
                                        Karampuri wrote:

                                                            The way
                                        geo-replication works is:
                                                            On each
                                        machine, it does getxattr of
                                        node-uuid and
                                                    check if its
                                                            own uuid
                                                            is
                                        present in the list. If it
                                        is present then it
                                                    will consider
                                                            it active

                                        otherwise it will be
                                        considered passive. With this
                                                    change we are
                                                            giving
                                                            all
                                        uuids instead of first-up
                                        subvolume. So all
                                                    machines think
                                                            they are
                                                            ACTIVE
                                        which is bad apparently. So
                                        that is the
                                                    reason. Even I
                                                            felt bad
                                                            that we
                                        are doing this change.

                                                        And what
                                        about changing the content
                                        of node-uuid to
                                                    include some
                                                        sort of
                                        hierarchy ?

                                                        for example:

                                                        a single brick:

                                                        NODE(<guid>)

                                                        AFR/EC:

                                        AFR[2](NODE(<guid>),
                                        NODE(<guid>))

                                        EC[3,1](NODE(<guid>),
                                        NODE(<guid>), NODE(<guid>))

                                                        DHT:

                                        DHT[2](AFR[2](NODE(<guid>),
                                        NODE(<guid>)),
                                                    AFR[2](NODE(<guid>),
                                                        NODE(<guid>)))

                                                        This gives a
                                        lot of information that can
                                        be used to
                                        take the
                                                        appropriate
                                        decisions.

                                                    I guess that is
                                        not backward compatible.
                                        Shall I CC
                                                    gluster-devel and
                                                    Kotresh/Aravinda?

                                                Is the change we did
                                        backward compatible ? if we
                                        only require
                                                the first field to
                                        be a GUID to support
                                        backward compatibility,
                                                we can use something
                                        like this:

                                            No. But the necessary
                                        change can be made to
                                        Geo-rep code as well if
                                            format is changed, Since
                                        all these are built/shipped
                                        together.

                                            Geo-rep uses node-id as
                                        follows,

                                            list = listxattr(node-uuid)
                                            active_node_uuids =
                                        list.split(SPACE)
                                            active_node_flag = True
                                        if self.node_id exists in
                                        active_node_uuids
                                            else False

                                    How was this case solved ?

                                    suppose we have three servers
                                    and 2 bricks in each server. A
                                    replicated volume is created
                                    using the following command:

                                    gluster volume create test
                                    replica 2 server1:/brick1
                                    server2:/brick1
                                    server2:/brick2 server3:/brick1
                                    server3:/brick1 server1:/brick2

                                    In this case we have three
                                    replica-sets:

                                    * server1:/brick1 server2:/brick1
                                    * server2:/brick2 server3:/brick1
                                    * server3:/brick2 server2:/brick2

                                    Old AFR implementation for
                                    node-uuid always returned the
                                    uuid of the
                                    node of the first brick, so in
                                    this case we will get the uuid
                                    of the
                                    three nodes because all of them
                                    are the first brick of a
                                    replica-set.

                                    Does this mean that with this
                                    configuration all nodes are
                                    active ? Is
                                    this a problem ? Is there any
                                    other check to avoid this
                                    situation if
                                    it's not good ?

                                Yes all Geo-rep workers will become
                                Active and participate in syncing.
                                Since changelogs will have the same
                                information in replica bricks this
                                will lead to duplicate syncing and
                                consuming network bandwidth.

                                Node-uuid based Active worker is the
                                default configuration in Geo-rep
                                till now, Geo-rep also has Meta
                                Volume based syncronization for Active
                                worker using lock files.(Can be
                                opted using Geo-rep configuration,
                                with this config node-uuid will not
                                be used)

                                Kotresh proposed a solution to
                                configure which worker to become
                                Active. This will give more control
                                to Admin to choose Active workers,
                                This will become default
                                configuration from 3.12
                                https://github.com/gluster/glusterfs/issues/244
                                <https://github.com/gluster/glusterfs/issues/244>

                                --
                                Aravinda

                                    Xavi

                                                Bricks:

                                                <guid>

                                                AFR/EC:
                                                <guid>(<guid>, <guid>)

                                                DHT:

                                        <guid>(<guid>(<guid>, ...),
                                        <guid>(<guid>, ...))

                                                In this case, AFR
                                        and EC would return the same
                                        <guid> they
                                                returned before the
                                        patch, but between '(' and
                                        ')' they put the
                                                full list of guid's
                                        of all nodes. The first
                                        <guid> can be used
                                                by geo-replication.
                                        The list after the first
                                        <guid> can be used
                                                for rebalance.

                                                Not sure if there's
                                        any user of node-uuid above DHT.

                                                Xavi

                                                        Xavi

                                                            On Tue,
                                        Jun 20, 2017 at 12:46 PM,
                                        Xavier Hernandez

                                        <xhernandez@xxxxxxxxxx
                                        <mailto:xhernandez@xxxxxxxxxx>

                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>>
                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>

                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>>>

                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>

                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>>
                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>

                                        <mailto:xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>>>>>
                                                            wrote:

                                                                Hi
                                        Pranith,

                                                                On
                                        20/06/17 07:53, Pranith
                                        Kumar Karampuri
                                        wrote:

                                        hi Xavi,

                                               We all made the
                                        mistake of not
                                                    sending about
                                        changing

                                        behavior of

                                        node-uuid xattr so that
                                        rebalance can use
                                                    multiple nodes
                                                            for doing

                                        rebalance. Because of this
                                        on geo-rep all
                                                    the workers
                                                            are becoming

                                        active instead of one per
                                        EC/AFR subvolume.
                                                    So we are

                                        frantically trying

                                        to restore the functionality
                                        of node-uuid
                                                    and introduce
                                                            a new

                                        xattr for

                                        the new behavior. Sunil will
                                        be sending out
                                                    a patch for
                                                            this.

                                        Wouldn't it be better to
                                        change geo-rep
                                        behavior
                                                    to use the
                                                            new data
                                                                ? I
                                        think it's better as it's
                                        now, since it
                                                    gives more
                                                            information
                                                                to
                                        upper layers so that they
                                        can take more
                                                    accurate decisions.

                                                                Xavi

                                                                    --

                                        Pranith

                                                            --
                                                            Pranith

                                                    --
                                                    Pranith

                                        --
                                        Pranith

                    --
                    Pranith

            --
            Pranith

        --
        Pranith

--
Pranith

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel