Re: geo-rep regression because of node-uuid change

On 07/07/17 11:25, Pranith Kumar Karampuri wrote:


On Fri, Jul 7, 2017 at 2:46 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

    On 07/07/17 10:12, Pranith Kumar Karampuri wrote:



        On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez
        <xhernandez@xxxxxxxxxx> wrote:

            Hi Pranith,

            On 05/07/17 12:28, Pranith Kumar Karampuri wrote:



                On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez
                <xhernandez@xxxxxxxxxx> wrote:

                    Hi Pranith,

                    On 03/07/17 08:33, Pranith Kumar Karampuri wrote:

                        Xavi,
                              Now that the change has been reverted, we can resume this
                        discussion and decide on the exact format that considers tier,
                        dht, afr and ec. People working on geo-rep/dht/afr/ec had an
                        internal discussion and we all agreed that this proposal would
                        be a good way forward. I think once we agree on the format and
                        decide on the initial encoding/decoding functions of the xattr
                        and this change is merged, we can send patches on afr/ec/dht
                        and geo-rep to take it to closure.

                        Could you propose the new format you have in mind that
                        considers all of the xlators?


                    My idea was to create a new xattr not bound to any particular
                    function but which could give enough information to be used in
                    many places.

                    Currently we have another attribute called glusterfs.pathinfo
                    that returns hierarchical information about the location of a
                    file. Maybe we can extend this to unify all these attributes
                    into a single feature that could be used for multiple purposes.

                    Since we have time to discuss it, I would like to design it
                    with more information than we already talked about.

                    First of all, the amount of information that this attribute
                    can contain is quite big if we expect to have volumes with
                    thousands of bricks. Even in the simplest case of returning
                    only a UUID, we can easily go beyond the limit of 64KB.
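The 64KB concern can be checked with a little arithmetic. The sketch below assumes one UUID per brick, written in the canonical 36-character text form and joined by a single-character separator; the function names are only illustrative.

```python
# Assumption: one canonical 36-character UUID per brick, joined by a separator.
XATTR_LIMIT = 64 * 1024   # the 64KB limit mentioned above, in bytes
UUID_TEXT_LEN = 36        # e.g. "123e4567-e89b-12d3-a456-426614174000"


def uuid_list_size(bricks: int, separator: str = " ") -> int:
    """Bytes needed to list `bricks` textual UUIDs joined by `separator`."""
    return bricks * UUID_TEXT_LEN + max(bricks - 1, 0) * len(separator)


def max_bricks(limit: int = XATTR_LIMIT, separator: str = " ") -> int:
    """Largest brick count whose UUID list still fits in `limit` bytes."""
    return (limit + len(separator)) // (UUID_TEXT_LEN + len(separator))


print(max_bricks())          # bricks that fit in one 64KB value
print(uuid_list_size(2000))  # size of a 2000-brick list, in bytes
```

Under these assumptions the value overflows at somewhat under two thousand bricks, before any hierarchy or per-xlator information is added.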

                    Consider also, for example, what shard should return when
                    pathinfo is requested for a file. Probably it should return a
                    list of shards, each one with all its associated pathinfo. We
                    are talking about big amounts of data here.

                    I think this kind of information doesn't fit very well in an
                    extended attribute. Another thing to consider is that most
                    probably the requester of the data only needs a fragment of it,
                    so we are generating big amounts of data only to be parsed and
                    reduced later, dismissing most of it.

                    What do you think about using a very special virtual file to
                    manage all this information? It could be easily read using
                    normal read fops, so it could manage big amounts of data
                    easily. Also, by accessing only some parts of the file we
                    could go directly where we want, avoiding the read of all
                    remaining data.

                    A very basic idea could be this:

                    Each xlator would have a reserved area of the file. We can
                    reserve up to 4GB per xlator (32 bits). The remaining 32 bits
                    of the offset would indicate the xlator we want to access.
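A minimal sketch of that offset split, assuming the upper 32 bits of a 64-bit file offset select the xlator and the lower 32 bits address a byte inside its 4GB area. The helper names are hypothetical and are not part of any GlusterFS API.

```python
AREA_BITS = 32                       # 4GB reserved per xlator
AREA_MASK = (1 << AREA_BITS) - 1     # low 32 bits: position inside the area


def make_offset(xlator_index: int, local_offset: int) -> int:
    """Combine an xlator index and an in-area position into a file offset."""
    if not 0 <= xlator_index <= AREA_MASK:
        raise ValueError("xlator index must fit in 32 bits")
    if not 0 <= local_offset <= AREA_MASK:
        raise ValueError("local offset must fit in 32 bits")
    return (xlator_index << AREA_BITS) | local_offset


def split_offset(offset: int) -> tuple:
    """Recover (xlator_index, local_offset) from a file offset."""
    return offset >> AREA_BITS, offset & AREA_MASK


# A reader interested in xlator 3 would seek here and read that area.
print(hex(make_offset(3, 0x10)))
```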

                    At offset 0 we have generic information about the volume. One
                    of the things that this information should include is a basic
                    hierarchy of the whole volume and the offset for each xlator.

                    After reading this, the user will seek to the desired offset
                    and read the information related to the xlator it is
                    interested in.

                    All the information should be stored in an easily extensible
                    format that will be kept compatible even if new information is
                    added in the future (for example doing special mappings of the
                    32-bit offsets reserved for the xlator).

                    For example we can reserve the first megabyte of the xlator
                    area to have a mapping of attributes to their respective
                    offsets.

                    I think that using a binary format would simplify all this a
                    lot.

                    Do you think this is a way to explore or should I stop wasting
                    time here ?


                I think this just became a very big feature :-). Shall we just
                live with it the way it is now?


            I supposed so...

            The only thing we need to check is whether shard needs to handle
            this xattr. If so, what should it return: only the UUIDs
            corresponding to the first shard, or the UUIDs of all bricks
            containing at least one shard? I guess that the first one is
            enough, but just to be sure...

            My proposal was to implement a new xattr, for example
            glusterfs.layout, that contains enough information to be usable in
            all current use cases.


        Actually pathinfo is supposed to give this information and it already
        has the following format: for a 5x2 distributed-replicate volume


    Yes, I know. I wanted to unify all information.


        root@dhcp35-190 - /mnt/v3
        13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
        # file: d
        trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
        (<REPLICATE:v3-replicate-0>
        <POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
        <POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
        (<REPLICATE:v3-replicate-2>
        <POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
        <POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
        (<REPLICATE:v3-replicate-1>
        <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
        <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
        (<REPLICATE:v3-replicate-4>
        <POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
        <POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
        (<REPLICATE:v3-replicate-3>
        <POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
        <POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
        (v3-dht-layout (v3-replicate-0 0 858993458) (v3-replicate-1
        858993459
        1717986917) (v3-replicate-2 1717986918 2576980376) (v3-replicate-3
        2576980377 3435973835) (v3-replicate-4 3435973836 4294967295)))"


        root@dhcp35-190 - /mnt/v3
        13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
        # file: d/a
        trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
        (<REPLICATE:v3-replicate-1>
        <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
        <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
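Based only on the sample output above, the brick entries can be pulled out of a pathinfo value with a regular expression. This is a rough sketch that assumes every brick appears as `<POSIX(brick-root):host:path>` and that host names contain no colon; it is not a parser for the complete pathinfo grammar.

```python
import re

# Assumption: bricks always look like <POSIX(brick-root):host:path>.
BRICK_RE = re.compile(
    r"<POSIX\((?P<brick>[^)]*)\):(?P<host>[^:>]+):(?P<path>[^>]+)>"
)


def bricks_from_pathinfo(value: str) -> list:
    """Return (host, brick_root, file_path) for each POSIX entry."""
    return [
        (m.group("host"), m.group("brick"), m.group("path"))
        for m in BRICK_RE.finditer(value)
    ]


sample = (
    "(<DISTRIBUTE:v3-dht> (<REPLICATE:v3-replicate-1> "
    "<POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a> "
    "<POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
)
for host, brick, path in bricks_from_pathinfo(sample):
    print(host, brick, path)
```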




            The idea would be that each xlator that makes a significant change
            in the way or the place where files are stored, should put
            information in this xattr. The information should include:

            * Type (basically AFR, EC, DHT, ...)
            * Basic configuration (replication and arbiter for AFR, data and
              redundancy for EC, # subvolumes for DHT, shard size for
              sharding, ...)
            * Quorum imposed by the xlator
            * UUID data coming from subvolumes (sorted by brick position)
            * It should be easily extensible in the future

            The last point is very important to avoid the issues we have seen
            now. We must be able to incorporate more information without
            breaking backward compatibility. To do so, we can add tags for each
            value.

            For example, a distribute 2, replica 2 volume with 1 arbiter should
            be represented by this string:

               DHT[dist=2,quorum=1](
                  AFR[rep=2,arbiter=1,quorum=2](
                     NODE[quorum=2,uuid=<UUID1>](<path1>),
                     NODE[quorum=2,uuid=<UUID2>](<path2>),
                     NODE[quorum=2,uuid=<UUID3>](<path3>)
                  ),
                  AFR[rep=2,arbiter=1,quorum=2](
                     NODE[quorum=2,uuid=<UUID4>](<path4>),
                     NODE[quorum=2,uuid=<UUID5>](<path5>),
                     NODE[quorum=2,uuid=<UUID6>](<path6>)
                  )
               )

            Some explanations:

            AFAIK DHT doesn't have quorum, so the default is '1'. We may decide
            to omit it when it's '1' for any xlator.

            Quorum in AFR represents client-side enforced quorum. Quorum in NODE
            represents the server-side enforced quorum.

            The <path> shown in each NODE represents the physical location of
            the file (similar to current glusterfs.pathinfo) because this xattr
            can be retrieved for a particular file using getxattr. This is nice,
            but we can remove it for now if it's difficult to implement.

            We can decide to have a verbose string or try to omit some fields
            when not strictly necessary. For example, if there are no arbiters,
            we can omit the 'arbiter' tag instead of writing 'arbiter=0'. We
            could also implicitly compute 'dist' and 'rep' from the number of
            elements contained between '()'.
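To show that the proposed string is straightforward to consume, here is a small recursive-descent parser sketch. It assumes whitespace is insignificant, that tag values contain no ',' or ']', and that a NODE holds a raw path with no ')' in it. The `Layout` class and function names are hypothetical, and no such parser is claimed to exist in GlusterFS.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Layout:
    kind: str                                        # "DHT", "AFR", "NODE", ...
    tags: Dict[str, str] = field(default_factory=dict)
    children: List["Layout"] = field(default_factory=list)
    path: Optional[str] = None                       # only set for NODE(<path>)


def _parse_tags(text: str, pos: int) -> Tuple[Dict[str, str], int]:
    """Parse '[k=v,k=v]' where text[pos] is the opening '['."""
    end = text.index("]", pos)
    tags = {}
    for item in text[pos + 1:end].split(","):
        if item:
            key, _, value = item.partition("=")
            tags[key] = value
    return tags, end + 1


def _parse_node(text: str, pos: int) -> Tuple[Layout, int]:
    start = pos
    while pos < len(text) and (text[pos].isalnum() or text[pos] in "_-"):
        pos += 1
    if pos == start:
        raise ValueError(f"expected an xlator type at offset {pos}")
    node = Layout(kind=text[start:pos])
    if text.startswith("[", pos):
        node.tags, pos = _parse_tags(text, pos)
    if text.startswith("(", pos):
        pos += 1
        if node.kind == "NODE":
            # Assumption: the content of NODE(...) is a raw path.
            end = text.index(")", pos)
            node.path, pos = text[pos:end], end
        else:
            while True:
                child, pos = _parse_node(text, pos)
                node.children.append(child)
                if not text.startswith(",", pos):
                    break
                pos += 1
        if not text.startswith(")", pos):
            raise ValueError(f"expected ')' at offset {pos}")
        pos += 1
    return node, pos


def parse_layout(value: str) -> Layout:
    # Assumption: whitespace carries no meaning, so drop it up front.
    text = "".join(value.split())
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing data at offset {pos}")
    return node


root = parse_layout("DHT[dist=2](AFR[rep=2](NODE[uuid=<UUID1>](<path1>),"
                    "NODE[uuid=<UUID2>](<path2>)))")
print(root.kind, [child.kind for child in root.children])
```

Because tags are looked up by name, a reader written this way keeps working when a newer xlator adds tags it does not know about.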

            What do you think ?


        Quite a few people are already familiar with path-info. So I am of the
        opinion that we give this information for that xattr itself. This xattr
        hasn't changed after quorum/arbiter/shard came in, so maybe it should?


    Not sure how easy it would be to change the format of path-info to
    incorporate the new information without breaking existing features
    or even user scripts based on it. Maybe a new xattr would be easier
    to implement and adapt.


Probably.



    I missed one important thing in the format: an xlator may have
    per-subvolume information. This information can be placed just
    before each subvolume information:

       DHT[dist=2,quorum=1](
          [hash-range=0x00000000-0x7fffffff]AFR[...](...),
          [hash-range=0x80000000-0xffffffff]AFR[...](...)
       )
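As an illustration of how a consumer might use the per-subvolume prefix, the sketch below reads a `hash-range` tag and picks the subvolume whose range covers a given 32-bit hash. The subvolume names and helper functions are made up for the example, and the ranges are treated as inclusive on both ends, which is an assumption.

```python
from typing import List, Optional, Tuple


def parse_hash_range(tag: str) -> Tuple[int, int]:
    """Turn '[hash-range=0x00000000-0x7fffffff]' into (low, high)."""
    key, _, value = tag.strip("[]").partition("=")
    if key != "hash-range":
        raise ValueError(f"not a hash-range tag: {tag!r}")
    low, _, high = value.partition("-")
    return int(low, 16), int(high, 16)


def pick_subvolume(name_hash: int,
                   subvolumes: List[Tuple[str, str]]) -> Optional[str]:
    """Return the subvolume whose (inclusive) range contains name_hash."""
    for tag, name in subvolumes:
        low, high = parse_hash_range(tag)
        if low <= name_hash <= high:
            return name
    return None


# Hypothetical subvolume names paired with the ranges from the example.
subvols = [
    ("[hash-range=0x00000000-0x7fffffff]", "afr-0"),
    ("[hash-range=0x80000000-0xffffffff]", "afr-1"),
]
print(pick_subvolume(0x12345678, subvols))
```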


Yes, makes sense.

In general I am better at solving problems someone faces, because things
will be more concrete. Do you think it is better to wait until the first
consumer of this functionality comes along and gives their inputs about
what would be nice to have vs must have? At the moment I am not sure how
to distinguish what must be there vs what is nice to have :-(.

The good thing is that using this format we can easily start with bare minimum information, like this:

   DHT(
      AFR(
         NODE[uuid=<UUID1>],
         NODE[uuid=<UUID2>],
         NODE[uuid=<UUID3>]
      ),
      AFR(
         NODE[uuid=<UUID1>],
         NODE[uuid=<UUID2>],
         NODE[uuid=<UUID3>]
      )
   )

And add more information as it is needed, since it won't break backward compatibility.
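A short sketch of why tagged values keep old and new readers compatible: the reader asks only for the tags it knows, falls back to a default when a tag is omitted, and ignores the rest. The defaults used here (quorum '1', arbiter '0') follow the conventions suggested earlier in this thread; the function name is hypothetical.

```python
# Defaults taken from the conventions discussed in the thread.
DEFAULTS = {"quorum": "1", "arbiter": "0"}


def read_tags(raw: str, wanted: dict = DEFAULTS) -> dict:
    """Read known tags from 'k=v,k=v', applying defaults for missing ones."""
    found = {}
    for item in raw.split(","):
        if item:
            key, _, value = item.partition("=")
            found[key] = value
    # Unknown tags are ignored, so newer writers do not break this reader.
    return {key: found.get(key, default) for key, default in wanted.items()}


print(read_tags(""))                              # bare minimum: AFR(...)
print(read_tags("rep=2,arbiter=1,quorum=2"))      # verbose form
print(read_tags("quorum=2,some-future-tag=abc"))  # tag added later
```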

Xavi



    Xavi




            Xavi




                    Xavi




                        On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
                        <ksubrahm@xxxxxxxxxx> wrote:



                            On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
                            <xhernandez@xxxxxxxxxx> wrote:

                                That's ok. I'm currently unable to write a patch for
                                this on ec.

                            Sunil is working on this patch.

                            ~Karthik

                                If no one can do it, I can try to do it in 6 - 7
                                hours...

                                Xavi


                                On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar
                                Karampuri <pkarampu@xxxxxxxxxx> wrote:



                                    On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
                                    <xhernandez@xxxxxxxxxx> wrote:

                                        I'm ok with reverting node-uuid content to the
                                        previous format and creating a new xattr for the
                                        new format. Currently, only rebalance will use it.

                                        The only thing to consider is what can happen if
                                        we have a half upgraded cluster where some clients
                                        have this change and some not. Can rebalance work
                                        in this situation? If so, could there be any
                                        issue?


                                    I think there shouldn't be any problem, because this
                                    is an in-memory xattr, so layers below afr/ec will
                                    only see the node-uuid xattr. This also gives us a
                                    chance to do whatever we want to do in future with
                                    this xattr without any problems about backward
                                    compatibility.

                                    You can check
                                    https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
                                    for how Karthik implemented this in AFR (this got
                                    merged accidentally yesterday, but looks like this is
                                    what we are settling on)



                                        Xavi


                                        On Wednesday, June 21, 2017 06:56 CEST, Pranith
                                        Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:



                                            On Wed, Jun 21, 2017 at 10:07 AM, Nithya
                                            Balachandran <nbalacha@xxxxxxxxxx> wrote:


                                                On 20 June 2017 at 20:38, Aravinda
                                                <avishwan@xxxxxxxxxx> wrote:

                                                    On 06/20/2017 06:02 PM, Pranith Kumar Karampuri
                                                    wrote:

                                                        Xavi, Aravinda and I had a discussion on
                                                        #gluster-dev and we agreed to go with the format
                                                        Aravinda suggested for now. In future we want some
                                                        more changes for dht to detect which subvolume
                                                        went down and came back up; at that time we will
                                                        revisit the solution suggested by Xavi.

                                                        Susanth is doing the dht changes
                                                        Aravinda is doing geo-rep changes

                                                    Done. Geo-rep patch sent for review
                                                    https://review.gluster.org/17582



                                                The proposed changes to the node-uuid behaviour
                                                (while good) are going to break tiering. Tiering
                                                changes will take a little more time to be coded
                                                and tested.

                                                As this is a regression for 3.11 and a blocker for
                                                3.11.1, I suggest we go back to the original
                                                node-uuid behaviour for now so as to unblock the
                                                release and target the proposed changes for the
                                                next 3.11 releases.


                                            Let me see if I understand the changes correctly. We
                                            are restoring the behavior of node-uuid xattr and
                                            adding a new xattr for parallel rebalance for both
                                            afr and ec, correct? Otherwise that is one more
                                            regression. If yes, we will also wait for Xavi's
                                            inputs. Jeff accidentally merged the afr patch
                                            yesterday which does these changes. If everyone is in
                                            agreement, we will leave it as is and add similar
                                            changes in ec as well. If we are not in agreement,
                                            then we will let the discussion progress :-)




                                                Regards,
                                                Nithya

                                                    --
                                                    Aravinda


                                                        Thanks to all of you guys for the discussions!

                                                        On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez
                                                        <xhernandez@xxxxxxxxxx> wrote:

                                                            Hi Aravinda,

                                                            On 20/06/17 12:42, Aravinda wrote:

                                                                I think the following format can be easily
                                                                adopted by all components

                                                                UUIDs of a subvolume are separated by space
                                                                and subvolumes are separated by comma

                                                                For example, node1 and node2 are replica with
                                                                U1 and U2 UUIDs respectively and node3 and
                                                                node4 are replica with U3 and U4 UUIDs
                                                                respectively

                                                                node-uuid can return "U1 U2,U3 U4"


                                                            While this is ok for the current implementation,
                                                            I think this can be insufficient if there are
                                                            more layers of xlators that need to indicate some
                                                            sort of grouping. Some representation that can
                                                            represent hierarchy would be better. For example:
                                                            "(U1 U2) (U3 U4)" (we can use spaces or comma as
                                                            a separator).



                                                                Geo-rep can split by "," and then split by
                                                                space and take first UUID
                                                                DHT can split the value by space or comma and
                                                                get unique UUIDs list
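The two consumers described above can be sketched in a few lines. This assumes the value looks exactly like the "U1 U2,U3 U4" example (space between UUIDs of one subvolume, comma between subvolumes); the function names are illustrative and are not the geo-rep or DHT code.

```python
def parse_node_uuid(value: str) -> list:
    """Split 'U1 U2,U3 U4' into [['U1', 'U2'], ['U3', 'U4']]."""
    return [part.split() for part in value.split(",") if part.strip()]


def georep_first_uuids(value: str) -> list:
    """Geo-rep style: take the first UUID of every subvolume."""
    return [uuids[0] for uuids in parse_node_uuid(value)]


def dht_unique_uuids(value: str) -> list:
    """DHT style: all UUIDs, de-duplicated, in first-seen order."""
    seen = {}
    for uuids in parse_node_uuid(value):
        for uuid in uuids:
            seen.setdefault(uuid, None)
    return list(seen)


value = "U1 U2,U3 U4"
print(georep_first_uuids(value))
print(dht_unique_uuids(value))
```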


                                                            This doesn't solve the problem I described in the
                                                            previous email. Some more logic will need to be
                                                            added to avoid more than one node from each
                                                            replica-set to be active. If we have some explicit
                                                            hierarchy information in the node-uuid value, more
                                                            decisions can be taken.

                                                            An initial proposal I made was this:

                                                            DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

                                                            This is harder to parse, but gives a lot of information:
                                                            DHT with 2 subvolumes, each subvolume is an AFR with
                                                            replica 2 and no arbiters. It's also easily extensible
                                                            with any new xlator that changes the layout.
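
To give a feel for how hard the hierarchical value would be to parse, here is a minimal recursive-descent sketch in Python. It assumes the grammar NAME[params](children) with NODE(<uuid>) as the only leaf, which is inferred from the example above rather than from any agreed specification; the Xlator class and all function names are hypothetical.

```python
import re
from dataclasses import dataclass, field

# A token is a name/number/UUID, or one of the punctuation characters [ ] ( ) ,
TOKEN = re.compile(r"\s*([A-Za-z0-9_<>-]+|[\[\](),])")


@dataclass
class Xlator:
    name: str                                     # e.g. "DHT", "AFR", "NODE"
    params: list = field(default_factory=list)    # e.g. [2, 0] for AFR[2,0]
    children: list = field(default_factory=list)  # nested Xlator objects
    uuid: str = None                              # set only for NODE(...) leaves


def tokenize(text):
    text, tokens, pos = text.strip(), [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _expect(tokens, i, tok):
    if i >= len(tokens) or tokens[i] != tok:
        raise ValueError(f"expected {tok!r} at token {i}")
    return i + 1


def _parse_node(tokens, i):
    node = Xlator(name=tokens[i])
    i += 1
    if i < len(tokens) and tokens[i] == "[":      # optional [n,m,...] parameters
        i += 1
        while i < len(tokens) and tokens[i] != "]":
            if tokens[i] != ",":
                node.params.append(int(tokens[i]))
            i += 1
        i = _expect(tokens, i, "]")
    i = _expect(tokens, i, "(")
    if node.name == "NODE":                       # leaf: a single UUID
        node.uuid = tokens[i]
        i += 1
    else:                                         # inner xlator: comma-separated children
        while True:
            child, i = _parse_node(tokens, i)
            node.children.append(child)
            if i < len(tokens) and tokens[i] == ",":
                i += 1
            else:
                break
    return node, _expect(tokens, i, ")")


def parse(text):
    tokens = tokenize(text)
    node, end = _parse_node(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing input after the top-level xlator")
    return node


def leaf_uuids(node):
    """Collect the UUIDs below a node, left to right."""
    if node.uuid is not None:
        return [node.uuid]
    return [u for child in node.children for u in leaf_uuids(child)]


tree = parse("DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))")
print(tree.name, tree.params)                     # DHT [2]
print([leaf_uuids(c) for c in tree.children])     # [['U1', 'U2'], ['U1', 'U2']]
```

With the tree available, a consumer such as Geo-rep could pick one UUID per AFR child instead of guessing the grouping from separators.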

                                                            However maybe this is not the moment to do this, and
                                                            probably we could implement this in a new xattr with a
                                                            better name.

                                                            Xavi



                                                                Another question is about the behavior when a node is
                                                                down, existing node-uuid xattr will not return that
                                                                UUID if a node is down. What is the behavior with the
                                                                proposed xattr?

                                                                Let me know your thoughts.

                                                                regards
                                                                Aravinda VK

                                                                On 06/20/2017 03:06 PM, Aravinda wrote:

                                                                    Hi Xavi,

                                                                    On 06/20/2017 02:51 PM, Xavier Hernandez wrote:


        Hi Aravinda,


        On 20/06/17 11:05, Pranith Kumar Karampuri wrote:


                Adding more people to get a consensus about this.


            On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan@xxxxxxxxxx> wrote:



                regards

                Aravinda VK



                On 06/20/2017 01:26 PM, Xavier Hernandez wrote:


                    Hi Pranith,


                    adding gluster-devel, Kotresh and Aravinda,


                    On 20/06/17 09:45, Pranith Kumar Karampuri wrote:




                                    On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:


                                        On 20/06/17 09:31, Pranith Kumar Karampuri wrote:


                                            The way geo-replication works is:
                                            On each machine, it does getxattr of node-uuid and check if its
                                            own uuid is present in the list. If it is present then it will
                                            consider it active otherwise it will be considered passive. With
                                            this change we are giving all uuids instead of first-up
                                            subvolume. So all machines think they are ACTIVE which is bad
                                            apparently. So that is the reason. Even I felt bad that we are
                                            doing this change.



                                        And what about changing the content of node-uuid to include some
                                        sort of hierarchy ?

                                        for example:

                                        a single brick:

                                        NODE(<guid>)

                                        AFR/EC:

                                        AFR[2](NODE(<guid>), NODE(<guid>))
                                        EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

                                        DHT:

                                        DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

                                        This gives a lot of information that can be used to take the
                                        appropriate decisions.
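
As a rough illustration of the encoding side, the strings in the examples above could be produced by small helper functions like these. This is a Python sketch only; the helper names are hypothetical, and reading EC[3,1] as "3 bricks, redundancy 1" is an assumption, since the message does not spell out what the bracketed numbers mean.

```python
def node(guid):
    """Leaf entry for a single brick's node."""
    return f"NODE({guid})"


def afr(*children):
    """Replicate xlator; the bracket holds the replica count."""
    return f"AFR[{len(children)}]({', '.join(children)})"


def ec(redundancy, *children):
    """Disperse xlator; assumed to be [brick count, redundancy]."""
    return f"EC[{len(children)},{redundancy}]({', '.join(children)})"


def dht(*children):
    """Distribute xlator; the bracket holds the subvolume count."""
    return f"DHT[{len(children)}]({', '.join(children)})"


value = dht(afr(node("U1"), node("U2")), afr(node("U3"), node("U4")))
print(value)  # DHT[2](AFR[2](NODE(U1), NODE(U2)), AFR[2](NODE(U3), NODE(U4)))
print(ec(1, node("U1"), node("U2"), node("U3")))  # EC[3,1](NODE(U1), NODE(U2), NODE(U3))
```

Each xlator only needs to wrap what its children returned, so a new layout-changing xlator would add one more wrapper without touching the others.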



                                    I guess that is not backward compatible. Shall I CC
                                    gluster-devel and Kotresh/Aravinda?



                    Is the change we did backward compatible ? if we only require
                    the first field to be a GUID to support backward compatibility,
                    we can use something like this:


                No. But the necessary change can be made to Geo-rep code as well
                if format is changed, Since all these are built/shipped together.


                Geo-rep uses node-id as follows,

                list = listxattr(node-uuid)
                active_node_uuids = list.split(SPACE)
                active_node_flag = True if self.node_id exists in active_node_uuids else False
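
A runnable version of the check above might look like the following Python sketch. It is not the actual Geo-rep code: the function names are hypothetical, the xattr key "trusted.glusterfs.node-uuid" is an assumption (the thread only calls it node-uuid), and the value is read with getxattr, as described earlier in the thread, rather than listxattr.

```python
import os

# Assumed key name; the thread refers to it only as "node-uuid".
NODE_UUID_XATTR = "trusted.glusterfs.node-uuid"


def read_node_uuids(mount_path):
    """Read the raw node-uuid value from a mounted volume (Linux only)."""
    raw = os.getxattr(mount_path, NODE_UUID_XATTR)
    return raw.rstrip(b"\x00").decode()


def is_active(xattr_value, my_uuid):
    """A worker is Active when its own UUID appears in the space-separated list."""
    return my_uuid in xattr_value.split()


# Old behaviour: only the first-up node of each replica set is listed.
print(is_active("U1 U3", "U1"))        # True
print(is_active("U1 U3", "U2"))        # False

# With every UUID returned, every worker considers itself Active.
print(is_active("U1 U2 U3 U4", "U2"))  # True
```

The last call shows the regression discussed in this thread: once the value lists all nodes, the membership test can no longer distinguish Active from Passive workers.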



        How was this case solved ?

        suppose we have three servers and 2 bricks in each server. A
        replicated volume is created using the following command:

        gluster volume create test replica 2 server1:/brick1 server2:/brick1 server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2

        In this case we have three replica-sets:

        * server1:/brick1 server2:/brick1
        * server2:/brick2 server3:/brick1
        * server3:/brick2 server1:/brick2


        Old AFR implementation for node-uuid always returned the uuid of the
        node of the first brick, so in this case we will get the uuid of the
        three nodes because all of them are the first brick of a replica-set.
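
The effect can be checked with a small Python sketch. It assumes six distinct bricks (two per server) grouped into replica sets in the order given on the command line, and it uses server names as stand-ins for node UUIDs; the function names are hypothetical.

```python
def replica_sets(bricks, replica):
    """Group consecutive bricks into replica sets of the given size."""
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


def first_brick_nodes(bricks, replica):
    """Node of the first brick in each replica set (what old AFR reported)."""
    return [rs[0].split(":")[0] for rs in replica_sets(bricks, replica)]


bricks = [
    "server1:/brick1", "server2:/brick1",
    "server2:/brick2", "server3:/brick1",
    "server3:/brick2", "server1:/brick2",
]

print(replica_sets(bricks, 2)[2])    # ['server3:/brick2', 'server1:/brick2']
print(first_brick_nodes(bricks, 2))  # ['server1', 'server2', 'server3']
```

All three servers show up as the first brick of some replica set, so a plain membership test on node UUIDs marks every node as active in this layout.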

        Does this mean that with this configuration all nodes are active ? Is
        this a problem ? Is there any other check to avoid this situation if
        it's not good ?

                                                                    Yes all Geo-rep workers will become Active and
                                                                    participate in syncing. Since changelogs will have
                                                                    the same information in replica bricks this will
                                                                    lead to duplicate syncing and consuming network
                                                                    bandwidth.


                                                                    Node-uuid based Active worker is the default




--
Pranith

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel




