Re: geo-rep regression because of node-uuid change

On Fri, Jul 7, 2017 at 3:05 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
On 07/07/17 11:25, Pranith Kumar Karampuri wrote:


On Fri, Jul 7, 2017 at 2:46 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

    On 07/07/17 10:12, Pranith Kumar Karampuri wrote:



        On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez
        <xhernandez@xxxxxxxxxx> wrote:

            Hi Pranith,

            On 05/07/17 12:28, Pranith Kumar Karampuri wrote:



                On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez
                <xhernandez@xxxxxxxxxx> wrote:

                    Hi Pranith,

                    On 03/07/17 08:33, Pranith Kumar Karampuri wrote:

                        Xavi,
                              Now that the change has been reverted, we can
                        resume this discussion and decide on the exact format
                        that considers tier, dht, afr and ec. People working
                        on geo-rep/dht/afr/ec had an internal discussion and
                        we all agreed that this proposal would be a good way
                        forward. I think once we agree on the format and
                        decide on the initial encoding/decoding functions of
                        the xattr, and this change is merged, we can send
                        patches on afr/ec/dht and geo-rep to take it to
                        closure.

                        Could you propose the new format you have in mind
                        that considers all of the xlators?


                    My idea was to create a new xattr not bound to any
                    particular function but which could give enough
                    information to be used in many places.

                    Currently we have another attribute called
                    glusterfs.pathinfo that returns hierarchical information
                    about the location of a file. Maybe we can extend this to
                    unify all these attributes into a single feature that
                    could be used for multiple purposes.

                    Since we have time to discuss it, I would like to design
                    it with more information than we have already talked
                    about.

                    First of all, the amount of information that this
                    attribute can contain is quite big if we expect to have
                    volumes with thousands of bricks. Even in the simplest
                    case of returning only a UUID, we can easily go beyond
                    the limit of 64KB.
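
A rough back-of-the-envelope check of the 64KB remark above. This is only a sketch under stated assumptions: each UUID is taken in its canonical 36-character text form, entries are joined by a single separator byte, and 64KB is the limit mentioned in the message. None of these constants come from the GlusterFS code.

```python
# Assumptions for illustration only (not taken from GlusterFS sources):
UUID_TEXT_LEN = 36        # canonical textual UUID, e.g. 8-4-4-4-12 hex digits
SEPARATOR_LEN = 1         # one byte between consecutive UUIDs
XATTR_LIMIT = 64 * 1024   # the 64KB limit mentioned in the thread


def max_uuids(limit=XATTR_LIMIT):
    """Largest n such that n UUIDs plus (n - 1) separators fit in `limit` bytes."""
    return (limit + SEPARATOR_LEN) // (UUID_TEXT_LEN + SEPARATOR_LEN)


print(max_uuids())  # 1771
```

Under these assumptions a flat list of bare UUIDs already runs out of room at fewer than two thousand bricks, before any hierarchy or tag information is added.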

                    Consider also, for example, what shard should return when
                    pathinfo is requested for a file. Probably it should
                    return a list of shards, each one with all its associated
                    pathinfo. We are talking about big amounts of data here.

                    I think this kind of information doesn't fit very well in
                    an extended attribute. Another thing to consider is that
                    most probably the requester of the data only needs a
                    fragment of it, so we are generating big amounts of data
                    only to be parsed and reduced later, dismissing most of
                    it.

                    What do you think about using a very special virtual file
                    to manage all this information? It could be easily read
                    using normal read fops, so it could manage big amounts of
                    data easily. Also, by accessing only some parts of the
                    file we could go directly where we want, avoiding the
                    read of all remaining data.

                    A very basic idea could be this:

                    Each xlator would have a reserved area of the file. We
                    can reserve up to 4GB per xlator (32 bits). The remaining
                    32 bits of the offset would indicate the xlator we want
                    to access.
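
The offset split described above could be sketched as follows. This is a minimal illustration, assuming the xlator selector sits in the upper 32 bits and the position inside that xlator's 4GB area in the lower 32 bits; the function names and the numbering of xlators are hypothetical and not part of any existing GlusterFS interface.

```python
AREA_BITS = 32                     # 4GB reserved per xlator
AREA_MASK = (1 << AREA_BITS) - 1   # also the largest allowed xlator index here


def make_offset(xlator_index, area_offset):
    """Combine a (hypothetical) xlator index and an in-area offset into one 64-bit file offset."""
    if not 0 <= xlator_index <= AREA_MASK:
        raise ValueError("xlator index does not fit in 32 bits")
    if not 0 <= area_offset <= AREA_MASK:
        raise ValueError("area offset does not fit in 32 bits")
    return (xlator_index << AREA_BITS) | area_offset


def split_offset(offset):
    """Inverse of make_offset: return (xlator_index, area_offset)."""
    return offset >> AREA_BITS, offset & AREA_MASK


# Example: byte 0x100 inside the area of xlator number 3.
offset = make_offset(3, 0x100)
print(hex(offset))           # 0x300000100
print(split_offset(offset))  # (3, 256)
```

A reader would seek to `make_offset(n, 0)` to reach the start of the area of xlator `n`; how `n` is assigned would come from the generic information stored at offset 0.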

                    At offset 0 we have generic information about the volume.
                    One of the things that this information should include is
                    a basic hierarchy of the whole volume and the offset for
                    each xlator.

                    After reading this, the user will seek to the desired
                    offset and read the information related to the xlator it
                    is interested in.

                    All the information should be stored in an easily
                    extensible format that will be kept compatible even if
                    new information is added in the future (for example doing
                    special mappings of the 32-bit offsets reserved for the
                    xlator).

                    For example we can reserve the first megabyte of the
                    xlator area to have a mapping of attributes with their
                    respective offsets.

                    I think that using a binary format would simplify all
                    this a lot.

                    Do you think this is a way to explore or should I stop
                    wasting time here?


                I think this just became a very big feature :-). Shall we
                just live with it the way it is now?


            I supposed so...

            The only thing we need to check is whether shard needs to handle
            this xattr. If so, what should it return? Only the UUIDs
            corresponding to the first shard, or the UUIDs of all bricks
            containing at least one shard? I guess that the first one is
            enough, but just to be sure...

            My proposal was to implement a new xattr, for example
            glusterfs.layout, that contains enough information to be usable
            in all current use cases.


        Actually pathinfo is supposed to give this information and it already
        has the following format: for a 5x2 distributed-replicate volume


    Yes, I know. I wanted to unify all information.


        root@dhcp35-190 - /mnt/v3
        13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
        # file: d
        trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
        (<REPLICATE:v3-replicate-0>
        <POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
        <POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
        (<REPLICATE:v3-replicate-2>
        <POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
        <POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
        (<REPLICATE:v3-replicate-1>
        <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
        <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
        (<REPLICATE:v3-replicate-4>
        <POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
        <POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
        (<REPLICATE:v3-replicate-3>
        <POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
        <POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
        (v3-dht-layout (v3-replicate-0 0 858993458)
        (v3-replicate-1 858993459 1717986917)
        (v3-replicate-2 1717986918 2576980376)
        (v3-replicate-3 2576980377 3435973835)
        (v3-replicate-4 3435973836 4294967295)))"


        root@dhcp35-190 - /mnt/v3
        13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
        # file: d/a
        trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
        (<REPLICATE:v3-replicate-1>
        <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
        <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
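
For readers who want to consume this output from a script, a minimal sketch of pulling the brick locations out of a pathinfo value follows. It assumes the value always has the `<POSIX(export):host:path>` shape shown in the two outputs above; the `Brick` type and the function name are made up for illustration, and hostnames containing a colon (for example raw IPv6 addresses) would not be handled.

```python
import re
from typing import NamedTuple


class Brick(NamedTuple):
    export: str   # brick directory on the server, e.g. /home/gfs/v3_3
    host: str     # server hosting the brick
    path: str     # physical path of the file on that brick


# Matches one "<POSIX(export):host:path>" entry as printed by getfattr above.
_POSIX_RE = re.compile(r"<POSIX\(([^)]*)\):([^:>]+):([^>]+)>")


def bricks_from_pathinfo(value):
    """Return every POSIX entry found in a pathinfo string, in order of appearance."""
    return [Brick(*m.groups()) for m in _POSIX_RE.finditer(value)]


sample = (
    "(<DISTRIBUTE:v3-dht> (<REPLICATE:v3-replicate-1> "
    "<POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a> "
    "<POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
)
for brick in bricks_from_pathinfo(sample):
    print(brick.host, brick.path)
```

This kind of ad hoc parsing is exactly what makes changing the existing pathinfo format risky for user scripts.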




            The idea would be that each xlator that makes a significant change
            in the way or the place where files are stored should put
            information in this xattr. The information should include:

            * Type (basically AFR, EC, DHT, ...)
            * Basic configuration (replication and arbiter for AFR, data and
              redundancy for EC, # subvolumes for DHT, shard size for
              sharding, ...)
            * Quorum imposed by the xlator
            * UUID data coming from subvolumes (sorted by brick position)
            * It should be easily extensible in the future

            The last point is very important to avoid the issues we have seen
            now. We must be able to incorporate more information without
            breaking backward compatibility. To do so, we can add tags for
            each value.

            For example, a distribute 2, replica 2 volume with 1 arbiter
            should be represented by this string:

               DHT[dist=2,quorum=1](
                  AFR[rep=2,arbiter=1,quorum=2](
                     NODE[quorum=2,uuid=<UUID1>](<path1>),
                     NODE[quorum=2,uuid=<UUID2>](<path2>),
                     NODE[quorum=2,uuid=<UUID3>](<path3>)
                  ),
                  AFR[rep=2,arbiter=1,quorum=2](
                     NODE[quorum=2,uuid=<UUID4>](<path4>),
                     NODE[quorum=2,uuid=<UUID5>](<path5>),
                     NODE[quorum=2,uuid=<UUID6>](<path6>)
                  )
               )

Yes, this looks simpler for now.
 

            Some explanations:

            AFAIK DHT doesn't have quorum, so the default is '1'. We may
            decide to omit it when it's '1' for any xlator.

            Quorum in AFR represents client-side enforced quorum. Quorum in
            NODE represents the server-side enforced quorum.

            The <path> shown in each NODE represents the physical location of
            the file (similar to current glusterfs.pathinfo) because this
            xattr can be retrieved for a particular file using getxattr. This
            is nice, but we can remove it for now if it's difficult to
            implement.

            We can decide to have a verbose string or try to omit some fields
            when not strictly necessary. For example, if there are no
            arbiters, we can omit the 'arbiter' tag instead of writing
            'arbiter=0'. We could also implicitly compute 'dist' and 'rep'
            from the number of elements contained between '()'.

            What do you think ?


        Quite a few people are already familiar with path-info. So I am of
        the opinion that we give this information in that xattr itself. This
        xattr hasn't changed after quorum/arbiter/shard came in, so maybe it
        should?


    Not sure how easy it would be to change the format of path-info to
    incorporate the new information without breaking existing features
    or even user scripts based on it. Maybe a new xattr would be easier
    to implement and adapt.


Probably.



    I missed one important thing in the format: an xlator may have
    per-subvolume information. This information can be placed just
    before each subvolume information:

       DHT[dist=2,quorum=1](
          [hash-range=0x00000000-0x7fffffff]AFR[...](...),
          [hash-range=0x80000000-0xffffffff]AFR[...](...)
       )


Yes, makes sense.

In general I am better at solving problems someone faces, because things
will be more concrete. Do you think it is better to wait until the first
consumer of this functionality comes along and gives their inputs about
what would be nice to have vs must have? At the moment I am not sure how
to distinguish what must be there vs what is nice to have :-(.

The good thing is that using this format we can easily start with bare minimum information, like this:

   DHT(
      AFR(
         NODE[uuid=<UUID1>],
         NODE[uuid=<UUID2>],
         NODE[uuid=<UUID3>]
      ),
      AFR(
         NODE[uuid=<UUID1>],
         NODE[uuid=<UUID2>],
         NODE[uuid=<UUID3>]
      )
   )

And add more information as it is needed, since it won't break backward compatibility.
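
To make the backward-compatibility argument concrete, here is a small sketch of a reader for this kind of string. It is only an illustration of the format as discussed in this thread, not an existing GlusterFS parser: the `Layout` type and function names are hypothetical, tags are kept as plain strings, and the optional `(<path>)` payload of NODE entries is not handled.

```python
import re
from dataclasses import dataclass, field

_NAME_RE = re.compile(r"[A-Za-z]+")


@dataclass
class Layout:
    kind: str                                      # xlator type: DHT, AFR, NODE, ...
    tags: dict = field(default_factory=dict)       # tags written after the type name
    sub_tags: dict = field(default_factory=dict)   # per-subvolume tags written before it
    children: list = field(default_factory=list)


def _parse_tags(text, pos):
    """Parse an optional "[k=v,k=v]" group at pos; return (tags, new_pos)."""
    if pos >= len(text) or text[pos] != "[":
        return {}, pos
    end = text.index("]", pos)          # raises ValueError if unterminated
    tags = {}
    for item in text[pos + 1:end].split(","):
        if item:
            key, _, value = item.partition("=")
            tags[key] = value
    return tags, end + 1


def _parse_node(text, pos):
    sub_tags, pos = _parse_tags(text, pos)
    match = _NAME_RE.match(text, pos)
    if not match:
        raise ValueError(f"expected xlator type at offset {pos}")
    pos = match.end()
    tags, pos = _parse_tags(text, pos)
    children = []
    if pos < len(text) and text[pos] == "(":
        pos += 1
        while True:
            child, pos = _parse_node(text, pos)
            children.append(child)
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            break
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at offset {pos}")
        pos += 1
    return Layout(match.group(), tags, sub_tags, children), pos


def parse_layout(text):
    """Parse a layout string; whitespace is ignored in this sketch."""
    text = re.sub(r"\s+", "", text)
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing data at offset {pos}")
    return node


minimal = "DHT(AFR(NODE[uuid=U1], NODE[uuid=U2]), AFR(NODE[uuid=U3], NODE[uuid=U4]))"
tree = parse_layout(minimal)
print([[n.tags["uuid"] for n in afr.children] for afr in tree.children])
```

A consumer that only looks up the tags it knows (here `uuid`) keeps working when a later version adds `quorum`, `arbiter` or `hash-range` tags, which is the property being argued for.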

Xavi



    Xavi




            Xavi




                    Xavi




                        On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
                        <ksubrahm@xxxxxxxxxx> wrote:



                            On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
                            <xhernandez@xxxxxxxxxx> wrote:

                                That's ok. I'm currently unable to write a
                                patch for this on ec.

                            Sunil is working on this patch.

                            ~Karthik

                                If no one can do it, I can try to do it in
                                6 - 7 hours...

                                Xavi


                                On Wednesday, June 21, 2017 09:48 CEST,
                                Pranith Kumar Karampuri
                                <pkarampu@xxxxxxxxxx> wrote:



                                    On Wed, Jun 21, 2017 at 1:00 PM,
                                    Xavier Hernandez
                                    <xhernandez@xxxxxxxxxx> wrote:

                                        I'm ok with reverting node-uuid
                                        content to the previous format and
                                        creating a new xattr for the new
                                        format. Currently, only rebalance
                                        will use it.

                                        The only thing to consider is what
                                        can happen if we have a half
                                        upgraded cluster where some clients
                                        have this change and some do not.
                                        Can rebalance work in this
                                        situation? If so, could there be
                                        any issue?


                                    I think there shouldn't be any problem,
                                    because this is an in-memory xattr, so
                                    layers below afr/ec will only see the
                                    node-uuid xattr. This also gives us a
                                    chance to do whatever we want to do in
                                    future with this xattr without any
                                    problems about backward compatibility.

                                    You can check

        https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507

                                    for how Karthik implemented this in AFR
                                    (this got merged accidentally yesterday,
                                    but it looks like this is what we are
                                    settling on)



                                        Xavi


                                        On Wednesday, June 21, 2017
                                        06:56 CEST, Pranith Kumar Karampuri
                                        <pkarampu@xxxxxxxxxx> wrote:



                                            On Wed, Jun 21, 2017 at 10:07 AM,
                                            Nithya Balachandran
                                            <nbalacha@xxxxxxxxxx> wrote:


                                                On 20 June 2017 at 20:38,
                                                Aravinda
                                                <avishwan@xxxxxxxxxx> wrote:

                                                    On 06/20/2017 06:02 PM,
                                                    Pranith Kumar Karampuri
                                                    wrote:

                                                        Xavi, Aravinda and I had a
                                                        discussion on #gluster-dev and
                                                        we agreed to go with the format
                                                        Aravinda suggested for now. In
                                                        future we want some more changes
                                                        for dht to detect which subvolume
                                                        went down and came back up; at
                                                        that time we will revisit the
                                                        solution suggested by Xavi.

                                                        Susanth is doing the dht changes
                                                        Aravinda is doing geo-rep changes

                                                    Done. Geo-rep patch sent for
                                                    review
                                                    https://review.gluster.org/17582



                                                The proposed changes to the
                                                node-uuid behaviour (while good)
                                                are going to break tiering.
                                                Tiering changes will take a
                                                little more time to be coded and
                                                tested.

                                                As this is a regression for 3.11
                                                and a blocker for 3.11.1, I
                                                suggest we go back to the
                                                original node-uuid behaviour for
                                                now so as to unblock the release
                                                and target the proposed changes
                                                for the next 3.11 releases.


                                            Let me see if I understand the
                                            changes correctly. We are restoring
                                            the behavior of the node-uuid xattr
                                            and adding a new xattr for parallel
                                            rebalance for both afr and ec,
                                            correct? Otherwise that is one more
                                            regression. If yes, we will also
                                            wait for Xavi's inputs. Jeff
                                            accidentally merged the afr patch
                                            yesterday which does these changes.
                                            If everyone is in agreement, we will
                                            leave it as is and add similar
                                            changes in ec as well. If we are not
                                            in agreement, then we will let the
                                            discussion progress :-)




                                                Regards,
                                                Nithya

                                                    --
                                                    Aravinda


                                                        Thanks to all of you guys for
                                                        the discussions!

                                                        On Tue, Jun 20, 2017 at 5:05 PM,
                                                        Xavier Hernandez
                                                        <xhernandez@xxxxxxxxxx> wrote:

                                                            Hi Aravinda,

                                                            On 20/06/17 12:42, Aravinda
                                                            wrote:

                                                                I think the following format
                                                                can be easily adopted by all
                                                                components.

                                                                UUIDs of a subvolume are
                                                                separated by space and
                                                                subvolumes are separated by
                                                                comma.

                                                                For example, node1 and node2
                                                                are replica with U1 and U2
                                                                UUIDs respectively, and node3
                                                                and node4 are replica with U3
                                                                and U4 UUIDs respectively.

                                                                node-uuid can return
                                                                "U1 U2,U3 U4"


                                                            While this is ok for the current
                                                            implementation, I think this can
                                                            be insufficient if there are more
                                                            layers of xlators that need to
                                                            indicate some sort of grouping.
                                                            Some representation that can
                                                            represent hierarchy would be
                                                            better. For example:
                                                            "(U1 U2) (U3 U4)" (we can use
                                                            spaces or comma as a separator).



                                                                Geo-rep can split by "," and
                                                                then split by space and take
                                                                the first UUID.
                                                                DHT can split the value by
                                                                space or comma and get the
                                                                unique UUIDs list.
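
The two splitting rules described for the "U1 U2,U3 U4" value can be sketched like this. It is a minimal illustration of the text above, not the actual geo-rep or DHT code; the function names are hypothetical.

```python
import re


def georep_first_uuids(value):
    """Split by "," into subvolumes, then by space, and keep the first UUID of each."""
    return [group.split()[0] for group in value.split(",") if group.strip()]


def dht_unique_uuids(value):
    """Split by space or comma and return the UUIDs without duplicates, in order."""
    unique = []
    for uuid in re.split(r"[ ,]+", value.strip()):
        if uuid and uuid not in unique:
            unique.append(uuid)
    return unique


value = "U1 U2,U3 U4"
print(georep_first_uuids(value))  # ['U1', 'U3']
print(dht_unique_uuids(value))    # ['U1', 'U2', 'U3', 'U4']
```

Note that the flat format carries only one level of grouping, which is the limitation raised in the reply that follows.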


                                                            This doesn't solve the problem I
                                                            described in the previous email.
                                                            Some more logic will need to be
                                                            added to avoid more than one node
                                                            from each replica-set being
                                                            active. If we have some explicit
                                                            hierarchy information in the
                                                            node-uuid value, more decisions
                                                            can be taken.

                                                            An initial proposal I made was this:

                                                            DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

                                                            This is harder to parse, but gives a lot of information: DHT
                                                            with 2 subvolumes, each subvolume is an AFR with replica 2
                                                            and no arbiters. It's also easily extensible with any new
                                                            xlator that changes the layout.
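
To give an idea of how hard the hierarchical form is to parse, here is a rough Python sketch of a recursive-descent parser for it. The grammar is only inferred from the single example in the proposal (a name, optional integer parameters in square brackets, then children in parentheses), so treat it as an assumption. The second replica set uses U3 and U4 to match the four-node layout discussed earlier. The policy of picking the first node of each AFR set is hypothetical and only illustrates the kind of decision the hierarchy makes possible.

```python
import re

# A token is an identifier/UUID/number, or one of: [ ] ( ) ,
TOKEN = re.compile(r"\s*([A-Za-z0-9_-]+|[\[\](),])")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_node(tokens, i):
    """Parse NAME[p1,p2,...](children) starting at tokens[i].
    Returns the parsed node and the index of the next unread token."""
    name = tokens[i]
    i += 1
    params = []
    if tokens[i] == "[":
        i += 1
        while tokens[i] != "]":
            if tokens[i] != ",":
                params.append(int(tokens[i]))
            i += 1
        i += 1  # skip "]"
    if tokens[i] != "(":
        raise ValueError(f"expected '(' after {name}")
    i += 1
    if name == "NODE":
        node = {"type": name, "uuid": tokens[i]}  # leaf: holds one UUID
        i += 1
    else:
        children = []
        while tokens[i] != ")":
            if tokens[i] == ",":
                i += 1
                continue
            child, i = parse_node(tokens, i)
            children.append(child)
        node = {"type": name, "params": params, "children": children}
    if tokens[i] != ")":
        raise ValueError(f"expected ')' to close {name}")
    return node, i + 1


def parse(text):
    tokens = tokenize(text)
    try:
        node, end = parse_node(tokens, 0)
    except IndexError:
        raise ValueError("truncated value") from None
    if end != len(tokens):
        raise ValueError("trailing data after the top-level xlator")
    return node


def one_node_per_replica_set(node):
    """Hypothetical policy: take the first node of every AFR set."""
    if node["type"] == "NODE":
        return [node["uuid"]]
    children = node["children"]
    if node["type"] == "AFR":
        return one_node_per_replica_set(children[0]) if children else []
    picked = []
    for child in children:
        picked.extend(one_node_per_replica_set(child))
    return picked


tree = parse("DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))")
print(tree["type"], tree["params"])    # DHT [2]
print(one_node_per_replica_set(tree))  # ['U1', 'U3']
```

A new xlator type would show up as just another node in the tree. Consumers that do not recognise it can still walk through its children.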

                                                            However, maybe this is not the moment to do this, and we
                                                            could probably implement it in a new xattr with a better name.

                                                            Xavi



                                                                Another question is about the behavior when a node is down:
                                                                the existing node-uuid xattr will not return that UUID if a
                                                                node is down. What is the behavior with the proposed xattr?

                                                                Let me know your thoughts.

                                                                regards
                                                                Aravinda VK

                                                                On 06/20/2017 03:06 PM, Aravinda wrote:

                                                                    Hi Xavi,

                                                                    On 06/20/2017 02:51 PM, Xavier Hernandez wrote:


        Hi Aravinda,


        On 20/06/17 11:05, Pranith Kumar Karampuri wrote:


                Adding more people to get a consensus about this.


            On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan@xxxxxxxxxx> wrote:



                regards

                Aravinda VK



                On 06/20/2017 01:26 PM,



--
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel
