On 07/07/17 11:25, Pranith Kumar Karampuri wrote:
On Fri, Jul 7, 2017 at 2:46 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
On 07/07/17 10:12, Pranith Kumar Karampuri wrote:
On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,
On 05/07/17 12:28, Pranith Kumar Karampuri wrote:
On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,
On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
Xavi,

Now that the change has been reverted, we can resume this discussion
and decide on the exact format that considers tier, dht, afr and ec.
People working on geo-rep/dht/afr/ec had an internal discussion and we
all agreed that this proposal would be a good way forward. I think once
we agree on the format, decide on the initial encoding/decoding
functions of the xattr, and this change is merged, we can send patches
on afr/ec/dht and geo-rep to take it to closure.

Could you propose the new format you have in mind that considers all of
the xlators?
My idea was to create a new xattr not bound to any particular function
but which could give enough information to be used in many places.

Currently we have another attribute called glusterfs.pathinfo that
returns hierarchical information about the location of a file. Maybe we
can extend this to unify all these attributes into a single feature
that could be used for multiple purposes.

Since we have time to discuss it, I would like to design it with more
information than we have already talked about.

First of all, the amount of information that this attribute can contain
is quite big if we expect to have volumes with thousands of bricks.
Even in the simplest case of returning only a UUID, we can easily go
beyond the limit of 64KB.
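
As a rough check of the 64KB remark, the sketch below assumes that each
brick contributes one UUID in its canonical 36-character text form plus
one separator byte. The constant and function names are illustrative
only and are not taken from the Gluster sources.

```python
# Back-of-the-envelope check, assuming canonical text UUIDs.
import uuid

XATTR_LIMIT = 64 * 1024                  # the 64KB limit mentioned above
UUID_TEXT_LEN = len(str(uuid.uuid4()))   # 36 characters in canonical form
BYTES_PER_BRICK = UUID_TEXT_LEN + 1      # assumed: one separator byte per UUID


def max_bricks(limit=XATTR_LIMIT, per_brick=BYTES_PER_BRICK):
    """Largest number of bare UUIDs that still fits in `limit` bytes."""
    return limit // per_brick


def payload_size(bricks, per_brick=BYTES_PER_BRICK):
    """Bytes needed to return one UUID per brick."""
    return bricks * per_brick


print(max_bricks())         # 1771
print(payload_size(2000))   # 74000, already above 65536
```

Under these assumptions the limit is reached before 2000 bricks, and any
extra per-brick data (paths, tags) lowers that number further.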
Consider also, for example, what shard should return when pathinfo is
requested for a file. Probably it should return a list of shards, each
one with all its associated pathinfo. We are talking about big amounts
of data here.

I think this kind of information doesn't fit very well in an extended
attribute. Another thing to consider is that most probably the
requester of the data only needs a fragment of it, so we are generating
big amounts of data only to be parsed and reduced later, dismissing
most of it.

What do you think about using a very special virtual file to manage all
this information? It could be easily read using normal read fops, so it
could manage big amounts of data easily. Also, by accessing only some
parts of the file we could go directly where we want, avoiding the read
of all remaining data.
A very basic idea could be this:

Each xlator would have a reserved area of the file. We can reserve up
to 4GB per xlator (32 bits). The remaining 32 bits of the offset would
indicate the xlator we want to access.
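
A minimal sketch of that offset split, assuming the xlator selector
sits in the high 32 bits and the position inside the xlator's area in
the low 32 bits. The function names and the example index are
hypothetical.

```python
# Sketch of the proposed 64-bit offset split: high 32 bits select the
# xlator, low 32 bits address a byte inside that xlator's 4GB area.
XLATOR_SHIFT = 32
AREA_MASK = (1 << XLATOR_SHIFT) - 1   # 0xFFFFFFFF, i.e. 4GB - 1


def make_offset(xlator_index, local_offset):
    """Combine an xlator index and an in-area offset into a file offset."""
    if not 0 <= xlator_index <= AREA_MASK:
        raise ValueError("xlator index must fit in 32 bits")
    if not 0 <= local_offset <= AREA_MASK:
        raise ValueError("local offset must fit in 32 bits")
    return (xlator_index << XLATOR_SHIFT) | local_offset


def split_offset(offset):
    """Inverse of make_offset: return (xlator_index, local_offset)."""
    return offset >> XLATOR_SHIFT, offset & AREA_MASK


print(hex(make_offset(3, 0x100000)))   # 0x300100000
```

A reader would seek to the computed offset in the virtual file and
issue a normal read from there.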
At offset 0 we have generic information about the volume. One of the
things that this information should include is a basic hierarchy of the
whole volume and the offset for each xlator.

After reading this, the user will seek to the desired offset and read
the information related to the xlator it is interested in.

All the information should be stored in an easily extensible format
that will be kept compatible even if new information is added in the
future (for example doing special mappings of the 32-bit offsets
reserved for the xlator).

For example, we can reserve the first megabyte of the xlator area to
hold a mapping of attributes to their respective offsets.

I think that using a binary format would simplify all this a lot.

Do you think this is a way to explore, or should I stop wasting time
here?
I think this just became a very big feature :-). Shall we just live
with it the way it is now?

I supposed so...

Only thing we need to check is if shard needs to handle this xattr. If
so, what should it return? Only the UUIDs corresponding to the first
shard, or the UUIDs of all bricks containing at least one shard? I
guess that the first one is enough, but just to be sure...

My proposal was to implement a new xattr, for example glusterfs.layout,
that contains enough information to be usable in all current use cases.

Actually pathinfo is supposed to give this information and it already
has the following format: for a 5x2 distributed-replicate volume

Yes, I know. I wanted to unify all information.
root@dhcp35-190 - /mnt/v3
13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
# file: d
trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
(<REPLICATE:v3-replicate-0>
<POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
<POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
(<REPLICATE:v3-replicate-2>
<POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
<POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
(<REPLICATE:v3-replicate-1>
<POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
<POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
(<REPLICATE:v3-replicate-4>
<POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
<POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
(<REPLICATE:v3-replicate-3>
<POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
<POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
(v3-dht-layout (v3-replicate-0 0 858993458)
(v3-replicate-1 858993459 1717986917)
(v3-replicate-2 1717986918 2576980376)
(v3-replicate-3 2576980377 3435973835)
(v3-replicate-4 3435973836 4294967295)))"
root@dhcp35-190 - /mnt/v3
13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
# file: d/a
trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
(<REPLICATE:v3-replicate-1>
<POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
<POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
The idea would be that each xlator that makes a significant change in
the way or the place where files are stored should put information in
this xattr. The information should include:

* Type (basically AFR, EC, DHT, ...)
* Basic configuration (replication and arbiter for AFR, data and
  redundancy for EC, # subvolumes for DHT, shard size for sharding, ...)
* Quorum imposed by the xlator
* UUID data coming from subvolumes (sorted by brick position)
* It should be easily extensible in the future

The last point is very important to avoid the issues we have seen now.
We must be able to incorporate more information without breaking
backward compatibility. To do so, we can add tags for each value.

For example, a distribute 2, replica 2 volume with 1 arbiter should be
represented by this string:
DHT[dist=2,quorum=1](
AFR[rep=2,arbiter=1,quorum=2](
NODE[quorum=2,uuid=<UUID1>](<path1>),
NODE[quorum=2,uuid=<UUID2>](<path2>),
NODE[quorum=2,uuid=<UUID3>](<path3>)
),
AFR[rep=2,arbiter=1,quorum=2](
NODE[quorum=2,uuid=<UUID4>](<path4>),
NODE[quorum=2,uuid=<UUID5>](<path5>),
NODE[quorum=2,uuid=<UUID6>](<path6>)
)
)
Some explanations:

AFAIK DHT doesn't have quorum, so the default is '1'. We may decide to
omit it when it's '1' for any xlator.

Quorum in AFR represents client-side enforced quorum. Quorum in NODE
represents the server-side enforced quorum.

The <path> shown in each NODE represents the physical location of the
file (similar to current glusterfs.pathinfo) because this xattr can be
retrieved for a particular file using getxattr. This is nice, but we
can remove it for now if it's difficult to implement.

We can decide to have a verbose string or try to omit some fields when
not strictly necessary. For example, if there are no arbiters, we can
omit the 'arbiter' tag instead of writing 'arbiter=0'. We could also
implicitly compute 'dist' and 'rep' from the number of elements
contained between '()'.
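
To get a feel for how hard the tagged string would be to consume, here
is a small recursive parser sketch. It is only an illustration under
stated assumptions: names are alphabetic, tag values contain no ',' or
']', the parentheses of a NODE hold only a path without ')', and
per-subvolume prefixes are not handled. The sample UUIDs and brick
paths are placeholders.

```python
import re

NAME = re.compile(r"\s*([A-Za-z]+)")


def skip_ws(text, pos):
    """Advance `pos` past any whitespace."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def parse_layout(text, pos=0):
    """Parse NAME[k=v,...](child, ...) at `pos`; return (node, next_pos)."""
    m = NAME.match(text, pos)
    if not m:
        raise ValueError(f"expected xlator name at position {pos}")
    node = {"type": m.group(1), "attrs": {}, "children": [], "path": None}
    pos = m.end()
    if pos < len(text) and text[pos] == "[":
        end = text.index("]", pos)
        for item in text[pos + 1:end].split(","):
            if item.strip():
                key, _, value = item.partition("=")
                node["attrs"][key.strip()] = value.strip()
        pos = end + 1
    if pos < len(text) and text[pos] == "(":
        pos += 1
        if node["type"] == "NODE":
            # Assumed: a NODE's parentheses hold only the brick path.
            end = text.index(")", pos)
            node["path"] = text[pos:end].strip()
            pos = end
        else:
            while True:
                child, pos = parse_layout(text, pos)
                node["children"].append(child)
                pos = skip_ws(text, pos)
                if pos < len(text) and text[pos] == ",":
                    pos += 1
                    continue
                break
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        pos += 1
    return node, pos


sample = """DHT[dist=2,quorum=1](
  AFR[rep=2,arbiter=1,quorum=2](
    NODE[quorum=2,uuid=U1](/bricks/b1),
    NODE[quorum=2,uuid=U2](/bricks/b2),
    NODE[quorum=2,uuid=U3](/bricks/b3)
  ),
  AFR[rep=2,arbiter=1,quorum=2](
    NODE[quorum=2,uuid=U4](/bricks/b4),
    NODE[quorum=2,uuid=U5](/bricks/b5),
    NODE[quorum=2,uuid=U6](/bricks/b6)
  )
)"""
tree, _ = parse_layout(sample)
print(tree["type"], len(tree["children"]))   # DHT 2
```

Because unknown tags simply land in the `attrs` dictionary, a consumer
written this way keeps working when new tags are added later.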
What do you think ?
Quite a few people are already familiar with path-info. So I am of the
opinion that we give this information for that xattr itself. This xattr
hasn't changed after quorum/arbiter/shard came in, so maybe it should?

Not sure how easy it would be to change the format of path-info to
incorporate the new information without breaking existing features or
even user scripts based on it. Maybe a new xattr would be easier to
implement and adapt.

Probably.

I missed one important thing in the format: an xlator may have
per-subvolume information. This information can be placed just before
each subvolume information:
DHT[dist=2,quorum=1](
[hash-range=0x00000000-0x7fffffff]AFR[...](...),
[hash-range=0x80000000-0xffffffff]AFR[...](...)
)
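
The hash-range prefixes could be produced along these lines. The sketch
assumes an even split of the 32-bit hash space with the last subvolume
taking the remainder. That policy is an assumption, although it does
reproduce the ranges in the pathinfo output quoted earlier for five
subvolumes. The function names are illustrative.

```python
# Even split of the 32-bit hash space across DHT subvolumes (assumed policy).
HASH_SPACE = 1 << 32


def hash_ranges(subvolumes):
    """Return a list of inclusive (start, end) ranges, one per subvolume."""
    chunk = HASH_SPACE // subvolumes
    ranges = []
    for i in range(subvolumes):
        start = i * chunk
        # The last subvolume absorbs the remainder of the division.
        end = HASH_SPACE - 1 if i == subvolumes - 1 else start + chunk - 1
        ranges.append((start, end))
    return ranges


def tag(start, end):
    """Format one per-subvolume prefix in the proposed notation."""
    return f"[hash-range=0x{start:08x}-0x{end:08x}]"


for start, end in hash_ranges(2):
    print(tag(start, end) + "AFR[...](...)")
```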
Yes, makes sense.
In general I am better at solving problems someone faces, because things
will be more concrete. Do you think it is better to wait until the first
consumer of this functionality comes along and gives their inputs about
what would be nice to have vs must have? At the moment I am not sure how
to distinguish what must be there vs what is nice to have :-(.
The good thing is that using this format we can easily start with bare
minimum information, like this:
DHT(
AFR(
NODE[uuid=<UUID1>],
NODE[uuid=<UUID2>],
NODE[uuid=<UUID3>]
),
AFR(
NODE[uuid=<UUID1>],
NODE[uuid=<UUID2>],
NODE[uuid=<UUID3>]
)
)
And add more information as it is needed, since it won't break backward
compatibility.
Xavi
On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya <ksubrahm@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
That's ok. I'm currently unable to write a patch for this on ec.

Sunil is working on this patch.

~Karthik

If no one can do it, I can try to do it in 6 - 7 hours...

Xavi
On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
I'm ok with reverting the node-uuid content to the previous format and
creating a new xattr for the new format. Currently, only rebalance will
use it.

Only thing to consider is what can happen if we have a half-upgraded
cluster where some clients have this change and some do not. Can
rebalance work in this situation? If so, could there be any issue?

I think there shouldn't be any problem, because this is an in-memory
xattr, so layers below afr/ec will only see the node-uuid xattr.

This also gives us a chance to do whatever we want to do in future with
this xattr without any problems about backward compatibility.
You can check
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
for how Karthik implemented this in AFR (this got merged accidentally
yesterday, but it looks like this is what we are settling on).
Xavi
On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
On 20 June 2017 at 20:38, Aravinda <avishwan@xxxxxxxxxx> wrote:
On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:

Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
go with the format Aravinda suggested for now. In the future we want
some more changes so that dht can detect which subvolume went down and
came back up; at that time we will revisit the solution suggested by
Xavi.

Susanth is doing the dht changes.
Aravinda is doing the geo-rep changes.
Done. Geo-rep patch sent for review:
https://review.gluster.org/17582
The proposed changes to the node-uuid behaviour (while good) are going
to break tiering. Tiering changes will take a little more time to be
coded and tested.

As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
go back to the original node-uuid behaviour for now so as to unblock
the release, and target the proposed changes for the next 3.11
releases.
Let me see if I understand the changes correctly. We are restoring the
behavior of the node-uuid xattr and adding a new xattr for parallel
rebalance for both afr and ec, correct? Otherwise that is one more
regression. If yes, we will also wait for Xavi's inputs. Jeff
accidentally merged the afr patch yesterday which does these changes.
If everyone is in agreement, we will leave it as is and add similar
changes in ec as well. If we are not in agreement, then we will let the
discussion progress :-)
Regards,
Nithya
--
Aravinda
Thanks to all of you guys for the discussions!
On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:

I think the following format can be easily adopted by all components.

UUIDs of a subvolume are separated by space and subvolumes are
separated by comma.

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively, and node3 and node4 are replica with U3 and U4 UUIDs
respectively.

node-uuid can return "U1 U2,U3 U4"

While this is ok for the current implementation, I think this can be
insufficient if there are more layers of xlators that need to indicate
some sort of grouping. Some representation that can express hierarchy
would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or
commas as a separator).

Geo-rep can split by "," and then split by space and take the first
UUID.

DHT can split the value by space or comma and get the unique UUIDs
list.
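
The two consumers described above could read the "U1 U2,U3 U4" value
roughly as follows. This is only a sketch of the splitting rules stated
in the message; the function names are made up for illustration and do
not come from the geo-rep or DHT code.

```python
import re


def georep_first_uuids(value):
    """Split by ',' into subvolumes, then by space; keep each first UUID."""
    return [group.split()[0] for group in value.split(",") if group.split()]


def dht_unique_uuids(value):
    """Split by space or comma and return unique UUIDs in first-seen order."""
    seen = []
    for item in re.split(r"[ ,]+", value.strip()):
        if item and item not in seen:
            seen.append(item)
    return seen


value = "U1 U2,U3 U4"
print(georep_first_uuids(value))   # ['U1', 'U3']
print(dht_unique_uuids(value))     # ['U1', 'U2', 'U3', 'U4']
```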
This doesn't solve the problem I described in the previous email. Some
more logic will need to be added to avoid more than one node from each
replica-set being active. If we have some explicit hierarchy
information in the node-uuid value, more decisions can be taken.

An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))

This is harder to parse, but gives a lot of information: DHT with 2
subvolumes, each subvolume is an AFR with replica 2 and no arbiters.
It's also easily extensible with any new xlator that changes the
layout.

However, maybe this is not the moment to do this, and probably we could
implement this in a new xattr with a better name.

Xavi
Another question is about the behavior when a node is down: the
existing node-uuid xattr will not return that UUID if a node is down.
What is the behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK
On 06/20/2017 03:06 PM, Aravinda wrote:
Hi Xavi,
On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
Hi Aravinda,
On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
Adding more people to get a consensus about this.
On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan@xxxxxxxxxx> wrote:
regards
Aravinda VK
On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
Hi Pranith,
adding gluster-devel, Kotresh and Aravinda,
On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:

On each machine, it does getxattr of node-uuid and checks if its own
uuid is present in the list. If it is present, the machine considers
itself active; otherwise it is considered passive. With this change we
are giving all uuids instead of the first-up subvolume. So all machines
think they are ACTIVE, which is bad apparently. So that is the reason.
Even I felt bad that we are doing this change.

And what about changing the content of node-uuid to include some sort
of hierarchy? For example:

a single brick:
NODE(<guid>)

AFR/EC:
AFR[2](NODE(<guid>), NODE(<guid>))
EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:
DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can be used to take the
appropriate decisions.

I guess that is not backward compatible. Shall I CC gluster-devel and
Kotresh/Aravinda?

Is the change we did backward compatible? If we only require the first
field to be a GUID to support backward compatibility, we can use
something like this:

No. But the necessary change can be made to the Geo-rep code as well if
the format is changed, since all these are built/shipped together.
Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids else False
How was this case solved?

Suppose we have three servers and 2 bricks in each server. A replicated
volume is created using the following command:

gluster volume create test replica 2 server1:/brick1 server2:/brick1 server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2

In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

The old AFR implementation for node-uuid always returned the uuid of
the node of the first brick, so in this case we will get the uuids of
the three nodes because all of them host the first brick of a
replica-set.
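
The scenario can be replayed with a few lines of code. The sketch
assumes that the third replica set pairs server3:/brick2 with
server1:/brick2 (so that every brick is used once), uses server names
in place of real node UUIDs, and treats each server as one worker. All
function names are illustrative.

```python
# Replica sets from the example; server names stand in for node UUIDs.
replica_sets = [
    ["server1:/brick1", "server2:/brick1"],
    ["server2:/brick2", "server3:/brick1"],
    ["server3:/brick2", "server1:/brick2"],
]


def node_of(brick):
    """Return the server part of a 'server:/path' brick name."""
    return brick.split(":", 1)[0]


def old_afr_node_uuids(sets):
    """Old behaviour as described: the first brick's node per replica set."""
    return [node_of(bricks[0]) for bricks in sets]


def is_active(my_node, node_uuid_value):
    """Geo-rep check: a worker is Active if its node appears in the list."""
    return my_node in node_uuid_value.split(" ")


value = " ".join(old_afr_node_uuids(replica_sets))
for server in ("server1", "server2", "server3"):
    print(server, "ACTIVE" if is_active(server, value) else "PASSIVE")
```

Every server hosts the first brick of some replica set, so every server
reports ACTIVE under this check.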
Does this mean that with this configuration all nodes are active? Is
this a problem? Is there any other check to avoid this situation if
it's not good?

Yes, all Geo-rep workers will become Active and participate in syncing.
Since changelogs will have the same information in replica bricks, this
will lead to duplicate syncing and consume network bandwidth.

Node-uuid based Active worker is the default
--
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel