Thanks to all of you guys for the discussions!
On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Aravinda,
On 20/06/17 12:42, Aravinda wrote:
I think the following format can be easily adopted by all components: UUIDs of a subvolume are separated by space and subvolumes are separated by comma.

For example, node1 and node2 are replicas with UUIDs U1 and U2 respectively, and node3 and node4 are replicas with UUIDs U3 and U4 respectively. node-uuid can return "U1 U2,U3 U4"
While this is ok for the current implementation, I think it can be insufficient if there are more layers of xlators that need to indicate some sort of grouping. Some representation that can express hierarchy would be better, for example: "(U1 U2) (U3 U4)" (we can use spaces or commas as separators).
Geo-rep can split by "," and then split by space and take the first UUID. DHT can split the value by space or comma and get the list of unique UUIDs.
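For illustration only, a minimal Python sketch of the two split rules described above (the helper names are made up, not actual glusterfs code):

    # Sketch only: value groups replica UUIDs with spaces and separates
    # subvolumes with commas, e.g. "U1 U2,U3 U4".

    def georep_first_uuids(value):
        # Geo-rep: split by "," and take the first UUID of each subvolume.
        return [group.split()[0] for group in value.split(",") if group.strip()]

    def dht_unique_uuids(value):
        # DHT: split by space or comma and keep the unique UUIDs, in order.
        seen = []
        for uuid in value.replace(",", " ").split():
            if uuid not in seen:
                seen.append(uuid)
        return seen

    value = "U1 U2,U3 U4"
    print(georep_first_uuids(value))  # ['U1', 'U3']
    print(dht_unique_uuids(value))    # ['U1', 'U2', 'U3', 'U4']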
This doesn't solve the problem I described in the previous email. Some more logic will need to be added to avoid more than one node from each replica-set becoming active. If we have some explicit hierarchy information in the node-uuid value, more decisions can be taken.
An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))

This is harder to parse, but gives a lot of information: DHT with 2 subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's also easily extensible with any new xlator that changes the layout.

However, maybe this is not the moment to do this, and probably we could implement this in a new xattr with a better name.
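For illustration only, a rough Python sketch of how such a hierarchical value could be parsed into a tree (the exact grammar is an assumption based on the example above, not an agreed format):

    # Sketch only, not actual glusterfs code: parse a value like
    # "DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))"
    # into a tree of {name, opts, children} dicts.
    import re

    TOKEN = re.compile(r'\s*([A-Za-z0-9_-]+|[\[\](),])')

    def tokenize(text):
        pos, tokens = 0, []
        while pos < len(text):
            m = TOKEN.match(text, pos)
            if not m:
                raise ValueError("unexpected character at offset %d" % pos)
            tokens.append(m.group(1))
            pos = m.end()
        return tokens

    def parse(tokens):
        # One element: NAME[opt,opt](child, child, ...), both parts optional.
        name = tokens.pop(0)
        opts, children = [], []
        if tokens and tokens[0] == '[':
            tokens.pop(0)
            while tokens[0] != ']':
                if tokens[0] != ',':
                    opts.append(tokens[0])
                tokens.pop(0)
            tokens.pop(0)
        if tokens and tokens[0] == '(':
            tokens.pop(0)
            while tokens[0] != ')':
                if tokens[0] == ',':
                    tokens.pop(0)
                    continue
                children.append(parse(tokens))
            tokens.pop(0)
        return {'name': name, 'opts': opts, 'children': children}

    tree = parse(tokenize(
        "DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))"))
    # tree['name'] == 'DHT', tree['opts'] == ['2'],
    # tree['children'][0]['children'][0]['children'][0]['name'] == 'U1'

With a tree like this, geo-rep could pick one NODE per AFR/EC subtree while DHT could still collect all leaf uuids.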
Xavi
Another question is about the behavior when a node is down: the existing node-uuid xattr will not return that node's UUID if it is down. What is the behavior with the proposed xattr?

Let me know your thoughts.
regards
Aravinda VK
On 06/20/2017 03:06 PM, Aravinda wrote:
Hi Xavi,
On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
Hi Aravinda,
On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
Adding more people to get a consensus about this.
On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan@xxxxxxxxxx> wrote:
regards
Aravinda VK
On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
Hi Pranith,
adding gluster-devel, Kotresh and Aravinda,
On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
The way geo-replication works is: on each machine, it does a getxattr of node-uuid and checks if its own uuid is present in the list. If it is present, the worker considers itself active; otherwise it is considered passive. With this change we are giving all uuids instead of only the first-up one per subvolume, so all machines think they are ACTIVE, which is bad apparently. So that is the reason. Even I felt bad that we are doing this change.
And what about changing the content of node-uuid to include some sort of hierarchy? For example:
a single brick:
NODE(<guid>)

AFR/EC:
AFR[2](NODE(<guid>), NODE(<guid>))
EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:
DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))
This gives a lot of information that can be used to take the appropriate decisions.
I guess that is not backward compatible. Shall I CC gluster-devel and Kotresh/Aravinda?
Is the change we did backward compatible? If we only require the first field to be a GUID to support backward compatibility, we can use something like this:
No. But the necessary change can be made to the Geo-rep code as well if the format is changed, since all these are built/shipped together.
Geo-rep uses node-uuid as follows:

    value = getxattr("node-uuid")
    active_node_uuids = value.split(" ")
    active_node_flag = self.node_id in active_node_uuids
How was this case solved? Suppose we have three servers with 2 bricks each, and a replicated volume is created using the following command:

gluster volume create test replica 2 server1:/brick1 server2:/brick1 server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2

In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

The old AFR implementation for node-uuid always returned the uuid of the node of the first brick, so in this case we will get the uuids of all three nodes because each of them hosts the first brick of a replica-set.

Does this mean that with this configuration all nodes are active? Is this a problem? Is there any other check to avoid this situation if it's not good?
Yes, all Geo-rep workers will become Active and participate in syncing. Since changelogs will have the same information in replica bricks, this will lead to duplicate syncing and consume network bandwidth.

Node-uuid based Active worker selection is the default configuration in Geo-rep till now. Geo-rep also has Meta Volume based synchronization for the Active worker using lock files (this can be opted into via Geo-rep configuration; with this config node-uuid will not be used).
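For illustration only (this is not the actual Geo-rep implementation, and the path and names below are made up): the lock-file idea boils down to each worker of a replica-set trying to take an exclusive lock on a per-replica-set file on the shared meta volume, and only the winner becoming Active:

    import fcntl

    def try_become_active(replica_set_id, meta_mount="/var/mnt/meta"):
        # Hypothetical sketch: whoever gets the exclusive lock acts as the
        # Active worker; the others stay Passive.
        path = "%s/georep-%s.lock" % (meta_mount, replica_set_id)
        fd = open(path, "w")
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True, fd      # keep fd open to keep holding the lock
        except (IOError, OSError):
            fd.close()
            return False, None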
Kotresh proposed a solution to configure which worker becomes Active. This will give the Admin more control to choose Active workers, and it will become the default configuration from 3.12:
https://github.com/gluster/glusterfs/issues/244
--
Aravinda
Xavi
Bricks:
<guid>

AFR/EC:
<guid>(<guid>, <guid>)

DHT:
<guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))

In this case, AFR and EC would return the same <guid> they returned before the patch, but between '(' and ')' they put the full list of guids of all nodes. The first <guid> can be used by geo-replication. The list after the first <guid> can be used for rebalance.
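Just to illustrate how a consumer could read that backward-compatible value (hypothetical helper; it only handles the single-level AFR/EC form, the nested DHT form would need a small recursive parser like the one sketched earlier in this thread):

    def split_node_uuid(value):
        # Sketch only: "<guid>(<guid>, <guid>)" -> the leading guid keeps the
        # old meaning, the parenthesised list carries the full membership.
        head, _, rest = value.partition("(")
        members = rest.rstrip(")").replace(",", " ").split() if rest else []
        return head.strip(), members

    first, members = split_node_uuid("U1(U1, U2)")
    # first == "U1"            (what geo-replication keeps using)
    # members == ["U1", "U2"]  (what rebalance can use)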
Not sure if there's any user of node-uuid above DHT.
Xavi
On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,
On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
hi Xavi,
We all made the mistake of not sending a mail about changing the behavior of the node-uuid xattr so that rebalance can use multiple nodes for doing rebalance. Because of this, on geo-rep all the workers are becoming active instead of one per EC/AFR subvolume. So we are frantically trying to restore the functionality of node-uuid and introduce a new xattr for the new behavior. Sunil will be sending out a patch for this.
Wouldn't it be better to change the geo-rep behavior to use the new data? I think it's better as it is now, since it gives more information to upper layers so that they can take more accurate decisions.
Xavi
--
Pranith