On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,
On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
Xavi,
Now that the change has been reverted, we can resume this
discussion and decide on the exact format that considers tier, dht,
afr, and ec. The people working on geo-rep/dht/afr/ec had an internal
discussion and we all agreed that this proposal would be a good way
forward. I think once we agree on the format, decide on the initial
encoding/decoding functions for the xattr, and merge this change, we
can send patches on afr/ec/dht and geo-rep to take it to closure.
Could you propose the new format you have in mind that considers all of
the xlators?
My idea was to create a new xattr not bound to any particular function, but one that could give enough information to be used in many places.
We currently have another attribute, glusterfs.pathinfo, that returns hierarchical information about the location of a file. Maybe we can extend it to unify all these attributes into a single feature that could serve multiple purposes.
Since we have time to discuss it, I would like to design it with more information than we have already talked about.
First of all, the amount of information this attribute can contain is quite big if we expect to have volumes with thousands of bricks. Even in the simplest case of returning only a UUID per brick, we can easily go beyond the 64KB limit of an extended attribute (a UUID in canonical text form takes 36 bytes plus a separator, so a 64KB value tops out at roughly 1,770 bricks).
Consider also, for example, what shard should return when pathinfo is requested for a file. It should probably return a list of shards, each one with all its associated pathinfo. We are talking about big amounts of data here.
I think this kind of information doesn't fit very well in an extended attribute. Another thing to consider is that the requester most probably needs only a fragment of the data, so we would be generating big amounts of data only for it to be parsed and reduced later, discarding most of it.
What do you think about using a very special virtual file to manage all this information? It could be read using normal read fops, so it could handle big amounts of data easily. Also, by accessing only some parts of the file we could go directly where we want, avoiding reading all the remaining data.
A very basic idea could be this:
Each xlator would have a reserved area of the file. We can reserve up to 4GB per xlator (32 bits); the remaining 32 bits of the offset would indicate which xlator we want to access.
At offset 0 we would have generic information about the volume. One of the things this information should include is a basic hierarchy of the whole volume and the offset of each xlator's area.
After reading this, the user would seek to the desired offset and read the information related to the xlator it is interested in.
All the information should be stored in an easily extensible format that stays compatible even when new information is added in the future (for example, by doing special mappings of the 32-bit offsets reserved for each xlator). For example, we could reserve the first megabyte of each xlator's area for a mapping of attributes to their respective offsets.
I think that using a binary format would simplify all this a lot.
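A minimal sketch of how the 64-bit offset could be composed and decomposed (illustrative Python only; the constant and helper names are made up, not an existing GlusterFS API):

    # The high 32 bits select an xlator's area; the low 32 bits address
    # data within that xlator's 4GB area.
    XLATOR_AREA_BITS = 32

    def xlator_offset(xlator_id, local_offset):
        # Compose the 64-bit file offset for a position inside an xlator's area.
        assert 0 <= local_offset < (1 << XLATOR_AREA_BITS)
        return (xlator_id << XLATOR_AREA_BITS) | local_offset

    def split_offset(offset):
        # Recover (xlator_id, local_offset) from a 64-bit file offset.
        return offset >> XLATOR_AREA_BITS, offset & ((1 << XLATOR_AREA_BITS) - 1)

    # A reader would first read the generic volume information at offset 0,
    # then seek into the area of the xlator it cares about, e.g. xlator 3:
    # f.seek(xlator_offset(3, 0)); attribute_map = f.read(1 << 20)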
Do you think this is a path worth exploring, or should I stop wasting time here?
I think this just became a very big feature :-). Shall we just live with it the way it is now?
Xavi
On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya <ksubrahm@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
That's ok. I'm currently unable to write a patch for this on ec.
Sunil is working on this patch.
~Karthik
If no one can do it, I can try to do it in 6 - 7 hours...
Xavi
On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
I'm ok with reverting the node-uuid content to the previous
format and creating a new xattr for the new format.
Currently, only rebalance will use it.
The only thing to consider is what can happen if we have a
half-upgraded cluster where some clients have this change
and some don't. Can rebalance work in this situation? If
so, could there be any issue?
I think there shouldn't be any problem, because this is an
in-memory xattr, so layers below afr/ec will only see the
node-uuid xattr.
This also gives us a chance to do whatever we want with
this xattr in the future without any backward compatibility
problems.
You can check
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
for how Karthik implemented this in AFR (it got merged
accidentally yesterday, but it looks like this is what we are
settling on).
Xavi
On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
On 20 June 2017 at 20:38, Aravinda <avishwan@xxxxxxxxxx> wrote:
On 06/20/2017 06:02 PM, Pranith Kumar Karampuri
wrote:
Xavi, Aravinda and I had a discussion on
#gluster-dev and we agreed to go with the format
Aravinda suggested for now; in future we want
some more changes for dht to detect which
subvolume went down and came back up, and at that time
we will revisit the solution suggested by Xavi.
Susanth is doing the dht changes.
Aravinda is doing the geo-rep changes.
Done. Geo-rep patch sent for review:
https://review.gluster.org/17582
The proposed changes to the node-uuid behaviour
(while good) are going to break tiering. Tiering
changes will take a little more time to be coded and
tested.
As this is a regression for 3.11 and a blocker for
3.11.1, I suggest we go back to the original
node-uuid behaviour for now so as to unblock the
release, and target the proposed changes for the next
3.11 releases.
Let me see if I understand the changes correctly. We are
restoring the behavior of node-uuid xattr and adding a
new xattr for parallel rebalance for both afr and ec,
correct? Otherwise that is one more regression. If yes,
we will also wait for Xavi's inputs. Jeff accidentally
merged the afr patch yesterday which does these changes.
If everyone is in agreement, we will leave it as is and
add similar changes in ec as well. If we are not in
agreement, then we will let the discussion progress :-)
Regards,
Nithya
--
Aravinda
Thanks to all of you guys for the discussions!
On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Aravinda,
On 20/06/17 12:42, Aravinda wrote:
I think the following format can be easily
adopted by all components.
UUIDs of a subvolume are separated by
spaces, and subvolumes are separated by commas.
For example, if node1 and node2 are replicas
with UUIDs U1 and U2 respectively, and
node3 and node4 are replicas with UUIDs U3 and
U4 respectively, node-uuid can return "U1 U2,U3 U4".
While this is ok for the current implementation,
I think it can be insufficient if there
are more layers of xlators that need to
indicate some sort of grouping. Some
representation that can express hierarchy
would be better, for example: "(U1 U2) (U3
U4)" (we can use spaces or commas as
separators).
Geo-rep can split by ",", then split by space,
and take the first UUID of each group.
DHT can split the value by space or
comma and get the list of unique UUIDs.
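For illustration, a minimal Python sketch of those two parsing rules (function names are made up for the example):

    def georep_first_uuids(value):
        # One UUID per replica set: the first of each space-separated group.
        return [group.split()[0] for group in value.split(",") if group.strip()]

    def dht_unique_uuids(value):
        # All unique UUIDs, ignoring the grouping.
        unique = []
        for uuid in value.replace(",", " ").split():
            if uuid not in unique:
                unique.append(uuid)
        return unique

    value = "U1 U2,U3 U4"
    print(georep_first_uuids(value))  # ['U1', 'U3']
    print(dht_unique_uuids(value))    # ['U1', 'U2', 'U3', 'U4']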
This doesn't solve the problem I described
in the previous email. Some more logic will
be needed to prevent more than one node of
each replica-set from being active. If we
had some explicit hierarchy information in
the node-uuid value, more decisions could be
taken.
An initial proposal I made was this:

    DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))

This is harder to parse, but gives a lot of
information: DHT with 2 subvolumes, where each
subvolume is an AFR with replica 2 and no
arbiters. It's also easily extensible with
any new xlator that changes the layout.
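To show that the format is mechanically parseable, here is a small recursive-descent parser (an illustrative Python sketch, not existing code):

    import re

    TOKENS = re.compile(r'[A-Za-z0-9_-]+|[\[\](),]')

    def parse(tokens, pos=0):
        # Parse NAME, an optional [param, ...] list and optional (children).
        name = tokens[pos]
        pos += 1
        params, children = [], []
        if pos < len(tokens) and tokens[pos] == '[':
            pos += 1
            while tokens[pos] != ']':
                if tokens[pos] != ',':
                    params.append(int(tokens[pos]))
                pos += 1
            pos += 1
        if pos < len(tokens) and tokens[pos] == '(':
            pos += 1
            while tokens[pos] != ')':
                if tokens[pos] == ',':
                    pos += 1
                    continue
                child, pos = parse(tokens, pos)
                children.append(child)
            pos += 1
        return (name, params, children), pos

    value = "DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))"
    tree, _ = parse(TOKENS.findall(value))
    # tree is a nested tuple: ('DHT', [2], [('AFR', [2, 0], ...), ('AFR', [2, 0], ...)])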
However, maybe this is not the moment to do
this; we could probably implement it
in a new xattr with a better name.
Xavi
Another question is about the behavior
when a node is down: the existing
node-uuid xattr will not return that
node's UUID. What is the
behavior of the proposed xattr?
Let me know your thoughts.
regards
Aravinda VK
On 06/20/2017 03:06 PM, Aravinda wrote:
Hi Xavi,
On 06/20/2017 02:51 PM, Xavier
Hernandez wrote:
Hi Aravinda,
On 20/06/17 11:05, Pranith Kumar
Karampuri wrote:
Adding more people to get a
consensus about this.
On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan@xxxxxxxxxx> wrote:
regards
Aravinda VK
On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
Hi Pranith,
adding gluster-devel, Kotresh and Aravinda,
On 20/06/17 09:45,
Pranith Kumar Karampuri wrote:
On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
The way geo-replication works is: on each
machine, it does a getxattr of node-uuid and
checks if its own uuid is present in the list.
If it is present then it will consider itself
active, otherwise it will be considered passive.
With this change we are giving all uuids instead
of those of the first-up subvolume, so all
machines think they are ACTIVE, which is bad
apparently. So that is the reason. Even I felt
bad that we are doing this change.
And what about changing the content
of node-uuid to include some sort of
hierarchy? For example:

a single brick:

    NODE(<guid>)

AFR/EC:

    AFR[2](NODE(<guid>), NODE(<guid>))
    EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:

    DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can
be used to take the appropriate decisions.
I guess that is not backward compatible.
Shall I CC gluster-devel and Kotresh/Aravinda?
Is the change we did backward compatible? If we
only require the first field to be a GUID to
support backward compatibility, we can use
something like this:
No. But the necessary change can be made to
the geo-rep code as well if the format is
changed, since all these are built/shipped
together.
Geo-rep uses node-uuid as follows:

    value = listxattr(node-uuid)
    active_node_uuids = value.split(SPACE)
    active_node_flag = True if self.node_id in active_node_uuids else False
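A rough runnable equivalent of that pseudocode could look like this (illustrative sketch only; it assumes the virtual xattr is exposed as trusted.glusterfs.node-uuid on a glusterfs mount, and the function name is made up):

    import os

    def is_active_worker(mount_path, own_node_uuid):
        # A worker is Active only if its node's UUID appears in the
        # space-separated value of the node-uuid virtual xattr.
        value = os.getxattr(mount_path, "trusted.glusterfs.node-uuid")
        active_node_uuids = value.decode().strip("\x00").split(" ")
        return own_node_uuid in active_node_uuids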
How was this case solved? Suppose we have three
servers with 2 bricks in each server. A
replicated volume is created using the
following command:

    gluster volume create test replica 2 server1:/brick1 server2:/brick1 server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2

In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

The old AFR implementation of node-uuid always
returned the uuid of the node of the first
brick, so in this case we would get the uuids of
all three nodes because each of them hosts the
first brick of some replica-set.
Does this mean that with this configuration all
nodes are active? Is this a problem? Is there
any other check to avoid this situation if
it's not good?
Yes, all geo-rep workers will become
Active and participate in syncing.
Since changelogs will have the same
information in replica bricks, this
will lead to duplicate syncing and
wasted network bandwidth.
Node-uuid based Active worker selection has
been the default configuration in geo-rep
till now. Geo-rep also has Meta Volume based
synchronization for Active workers using lock
files (this can be opted into via geo-rep
configuration; with this config node-uuid will
not be used).
Kotresh proposed a solution to
configure which worker becomes
Active. This will give the admin more
control to choose Active workers, and it
will become the default configuration
from 3.12:
https://github.com/gluster/glusterfs/issues/244
--
Aravinda
Xavi
Bricks:

    <guid>

AFR/EC:

    <guid>(<guid>, <guid>)

DHT:

    <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
In this case, AFR and EC would return the same
<guid> they returned before the patch, but
between '(' and ')' they would put the full
list of guids of all nodes. The first <guid>
can be used by geo-replication. The list after
the first <guid> can be used for rebalance.
Not sure if there's any user of node-uuid above DHT.
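A sketch (illustrative only; names are made up) of how a consumer could split that backward-compatible layout for the flat AFR/EC case:

    def split_node_uuid(value):
        # Text before the first '(' is what old consumers (geo-rep) read;
        # the list inside the parentheses is the full set for rebalance.
        head, sep, rest = value.partition("(")
        others = [u.strip() for u in rest.rstrip(")").split(",")] if sep else []
        return head.strip(), others

    first, all_uuids = split_node_uuid("U1(U1, U2)")
    print(first)      # U1
    print(all_uuids)  # ['U1', 'U2']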
Xavi
On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi Pranith,
On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
hi Xavi,
We all made the mistake of not
sending a note about changing the
behavior of the node-uuid xattr so that
rebalance can use multiple nodes
for doing rebalance. Because of this,
on geo-rep all the workers
are becoming active instead of one per
EC/AFR subvolume. So we are
frantically trying to restore the functionality
of node-uuid and introduce a new
xattr for the new behavior. Sunil will
be sending out a patch for this.
Wouldn't it be better to change the geo-rep
behavior to use the new data? I think it's
better as it is now, since it gives more
information to upper layers so that they
can take more accurate decisions.
Xavi
--
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel