I would like to look at this problem from a different angle than a
metadata server approach, and also list some of the ongoing efforts and
wish-list items that address problems of this nature (hence the top
post).
Comments welcome! (There is a lot of "hand waving" below, BTW :) )
The problem you describe concerns the fan-out of calls rather than the
calls themselves. I.e., if today we do lock(N+M)->record file
length(N+M)->write->unlock(N+M), then putting a metadata server in place
would only remove the fan-out portions of that sequence, leaving
lock->record file length->write->unlock.
(I am not an expert on EC, so I am assuming the sequence is right. At
least lock->unlock is present in this form in both EC and AFR, so the
discussion can proceed with these in mind.)
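To make the point about fan-out concrete, here is a rough sketch of the
call counts for one write. It is a toy model, not Gluster code: the step
names come from the sequence above, and write_cost and fanned_steps are
hypothetical names made up for this illustration.

```python
# Toy call-count model; the step names follow the sequence in this mail.
STEPS = ("lock", "record_file_length", "write", "unlock")


def write_cost(n, m, fanned_steps=STEPS):
    """Return (sequential_steps, total_messages) for one client write.

    A step listed in fanned_steps is sent to all n + m bricks; any other
    step is a single call (for example to a hypothetical metadata server).
    """
    width = n + m
    messages = sum(width if step in fanned_steps else 1 for step in STEPS)
    return len(STEPS), messages


if __name__ == "__main__":
    print(write_cost(4, 2))                   # every step fans out
    print(write_cost(4, 2, fanned_steps=()))  # no step fans out
```

The number of sequential steps is the same in both cases; only the
number of messages per step changes, which is why removing fan-out alone
does not shorten the sequence.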
A) Fan-out makes every operation as slow as its slowest response, but if
we could remove some steps in the process (say the entire lock->unlock)
then we would be better placed to speed up the stack. One possibility is
to use the delegation support that has been added to the Gluster stack
for NFS Ganesha.
With auto delegations piggybacked on a file open/creat/lookup and
granted to a gluster client stack, the locks are local to the client and
hence need no network round trips. Some parts of this are in [1] and
some are discussed in [2].
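As a rough illustration of what a delegation saves, the sketch below
counts round trips for one write with and without a delegation. This is
not the Gluster or NFS Ganesha API: ToyClient, has_delegation and _send
are hypothetical names, and the model assumes one round trip per
non-local step.

```python
class ToyClient:
    """Counts network round trips for a write; not the Gluster client API."""

    def __init__(self, has_delegation):
        self.has_delegation = has_delegation
        self.round_trips = 0
        self.local_locks = set()

    def _send(self, op, path):
        # Stand-in for one network round trip to the bricks.
        self.round_trips += 1

    def _lock(self, path):
        if self.has_delegation:
            self.local_locks.add(path)      # local lock, no network call
        else:
            self._send("lock", path)

    def _unlock(self, path):
        if self.has_delegation:
            self.local_locks.discard(path)  # local unlock, no network call
        else:
            self._send("unlock", path)

    def write(self, path):
        """Perform lock -> write -> unlock and return total round trips."""
        self._lock(path)
        try:
            self._send("write", path)
        finally:
            self._unlock(path)
        return self.round_trips
```

In this model a client without a delegation pays for lock, write and
unlock on the network, while a client holding one pays only for the
write.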
B) For the fan-out itself, NSR (see [3]) proposes server-side leader
election. The leader could be the equivalent of a metadata server, but
distributed, with one leader per EC/AFR set. This removes any single-MDS
limitation and distributes the MDS load as well.
In this scheme, the leader takes locks locally rather than the client
having to send lock requests, which again reduces the number of FOPs.
The leader can also record the file length etc., and failed
transactions can be handled better, possibly reducing other network
FOPs/calls further.
If at all possible, with A+B we should be able to reach a 1:1 call
count between a FOP issued by the client and a network FOP to a brick
(with, on some occasions, a fan-out of 1:k). For the most part that
would mean equivalence with any existing network file system that is
not distributed (e.g. NFS).
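The 1:1 and 1:k counts above can be sketched with a toy leader and its
followers. This is only a model of the idea, not NSR code: ToyBrick,
ToyLeader and client_write are hypothetical names, and the followers
simply receive the same write (a replication-style assumption; an EC set
would send fragments instead).

```python
class ToyBrick:
    """A brick that counts the FOPs it receives and tracks file length."""

    def __init__(self, name):
        self.name = name
        self.fops_received = 0
        self.file_length = {}

    def apply_write(self, path, offset, data):
        self.fops_received += 1
        end = offset + len(data)
        self.file_length[path] = max(self.file_length.get(path, 0), end)


class ToyLeader(ToyBrick):
    """The elected brick: locks locally, then fans out to k followers."""

    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self.locked = set()

    def client_write(self, path, offset, data):
        self.locked.add(path)                # local lock, no client lock FOP
        try:
            self.apply_write(path, offset, data)
            for follower in self.followers:  # 1:k fan-out from the leader
                follower.apply_write(path, offset, data)
        finally:
            self.locked.discard(path)
        return self.file_length[path]        # length recorded by the leader
```

The client makes a single call to the leader; locking and recording the
file length happen on the server side, and the only fan-out left is from
the leader to its k followers.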
C) For DHT-related issues with the fan-out of directory operations,
work on this is being discussed as DHT2 here [4].
The central theme of DHT2 is that a directory lives in one subvolume,
which eliminates the fan-out and also brings better consistency to the
various FOPs that DHT performs.
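A small sketch of the difference for mkdir, under stated assumptions:
mkdir_targets is a hypothetical helper, and picking the subvolume by
hashing the directory name is my assumption for the illustration, not
the placement scheme described in [4].

```python
import hashlib


def mkdir_targets(name, subvolumes, one_subvolume=False):
    """Return the subvolumes that would receive a mkdir for this name.

    With one_subvolume=False the mkdir goes to every subvolume (the
    fan-out case). With one_subvolume=True a single subvolume is picked
    by hashing the name (an assumed placement rule, for illustration).
    """
    if not one_subvolume:
        return list(subvolumes)
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(subvolumes)
    return [subvolumes[index]]
```

With four subvolumes the first form returns four targets and the second
returns exactly one, always the same one for a given name.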
Overall, with these approaches, we (as in gluster) would/should aim for
better consistency first, and then for improved network utilization and
fewer round trips to improve performance.
Footnote: None of this is ready yet, and it will take time. This is just
a *possible* direction that gluster core is taking to address various
problems at scale.
Shyam
[1] http://www.gluster.org/community/documentation/index.php/Features/caching
[2] http://www.gluster.org/pipermail/gluster-devel/2014-February/039900.html
[3] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
[4] http://www.gluster.org/community/documentation/index.php/Features/dht-scalability
On 06/21/2015 08:33 AM, 张兵 wrote:
Thank you for your reply.
In glusterfs, some metadata is recorded in the file's extended
attributes on all bricks.
For example, on an EC volume in N+M mode, a stat of a file requires N+M
commands; a file write requires N+M locks, recording the file length,
N+M setattr commands, and finally N+M unlock commands.
With a metadata server, every metadata-related operation would need
only one command to the metadata server.
As in the old topic, MKDIR requires that mkdir be executed on all the
DHT children.
Another difficult problem is that, without centralized metadata, disk
recovery performance cannot be improved substantially. For example, on
an EC N+M volume, only the N+M bricks take part in disk reconstruction,
and rebuilding 1TB takes several hours.
With a metadata server, the data can be dispersed across all the disks,
so when a disk fails many disks can take part in the reconstruction.
How can these difficulties be solved?
At 2015-06-21 05:31:58, "Vijay Bellur" <vbellur@xxxxxxxxxx> wrote:
On Friday 19 June 2015 10:43 PM, 张兵 wrote:
Hi all,
While using glusterfs, I found that it issues a lot of file system
commands, such as stat, lookup and setfattr, which strongly affect
system performance, especially with EC volumes. One idea is to keep the
glusterfs code architecture and add a metadata server xlator to achieve
an architecture similar to GFS; then, with the same set of software,
users could choose whether or not to use a metadata server.
How do you expect the metadata server to aid performance here? There
would be network trips to the metadata servers to set/fetch the
necessary information. If the intention is to avoid the penalty of
fetching information from disk, we have been investigating the
possibility of loading md-cache as part of the brick process graph to
avoid hitting the disk for repeated fetches of attributes & extended
attributes. I expect that to be mainlined soon.
If you have other ideas on how a metadata server can improve
performance, that would be interesting to know.
Regards,
Vijay
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel