On 02/03/2016 11:49 AM, Pranith Kumar Karampuri wrote:
On 02/03/2016 09:20 AM, Shyam wrote:
On 02/02/2016 06:22 PM, Jeff Darcy wrote:
Background: The quick-read and open-behind xlators were developed to
help small-file read workloads (Apache web server, tar, etc.) by
fetching the file's data in the lookup FOP itself. When a lookup FOP is
executed, GF_CONTENT_KEY is added to xdata with a max-length, and if
this key is present the posix xlator reads the file and fills the data
into the xdata response, as long as the file size is less than the
max-length given in the xdata. So when we tar something like a kernel
tree with small files, the brick profiles show only lookups; OPEN +
READ fops are not sent over the network at all.
With dht2, the data is present on a different cluster, so we can't get
the data in lookup. Shyam was telling me that opens are also sent to
the metadata cluster. That will take perf in this use case back to
where it was before these two features were introduced, i.e. 1/3 of
current perf (lookup vs. lookup+open+read).
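To make the mechanism above concrete, here is a minimal sketch of a lookup that piggybacks file content when the request carries a content key. It is a toy model, not the posix xlator's code: `posix_lookup`, the in-memory `files` dict, and the string value used for `GF_CONTENT_KEY` are assumptions for illustration, and the strict "less than" comparison simply follows the wording above.

```python
from typing import Any, Dict

# Illustrative value only; treat the exact key string as an assumption.
GF_CONTENT_KEY = "glusterfs.content"


def posix_lookup(files: Dict[str, bytes], path: str,
                 xdata_req: Dict[str, Any]) -> Dict[str, Any]:
    """Toy lookup: return stat-like info, plus content if asked and small."""
    data = files[path]
    rsp: Dict[str, Any] = {"size": len(data), "xdata": {}}

    # The client signals interest by putting a max-length under the key.
    max_len = xdata_req.get(GF_CONTENT_KEY)

    # Only small files ride along in the lookup reply.
    if max_len is not None and len(data) < max_len:
        rsp["xdata"][GF_CONTENT_KEY] = data
    return rsp
```

With the content already in the lookup reply, a caching layer on the client can serve the subsequent open and read locally, which is why only lookups show up in the brick profile.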
This is interesting, thanks for the heads up.
Is "1/3 of current perf" based on actual measurements? My understanding
was that the translators in question exist to send requests *in
parallel* with the original lookup stream. That means it might be 3x
the messages, but it will only be 1/3 the performance if the network is
saturated. Also, the lookup is not guaranteed to be only one message.
It might be as many as N (the number of bricks), so by the reasoning
above the performance would only drop to N/(N+2). I think the real
situation is a bit more complicated - and less dire - than you suggest.
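The arithmetic behind the N/(N+2) figure can be checked with a few lines. This sketch assumes the saturated-network case, where throughput is inversely proportional to the number of messages per file; `relative_throughput` is a hypothetical helper name, not anything from the Gluster code base.

```python
from fractions import Fraction


def relative_throughput(lookup_msgs: int, extra_msgs: int = 2) -> Fraction:
    """Throughput relative to the lookup-only case on a saturated network.

    lookup_msgs: messages the lookup itself needs (1 up to N bricks).
    extra_msgs:  messages added per file (open + read = 2).
    """
    return Fraction(lookup_msgs, lookup_msgs + extra_msgs)


# Single-message lookup: the pessimistic 1/3 case.
worst = relative_throughput(1)
# Lookup fanned out to 4 bricks: 4/(4+2) = 2/3.
fanned_out = relative_throughput(4)
```

As N grows the ratio approaches 1, so the relative cost of the extra open and read shrinks when the lookup is already expensive.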
I suggest that we send some fop to the data cluster at the time of
open and change quick-read to cache this data on open (if not already
cached); then we can reduce the perf hit to 1/2 of current perf, i.e.
lookup+open.
At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should. The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is? If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?
The file data would be located based on its GFID, so before the
*first* lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash the GFID hash is used, to get immunity
against renames and the like, as a name hash could change the
location information for the file (among other reasons).
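A small sketch of why hashing the GFID rather than the name makes placement immune to renames. The hash function (SHA-1 truncated to 64 bits), the modulo placement, and the helper names are all assumptions for illustration; the thread does not say how dht2 actually maps a GFID to a data partition.

```python
import hashlib
import uuid


def bucket(key: bytes, num_partitions: int) -> int:
    """Map an arbitrary key to a partition index (illustrative hash)."""
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def locate_by_gfid(gfid: uuid.UUID, num_partitions: int) -> int:
    # The GFID never changes over the file's lifetime, so neither does this.
    return bucket(gfid.bytes, num_partitions)


def locate_by_name(name: str, num_partitions: int) -> int:
    # A rename changes the input, so the computed location may change too.
    return bucket(name.encode("utf-8"), num_partitions)
```

The cost is the one noted above: the client cannot compute `locate_by_gfid` until some earlier operation has told it the GFID.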
The open+read can be done as a single FOP:
- open for a read-only case can do access checking on the client to
allow the FOP to proceed to the DS without hitting the MDS for an
open token
The client side cache is important from this and other such
perspectives. It should also leverage upcall infra to keep the cache
loosely coherent.
One thing to note here would be, for the client to do a lookup (where
the file name should be known beforehand), either a readdir(p) has
to have happened, or the client knows the name already (say
application generated names). For the former (readdir case), there is
enough information on the client to not need a lookup, but rather
just do the open+read on the DS. For the latter, the first lookup
cannot be avoided, degrading this to a lookup+(open+read).
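The two cases can be sketched as a client-side decision. This is a hedged illustration only: `plan_read`, the cache keyed on (parent GFID, name), and the string labels for the operations are hypothetical and not part of any dht2 design document.

```python
import uuid
from typing import Dict, List, Tuple

# Hypothetical client cache: (parent gfid, name) -> file gfid,
# populated by an earlier readdirp or lookup.
GfidCache = Dict[Tuple[uuid.UUID, str], uuid.UUID]


def plan_read(cache: GfidCache, pgfid: uuid.UUID, name: str) -> List[str]:
    """Return the network operations needed to read a small file."""
    if (pgfid, name) in cache:
        # readdirp case: GFID already known, go straight to the data server.
        return ["open+read@DS"]
    # Name known but GFID not: the first lookup on the MDS is unavoidable.
    return ["lookup@MDS", "open+read@DS"]
```

For a tar-style walk, every name comes from a prior readdirp, so under these assumptions each file would cost a single combined operation on the DS.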
Some further tricks can be done to do readdir prefetching on such
workloads, as the MDS runs on a DB (eventually), piggybacking more
entries than requested on a lookup. I would possibly leave that for
later, based on performance numbers in the small file area.
I strongly suggest that we don't postpone this to later, as I think
this is a solved problem. http://www.ietf.org/rfc/rfc4122.txt section
4.3 may be of help here, i.e. create the UUID from a namespace and a
string. So we can use pgfid as the namespace and the filename as the
string. I understand that we will get into 2 hops if the file is
renamed, but it is the best we can do right now. We can take help from
the crypto team at Red Hat to make sure we do the right thing. If we
get this implementation into dht2 after the code is released, all the
files created with the old gfid generation will work with half the
possible perf.
Gah! ignore, it will lead to gfid collisions :-/
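For reference, the name-based scheme from RFC 4122 section 4.3 is available in Python as `uuid.uuid3` (MD5) and `uuid.uuid5` (SHA-1). The sketch below uses `uuid5` with the parent GFID as namespace. The message does not spell out how the collisions arise; one plausible scenario, assumed here, is that a renamed file keeps its GFID while a new file created under the old name derives the same one. `name_based_gfid` is a hypothetical helper.

```python
import uuid


def name_based_gfid(pgfid: uuid.UUID, filename: str) -> uuid.UUID:
    """Derive a GFID from (parent gfid, name), per RFC 4122 section 4.3."""
    return uuid.uuid5(pgfid, filename)


pgfid = uuid.UUID("00000000-0000-0000-0000-000000000001")

# 1. A file is created as "a" and gets a derived GFID.
original = name_based_gfid(pgfid, "a")

# 2. It is renamed to "b". A GFID must stay stable across renames,
#    so the file keeps `original` even though its name is now "b".

# 3. A new file is created as "a" in the same directory.
newcomer = name_based_gfid(pgfid, "a")

# Assumed collision scenario: two live files, one GFID.
collision = (original == newcomer)
```

Under that assumption the derivation is deterministic by design, which is exactly what makes it unsafe as a source of unique identifiers once names can be reused.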
Pranith
Pranith
Shyam
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel