----- Original Message -----
> From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> To: "Jeff Darcy" <jdarcy@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Tuesday, February 2, 2016 7:52:25 PM
> Subject: Re: regarding GF_CONTENT_KEY and dht2 - perf with small files
>
> On 02/02/2016 06:22 PM, Jeff Darcy wrote:
> >> Background: The quick-read and open-behind xlators were developed to
> >> help small-file read workloads (e.g. an Apache webserver, tar) by
> >> returning the file's data in the lookup FOP itself. When a lookup FOP
> >> is executed, GF_CONTENT_KEY is added to xdata along with a max-length;
> >> if that key is present and the file size is less than the max-length,
> >> the posix xlator reads the file and fills the data into the xdata
> >> response. So when we tar something like a kernel tree full of small
> >> files, the brick profiles show nothing but lookups; OPEN and READ
> >> fops are not sent over the network at all.
> >>
> >> With dht2, because the data is present on a different cluster, we
> >> can't get the data in lookup. Shyam was telling me that opens are
> >> also sent to the metadata cluster. That will take perf in this use
> >> case back to where it was before these two features were introduced,
> >> i.e. 1/3 of current perf (lookup vs. lookup+open+read).
> > Is "1/3 of current perf" based on actual measurements? My understanding
> > was that the translators in question exist to send requests *in
> > parallel* with the original lookup stream. That means it might be 3x
> > the messages, but it will only be 1/3 the performance if the network
> > is saturated. Also, the lookup is not guaranteed to be only one
> > message. It might be as many as N (the number of bricks), so by the
> > reasoning above the performance would only drop to N/(N+2). I think
> > the real situation is a bit more complicated - and less dire - than
> > you suggest.
> As per what I heard, when quick-read (now divided into open-behind and
> quick-read) was introduced, webserver users reported a 300% to 400%
> perf improvement.

I second that. I too have heard of similar improvements for webserver
use cases (quick-read was first written with Apache as the use case). I
tried looking for previous data on this, but unfortunately couldn't find
any. Nevertheless, we can do some performance benchmarking ourselves.
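For anyone who hasn't looked at that code path, here is a rough,
self-contained sketch of what the GF_CONTENT_KEY exchange described in
the background above amounts to. This is a toy model, not the actual
quick-read/posix code: xdata is modelled as a single key/value slot
instead of a dict_t, and MAX_CONTENT is a made-up stand-in for
quick-read's file-size cap.

/* Toy model of the GF_CONTENT_KEY fast path (not the real xlator code). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define GF_CONTENT_KEY "glusterfs.content"  /* the real key name */
#define MAX_CONTENT    65536                /* made-up size limit */

struct xdata {            /* stand-in for a dict_t */
    const char *key;
    size_t      max_len;  /* request: biggest file we want inlined */
    char       *content;  /* response: file data, if it fit */
    size_t      size;
};

/* Server side: conceptually what posix does during lookup. */
static void server_lookup(const char *path, struct xdata *xd)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return;
    if (!xd->key || strcmp(xd->key, GF_CONTENT_KEY) != 0)
        return;                    /* caller didn't ask for content */
    if ((size_t)st.st_size > xd->max_len)
        return;                    /* too big: client must OPEN+READ */

    FILE *fp = fopen(path, "rb");
    if (!fp)
        return;
    xd->content = malloc(st.st_size);
    if (xd->content)
        xd->size = fread(xd->content, 1, st.st_size, fp);
    fclose(fp);
}

int main(int argc, char **argv)
{
    /* Client side: quick-read piggybacks the request on lookup. */
    struct xdata xd = { .key = GF_CONTENT_KEY, .max_len = MAX_CONTENT };

    server_lookup(argc > 1 ? argv[1] : "a.txt", &xd);

    if (xd.content) {
        /* Data arrived with the lookup reply: reads are served from
         * this cache, so no OPEN or READ ever hits the network. */
        printf("inlined %zu bytes with lookup\n", xd.size);
        free(xd.content);
    } else {
        printf("file absent or too large: fall back to open+read\n");
    }
    return 0;
}

When the file fits under the limit, the data rides back on the lookup
reply and OPEN/READ never touch the wire; when it doesn't, the client
falls back to the normal open+read path.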
> We should definitely test it once we have enough code to do so. I am
> just giving a heads up.
>
> Having said that, for 'tar' I think we can most probably do a better
> job in dht2, because even after readdirp a nameless lookup comes; if it
> has GF_CONTENT_KEY we should send it to the data cluster directly. For
> the webserver use case I don't have any ideas.
>
> At least on my laptop this is what I saw; on a setup with separate
> client and server machines the situation could be worse. This is a
> distribute volume with one brick.
>
> root@localhost - /mnt/d1
> 19:42:52 :) ⚡ time tar cf a.tgz a
>
> real    0m6.987s
> user    0m0.089s
> sys     0m0.481s
>
> root@localhost - /mnt/d1
> 19:43:22 :) ⚡ cd
>
> root@localhost - ~
> 19:43:25 :) ⚡ umount /mnt/d1
>
> root@localhost - ~
> 19:43:27 :) ⚡ gluster volume set d1 open-behind off
> volume set: success
>
> root@localhost - ~
> 19:43:47 :) ⚡ gluster volume set d1 quick-read off
> volume set: success
>
> root@localhost - ~
> 19:44:03 :( ⚡ gluster volume stop d1
> Stopping volume will make its data inaccessible. Do you want to
> continue? (y/n) y
> volume stop: d1: success
>
> root@localhost - ~
> 19:44:09 :) ⚡ gluster volume start d1
> volume start: d1: success
>
> root@localhost - ~
> 19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1
>
> root@localhost - ~
> 19:44:29 :) ⚡ cd /mnt/d1
>
> root@localhost - /mnt/d1
> 19:44:30 :) ⚡ time tar cf b.tgz a
>
> real    0m12.176s
> user    0m0.098s
> sys     0m0.582s
>
> Pranith
>
> >> I suggest that we send some fop at the time of open to the data
> >> cluster and change quick-read to cache this data on open (if not
> >> already cached); then we can reduce the perf hit to 1/2 of current
> >> perf, i.e. lookup+open.
> > At first glance, it seems pretty simple to do something like this,
> > and pretty obvious that we should. The tricky question is: where
> > should we send that other op, before lookup has told us where the
> > partition containing that file is? If there's some reasonable guess
> > we can make, then sending an open+read in parallel with the lookup
> > will be helpful. If not, then it will probably be a waste of time and
> > network resources. Shyam, is enough of this information being cached
> > *on the clients* to make this effective?
> Pranith
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel
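To make the "cache on open" proposal quoted above a bit more concrete,
here is a rough, self-contained sketch of the behaviour it describes.
Again, this is a toy model and not a patch against quick-read: qr_cache,
open_and_prefetch and data_cluster_open_read are invented names, and
MAX_CONTENT is a made-up limit.

/* Toy sketch of "fetch-and-cache on open" (not real quick-read code). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CONTENT 65536   /* made-up inline-read limit */

struct qr_cache {           /* per-inode content cache in quick-read */
    char  *buf;
    size_t size;
    int    valid;
};

/* Pretend this is the wire call to the *data* cluster: open the file
 * and read up to max bytes in the same round trip. */
static size_t data_cluster_open_read(const char *path, char *buf, size_t max)
{
    FILE *fp = fopen(path, "rb");
    if (!fp)
        return 0;
    size_t n = fread(buf, 1, max, fp);
    fclose(fp);
    return n;
}

/* Proposed open path: piggyback a read on open and fill the cache, so
 * the per-file cost becomes lookup+open instead of lookup+open+read. */
static void open_and_prefetch(const char *path, struct qr_cache *c)
{
    if (c->valid)
        return;                 /* lookup already inlined the data */
    c->buf = malloc(MAX_CONTENT);
    if (!c->buf)
        return;
    c->size = data_cluster_open_read(path, c->buf, MAX_CONTENT);
    c->valid = 1;
}

/* Reads are then served locally from the cache. */
static size_t qr_read(struct qr_cache *c, char *out, size_t off, size_t len)
{
    if (!c->valid || off >= c->size)
        return 0;
    if (len > c->size - off)
        len = c->size - off;
    memcpy(out, c->buf + off, len);
    return len;
}

int main(void)
{
    struct qr_cache c = { 0 };
    char out[128];

    open_and_prefetch("a.txt", &c);               /* one round trip   */
    size_t n = qr_read(&c, out, 0, sizeof(out));  /* zero round trips */
    printf("read %zu bytes from cache\n", n);
    free(c.buf);
    return 0;
}

With something like this, the round trips per small file drop from three
(lookup+open+read) to two (lookup+open), which is where the "1/2 of
current perf" figure comes from. Whether anything more can be overlapped
(e.g. firing the open+read in parallel with the lookup, as Jeff
describes) comes back to his question about how much location
information is already cached on the clients.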