----- Original Message -----
> From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> To: "Jeff Darcy" <jdarcy@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Tuesday, February 2, 2016 7:52:25 PM
> Subject: Re: regarding GF_CONTENT_KEY and dht2 - perf with small files
>
> On 02/02/2016 06:22 PM, Jeff Darcy wrote:
> >> Background: The quick-read and open-behind xlators were developed to
> >> help small-file read workloads (e.g. an Apache webserver, tar) by
> >> returning the file's data in the lookup FOP itself. When a lookup FOP
> >> is executed, GF_CONTENT_KEY is added to xdata along with a max-length;
> >> if that key is present and the file size is less than the max-length,
> >> the posix xlator reads the file and fills the data into the xdata
> >> response. So when we tar something like a kernel tree full of small
> >> files, the brick profiles show nothing but lookups; OPEN and READ
> >> fops are not sent over the network at all.
> >>
> >> With dht2, because the data is present on a different cluster, we
> >> can't get the data in lookup. Shyam was telling me that opens are
> >> also sent to the metadata cluster. That will take perf in this use
> >> case back to where it was before these two features were introduced,
> >> i.e. 1/3 of current perf (lookup vs. lookup+open+read).
> > Is "1/3 of current perf" based on actual measurements? My understanding
> > was that the translators in question exist to send requests *in
> > parallel* with the original lookup stream. That means it might be 3x
> > the messages, but it will only be 1/3 the performance if the network
> > is saturated. Also, the lookup is not guaranteed to be only one
> > message. It might be as many as N (the number of bricks), so by the
> > reasoning above the performance would only drop to N/(N+2). I think
> > the real situation is a bit more complicated - and less dire - than
> > you suggest.
> As per what I heard, when quick-read (now divided into open-behind and
> quick-read) was introduced, webserver users reported a 300% to 400%
> perf improvement.

I second that. I too have heard of similar improvements for webserver
use cases (quick-read was first written with Apache as the use case). I
tried looking for previous data on this, but unfortunately couldn't find
any. Nevertheless, we can do some performance benchmarking ourselves.
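For anyone who hasn't looked at that code path, here is a rough,
self-contained sketch of what the GF_CONTENT_KEY exchange described in
the background above amounts to. This is a toy model, not the actual
quick-read/posix code: xdata is modelled as a single key/value slot
instead of a dict_t, and MAX_CONTENT is a made-up stand-in for
quick-read's file-size cap.

/* Toy model of the GF_CONTENT_KEY fast path (not the real xlator code). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define GF_CONTENT_KEY "glusterfs.content"  /* the real key name */
#define MAX_CONTENT    65536                /* made-up size limit */

struct xdata {            /* stand-in for a dict_t */
    const char *key;
    size_t      max_len;  /* request: biggest file we want inlined */
    char       *content;  /* response: file data, if it fit */
    size_t      size;
};

/* Server side: conceptually what posix does during lookup. */
static void server_lookup(const char *path, struct xdata *xd)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return;
    if (!xd->key || strcmp(xd->key, GF_CONTENT_KEY) != 0)
        return;                    /* caller didn't ask for content */
    if ((size_t)st.st_size > xd->max_len)
        return;                    /* too big: client must OPEN+READ */

    FILE *fp = fopen(path, "rb");
    if (!fp)
        return;
    xd->content = malloc(st.st_size);
    if (xd->content)
        xd->size = fread(xd->content, 1, st.st_size, fp);
    fclose(fp);
}

int main(int argc, char **argv)
{
    /* Client side: quick-read piggybacks the request on lookup. */
    struct xdata xd = { .key = GF_CONTENT_KEY, .max_len = MAX_CONTENT };

    server_lookup(argc > 1 ? argv[1] : "a.txt", &xd);

    if (xd.content) {
        /* Data arrived with the lookup reply: reads are served from
         * this cache, so no OPEN or READ ever hits the network. */
        printf("inlined %zu bytes with lookup\n", xd.size);
        free(xd.content);
    } else {
        printf("file absent or too large: fall back to open+read\n");
    }
    return 0;
}

When the file fits under the limit, the data rides back on the lookup
reply and OPEN/READ never touch the wire; when it doesn't, the client
falls back to the normal open+read path.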
> We should definitely test it once we have enough code to do so. I am
> just giving a heads up.
>
> Having said that, for 'tar' I think we can most probably do a better
> job in dht2, because even after readdirp a nameless lookup comes; if it
> has GF_CONTENT_KEY we should send it to the data cluster directly. For
> the webserver use case I don't have any ideas.
>
> At least on my laptop this is what I saw; on a setup with separate
> client and server machines the situation could be worse. This is a
> distribute volume with one brick.
>
> root@localhost - /mnt/d1
> 19:42:52 :) ⚡ time tar cf a.tgz a
>
> real    0m6.987s
> user    0m0.089s
> sys     0m0.481s
>
> root@localhost - /mnt/d1
> 19:43:22 :) ⚡ cd
>
> root@localhost - ~
> 19:43:25 :) ⚡ umount /mnt/d1
>
> root@localhost - ~
> 19:43:27 :) ⚡ gluster volume set d1 open-behind off
> volume set: success
>
> root@localhost - ~
> 19:43:47 :) ⚡ gluster volume set d1 quick-read off
> volume set: success
>
> root@localhost - ~
> 19:44:03 :( ⚡ gluster volume stop d1
> Stopping volume will make its data inaccessible. Do you want to
> continue? (y/n) y
> volume stop: d1: success
>
> root@localhost - ~
> 19:44:09 :) ⚡ gluster volume start d1
> volume start: d1: success
>
> root@localhost - ~
> 19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1
>
> root@localhost - ~
> 19:44:29 :) ⚡ cd /mnt/d1
>
> root@localhost - /mnt/d1
> 19:44:30 :) ⚡ time tar cf b.tgz a
>
> real    0m12.176s
> user    0m0.098s
> sys     0m0.582s
>
> Pranith
>
> >> I suggest that we send some fop at the time of open to the data
> >> cluster and change quick-read to cache this data on open (if not
> >> already cached); then we can reduce the perf hit to 1/2 of current
> >> perf, i.e. lookup+open.
> > At first glance, it seems pretty simple to do something like this,
> > and pretty obvious that we should. The tricky question is: where
> > should we send that other op, before lookup has told us where the
> > partition containing that file is? If there's some reasonable guess
> > we can make, then sending an open+read in parallel with the lookup
> > will be helpful. If not, then it will probably be a waste of time and
> > network resources. Shyam, is enough of this information being cached
> > *on the clients* to make this effective?
> Pranith
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel
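To make the "cache on open" proposal quoted above a bit more concrete,
here is a rough, self-contained sketch of the behaviour it describes.
Again, this is a toy model and not a patch against quick-read: qr_cache,
open_and_prefetch and data_cluster_open_read are invented names, and
MAX_CONTENT is a made-up limit.

/* Toy sketch of "fetch-and-cache on open" (not real quick-read code). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CONTENT 65536   /* made-up inline-read limit */

struct qr_cache {           /* per-inode content cache in quick-read */
    char  *buf;
    size_t size;
    int    valid;
};

/* Pretend this is the wire call to the *data* cluster: open the file
 * and read up to max bytes in the same round trip. */
static size_t data_cluster_open_read(const char *path, char *buf, size_t max)
{
    FILE *fp = fopen(path, "rb");
    if (!fp)
        return 0;
    size_t n = fread(buf, 1, max, fp);
    fclose(fp);
    return n;
}

/* Proposed open path: piggyback a read on open and fill the cache, so
 * the per-file cost becomes lookup+open instead of lookup+open+read. */
static void open_and_prefetch(const char *path, struct qr_cache *c)
{
    if (c->valid)
        return;                 /* lookup already inlined the data */
    c->buf = malloc(MAX_CONTENT);
    if (!c->buf)
        return;
    c->size = data_cluster_open_read(path, c->buf, MAX_CONTENT);
    c->valid = 1;
}

/* Reads are then served locally from the cache. */
static size_t qr_read(struct qr_cache *c, char *out, size_t off, size_t len)
{
    if (!c->valid || off >= c->size)
        return 0;
    if (len > c->size - off)
        len = c->size - off;
    memcpy(out, c->buf + off, len);
    return len;
}

int main(void)
{
    struct qr_cache c = { 0 };
    char out[128];

    open_and_prefetch("a.txt", &c);               /* one round trip   */
    size_t n = qr_read(&c, out, 0, sizeof(out));  /* zero round trips */
    printf("read %zu bytes from cache\n", n);
    free(c.buf);
    return 0;
}

With something like this, the round trips per small file drop from three
(lookup+open+read) to two (lookup+open), which is where the "1/2 of
current perf" figure comes from. Whether anything more can be overlapped
(e.g. firing the open+read in parallel with the lookup, as Jeff
describes) comes back to his question about how much location
information is already cached on the clients.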