Re: Tuning for small files

Right, so what I did is:
- on one node (gluster 3.7.3), run 'gluster volume profile shared start'
- on the client mount, run the test
- on the node, run 'gluster volume profile shared info' (and copy the output)
- finally, run 'gluster volume profile shared stop'
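
In other words, the whole run boils down to this (the redirect to a file is just how I kept a copy of the output):

    # start collecting per-brick statistics on the 'shared' volume
    gluster volume profile shared start
    # ... run the test on the client mount ...
    # dump the statistics gathered so far and keep a copy
    gluster volume profile shared info > profile-shared.txt
    # stop collecting
    gluster volume profile shared stop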

I repeated this for two different tests (a simple rm followed by an svn checkout, and a more complete build test), on an NFS mount and on a FUSE mount.

To my surprise, the svn checkout is actually a lot faster (3x) on the FUSE mount than on NFS.
However, the build test is a lot slower on the FUSE mount (+50%, which is a lot considering the compilation is CPU-intensive, not just I/O!).

Ben, I will send you the profile outputs separately now...

On 29 Sep 2015 9:40 pm, "Ben Turner" <bturner@xxxxxxxxxx> wrote:
----- Original Message -----
> From: "Thibault Godouet" <tibo92@xxxxxxxxxxx>
> To: "Ben Turner" <bturner@xxxxxxxxxx>
> Cc: hmlth@xxxxxxxxxx, gluster-users@xxxxxxxxxxx
> Sent: Tuesday, September 29, 2015 1:36:20 PM
> Subject: Re: Tuning for small files
>
> Ben,
>
> I suspect meta-data / 'ls -l' performance is very important for my svn
> use-case.
>
> Having said that, what do you mean by small-file performance? I thought
> what people meant by this was really the overhead of meta-data, with an
> 'ls -l' being a sort of extreme case (pure meta-data).
> Obviously if you also have to read and write actual data (albeit not much
> at all per file), then the effect of the meta-data overhead would get
> diluted to a degree, but potentially still very present.

Where you run into problems with small files on gluster is the latency of sending data over the wire.  For every small-file create there are a number of file operations we have to do on every file.  For example, we have to do at least 1 lookup per brick to make sure the file doesn't exist anywhere before we create it.  We actually got it down to 1 per brick with lookup-optimize on; it's 2 IIRC (maybe more?) with it disabled.  So the time we spend waiting for those lookups to complete adds latency, which lowers the number of files that can be created in a given period of time.  Lookup-optimize was implemented in 3.7 and, like I said, it's now at the optimal 1 lookup per brick on creates.
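
If you want to turn it on, lookup-optimize is a normal volume option; a minimal sketch, assuming your volume really is called 'shared':

    # enable lookup-optimize so creates only need 1 lookup per brick
    gluster volume set shared cluster.lookup-optimize on
    # verify the setting took
    gluster volume get shared cluster.lookup-optimize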

The other problem with small files that we had in 3.6 is that we were using a single-threaded event listener (epoll is what we call it).  This single thread would spike a CPU to 100% (a "hot thread") and glusterfs would become CPU-bound.  The solution was to make the event listener multi-threaded so that we could spread the epoll load across CPUs, thereby eliminating the CPU bottleneck and allowing us to process more events in a given time.  FYI, epoll defaults to 2 threads in 3.7, but I have seen cases where I was still bottlenecked on CPU without 4 threads in my environments, so I usually do 4.  This was implemented in upstream 3.7 and backported to RHGS 3.0.4 if you have a RH-based version.
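
The thread counts are volume options as well; a sketch along the same lines (again assuming the 'shared' volume):

    # raise the epoll thread count on both sides (3.7 default is 2)
    gluster volume set shared client.event-threads 4
    gluster volume set shared server.event-threads 4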

Fixing these two issues led to the performance gains I was talking about with small-file creates.  You are probably thinking from the perspective of a distributed FS with a metadata server (MDS), where the MDS is the bottleneck for small files.  Since gluster doesn't have an MDS, that load is transferred to the clients / servers, and that led to a CPU bottleneck when epoll was single-threaded.  I think this is the piece you may have been missing.

>
> Would there be an easy way to tell how much time is spent on meta-data vs.
> data in a profile output?

Yep!  Can you gather some profiling info and send it to me?

>
> One thing I wonder: do your comments apply to both native FUSE and NFS
> mounts?
>
> Finally, all this brings me back to my initial question really: are there
> any configuration tuning recommendations for my requirement (small-file
> reads/writes on a pair of nodes with replication) beyond the thread counts
> and lookup-optimize?
> Or are those by far the most important in this scenario?

For creating a bunch of small files, those are the only two I know of that will have a large impact.  Maybe some others from the list can give some input on anything else we can do here.
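
So if it helps, the whole tuning from this thread boils down to something like this sketch (same assumed 'shared' volume as before):

    gluster volume set shared cluster.lookup-optimize on
    gluster volume set shared client.event-threads 4
    gluster volume set shared server.event-threads 4
    # sanity-check what is actually set
    gluster volume get shared all | grep -E 'lookup-optimize|event-threads'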

-b

>
> Thx,
> Thibault.
> ----- Original Message -----
> > From: hmlth@xxxxxxxxxx
> > To: abauer@xxxxxxxxx
> > Cc: gluster-users@xxxxxxxxxxx
> > Sent: Monday, September 28, 2015 7:40:52 AM
> > Subject: Re: Tuning for small files
> >
> > I'm also quite interested in small-file performance optimization, but
> > I'm a bit confused about the best option between 3.6/3.7.
> >
> > Ben Turner was saying that 3.6 might give the best performances:
> > http://www.gluster.org/pipermail/gluster-users/2015-September/023733.html
> >
> > What kind of gain is expected (with consistent-metadata) if this
> > regression is solved?
>
> Just to be clear, the issue I am talking about is metadata only (think 'ls
> -l' or file browsing).  It doesn't affect small-file perf (well, not that
> much; I'm sure a little, but I have never quantified it).  With server and
> client event threads set to 4 plus lookup-optimize, I see a 200-300% gain
> on my systems on 3.7 vs 3.6 builds.  If I needed fast metadata I would go
> with 3.6; if I needed fast small-file I would go with 3.7.  If I needed
> both, I would pick the lesser of the two evils, go with that one, and
> upgrade when the fix is released.
>
> -b
>
>
> >
> > I tried 3.6.5 (the latest version for Debian Jessie), and it's a bit
> > better than 3.7.4, but not by much (10-15%).
> >
> > I was also wondering if there are recommendations for the underlying
> > file system of the bricks (xfs, ext4, tuning...).
> >
> >
> > Regards
> >
> > Thomas HAMEL
> >
> > On 2015-09-28 12:04, André Bauer wrote:
> > > If you're not already on GlusterFS 3.7.x, I would recommend an update
> > > first.
> > >
> > > On 25.09.2015 at 17:49, Thibault Godouet wrote:
> > >> Hi,
> > >>
> > >> There are quite a few tuning parameters for Gluster (as seen in
> > >> 'gluster volume get XYZ all'), but I didn't find much documentation
> > >> on those. Some people do seem to set at least some of them, so the
> > >> knowledge must be somewhere...
> > >>
> > >> Is there a good source of information to understand what they mean,
> > >> and recommendations on how to set them to get good small-file
> > >> performance?
> > >>
> > >> Basically what I'm trying to optimize for is svn operations (e.g. svn
> > >> checkout, or svn branch) on a replicated 2 x 1 volume (hosted on 2
> > >> VMs, 16GB RAM, 4 cores each, 10Gb/s network tested at full speed),
> > >> using an NFS mount, which appears much faster than FUSE in this case
> > >> (but still much slower than when served by a normal NFS server).
> > >> Any recommendation for such a setup?
> > >>
> > >> Thanks,
> > >> Thibault.
> > >>
> > >>
> > >>
> > >
> > >
> > > --
> > > Kind regards
> > > André Bauer
> > >
> > > MAGIX Software GmbH
> > > André Bauer
> > > Administrator
> > > August-Bebel-Straße 48
> > > 01219 Dresden
> > > GERMANY
> > >
> > > tel.: 0351 41884875
> > > e-mail: abauer@xxxxxxxxx
> > > www.magix.com <http://www.magix.com/>
> > >
> > >
> > > Geschäftsführer | Managing Directors: Dr. Arnd Schröder, Michael Keith
> > > Amtsgericht | Commercial Register: Berlin Charlottenburg, HRB 127205
>
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
