On Fri, Jun 08, 2012 at 12:19:58AM -0400, olav johansen wrote:
> # mount -t glusterfs fs1:/data-storage /storage
> I've copied over my data to it again and doing a ls several times,
> takes ~0.5 seconds:
> [@web1 files]# time ls -all|wc -l

Like I said before, please also try without the "-l" flag and compare
the results. My guess is that "ls -al" or "ls -alR" is not
representative of the *real* workload you are going to ask of your
system (i.e. "scan all the files in this directory, sequentially, and
perform a stat() call on each one in turn") - but please correct me if
I'm wrong. Either way, you need to measure how much that "-l" is
costing you.

> Doing the same thing on the raw os files on one node takes 0.021s
> [@fs2 files]# time ls -all|wc -l
> 1989
>
> real    0m0.021s
> user    0m0.007s
> sys     0m0.015s

In that case it's probably all coming from cache. If you wanted to
test actual disk performance then you would run

    echo 3 > /proc/sys/vm/drop_caches

before each test (on both client and server, if they are different
machines). But from what you say, it sounds like you are actually more
interested in the cached answers anyway.

> Just as crazy reference, on another single server with SSD's (Raid 10)
> drives I get:
> files# time ls -alR|wc -l
> 2260484
>
> real    0m15.761s
> user    0m5.170s
> sys     0m7.670s
>
> For the same operation. (this server even have more files...)

You are not comparing like-for-like. A replicated volume behaves very
differently from a single brick or a distributed volume, as explained
before. If you compared a two-brick (HD) setup with an identical
two-brick (SSD) setup, then that would be meaningful. I would expect
that if everything is cacheable, you'd get the same results for both;
in that case, what you'd have shown is that the latency of the
open/stat and self-heal checks is the cause of the delay.

Like I said before, I expect that adding the "-l" flag to ls is giving
you lots of cumulative latency.
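To make the comparison concrete, here is a sketch of the timing run I
have in mind. It uses a throwaway directory so it runs anywhere; point
it at the real mount (e.g. /storage/files from your test) to measure
the actual volume. The drop_caches step is only needed for cold-cache
numbers and must be run as root on both client and server:

```shell
#!/bin/sh
# Sketch: compare a plain directory scan (readdir only) against one
# that also stat()s every entry - the per-file stat() round trips are
# where replication latency accumulates.
# The directory below is a stand-in; substitute your GlusterFS mount.
DIR=$(mktemp -d)
for i in $(seq 1 100); do touch "$DIR/file$i"; done

# For cold-cache measurements, first run (as root, on client AND
# server):
#   sync; echo 3 > /proc/sys/vm/drop_caches

time ls -aR  "$DIR" | wc -l   # readdir only, no per-file stat()
time ls -laR "$DIR" | wc -l   # readdir + one lstat() per entry

rm -rf "$DIR"
```

The difference between the two "real" times is roughly the cumulative
cost of the stat() calls that "-l" forces.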
This means that the server is actually idle for much of the time,
waiting for the next request, so it has spare capacity for handling
other clients. In other words: if your real workload is lots of
clients accessing the system concurrently, you'll get much better
total throughput than your simple tests suggest, since those are a
single client performing single operations one after the other.

> If I added two more bricks to the cluster / replicated, would this
> double read speed?

Definitely not. The latency would be the same; it's just that some
requests would go to bricks A and B, and other requests would go to
bricks C and D. For a single client doing one operation at a time, the
other two bricks would be idle and would not speed things up. However,
if you had concurrent accesses from multiple clients, the extra bricks
would give extra capacity, so the total *throughput* would be higher
when multiple clients are active.

So I repeat my earlier advice. If you really want to understand where
the performance issues are coming from, these two tests may highlight
them:

* Compare the same 2-brick replicated volume, using "ls -aR" versus
  "ls -laR"

* Compare a 2-brick replicated volume to a 2-brick distributed volume,
  using "ls -laR" on both

Regards,

Brian.
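For the second test, the distributed comparison volume can be created
alongside the replicated one. A sketch of the setup, assuming your two
servers are fs1 and fs2 and using hypothetical brick paths
(/export/brick-rep, /export/brick-dist) - these cannot run outside a
real Gluster cluster, so treat them as a template:

```shell
# Replicated volume: every file exists on both bricks, so each stat()
# involves both servers (plus self-heal checking).
gluster volume create test-rep replica 2 \
    fs1:/export/brick-rep fs2:/export/brick-rep
gluster volume start test-rep

# Distributed volume: each file lives on exactly one brick, so a
# stat() only touches one server.
gluster volume create test-dist \
    fs1:/export/brick-dist fs2:/export/brick-dist
gluster volume start test-dist

# Mount both, copy the same file tree into each, then compare:
#   time ls -laR /mnt/rep  | wc -l
#   time ls -laR /mnt/dist | wc -l
mount -t glusterfs fs1:/test-rep  /mnt/rep
mount -t glusterfs fs1:/test-dist /mnt/dist
```

If the distributed volume is markedly faster under "ls -laR", that
points at the replication-side open/stat and self-heal latency rather
than raw disk or network speed.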