Hi All,

Thanks for the great feedback. I had changed the IPs, and when checking the logs I noticed one server wasn't connecting correctly. To rule out any mistakes on my part I've re-done the bricks from scratch with clean configurations; the mount info is attached below. It is still not performing 'great' compared to a single NFS mount.

For the application we're running, the files don't change; we only add / delete files, so I'd like to get directory / file info cached as much as possible.

Config info:

gluster> volume info data-storage

Volume Name: data-storage
Type: Replicate
Volume ID: cc91c107-bdbb-4179-a097-cdd3e9d5ac93
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fs1:/data/storage
Brick2: fs2:/data/storage
gluster>

On my web1 node I mounted:

# mount -t glusterfs fs1:/data-storage /storage

I've copied my data over to it again, and running ls several times takes ~0.5 seconds:

[@web1 files]# time ls -all|wc -l
1989

real    0m0.485s
user    0m0.022s
sys     0m0.109s

[@web1 files]# time ls -all|wc -l
1989

real    0m0.489s
user    0m0.016s
sys     0m0.116s

[@web1 files]# time ls -all|wc -l
1989

real    0m0.493s
user    0m0.018s
sys     0m0.115s

Doing the same thing on the raw OS files on one node takes 0.021s:

[@fs2 files]# time ls -all|wc -l
1989

real    0m0.021s
user    0m0.007s
sys     0m0.015s

[@fs2 files]# time ls -all|wc -l
1989

real    0m0.020s
user    0m0.008s
sys     0m0.013s

A full recursive directory listing seems even slower:

[@web1 files]# time ls -alR|wc -l
2242956

real    74m0.660s
user    0m20.117s
sys     1m24.734s

[@web1 files]# time ls -alR|wc -l
2242956

real    26m27.159s
user    0m17.387s
sys     1m11.217s

[@web1 files]# time ls -alR|wc -l
2242956

real    27m38.163s
user    0m18.333s
sys     1m19.824s

Just as a crazy reference, on another single server with SSDs (RAID 10) I get the following for the same operation (and this server has even more files):

files# time ls -alR|wc -l
2260484

real    0m15.761s
user    0m5.170s
sys     0m7.670s

My goal is to get this directory listing as fast as possible. I don't have the hardware / budget to test an SSD configuration, but would an SSD setup give me a ~1 minute directory listing time (assuming the Gluster mount stays about 4 times slower than a single node)? If I added two more bricks to the cluster (distributed-replicated), would this double the read speed?

Thanks for any insight!
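One more thing -- since the files never change once they're written, my plan is to lean harder on the caches. Below is only a sketch of what I intend to try: the option names are what I could find for 3.3 and the timeout / size values are guesses, so please correct me if any of them are wrong or unsafe on a replicated volume.

Remount the client with longer FUSE attribute / entry caching:

# mount -t glusterfs -o attribute-timeout=600,entry-timeout=600 fs1:/data-storage /storage

Raise the io-cache size and keep metadata caching (stat-prefetch / md-cache) enabled on the volume:

gluster> volume set data-storage performance.cache-size 1GB
gluster> volume set data-storage performance.cache-refresh-timeout 60
gluster> volume set data-storage performance.stat-prefetch on

And for my last question, I assume growing from 1x2 to a 2x2 distributed-replicate layout would be done with something like the following (fs3 / fs4 being two hypothetical new servers):

gluster> volume add-brick data-storage fs3:/data/storage fs4:/data/storage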
-------------------- storage.log from web1 on mount ---------------------

[2012-06-07 20:47:45.584320] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.3.0
[2012-06-07 20:47:45.624548] I [io-cache.c:1549:check_cache_size_ok] 0-data-storage-quick-read: Max cache size is 8252092416
[2012-06-07 20:47:45.624612] I [io-cache.c:1549:check_cache_size_ok] 0-data-storage-io-cache: Max cache size is 8252092416
[2012-06-07 20:47:45.628148] I [client.c:2142:notify] 0-data-storage-client-0: parent translators are ready, attempting connect on transport
[2012-06-07 20:47:45.631059] I [client.c:2142:notify] 0-data-storage-client-1: parent translators are ready, attempting connect on transport
Given volfile:
+------------------------------------------------------------------------------+
  1: volume data-storage-client-0
  2:     type protocol/client
  3:     option remote-host fs1
  4:     option remote-subvolume /data/storage
  5:     option transport-type tcp
  6: end-volume
  7:
  8: volume data-storage-client-1
  9:     type protocol/client
 10:     option remote-host fs2
 11:     option remote-subvolume /data/storage
 12:     option transport-type tcp
 13: end-volume
 14:
 15: volume data-storage-replicate-0
 16:     type cluster/replicate
 17:     subvolumes data-storage-client-0 data-storage-client-1
 18: end-volume
 19:
 20: volume data-storage-write-behind
 21:     type performance/write-behind
 22:     subvolumes data-storage-replicate-0
 23: end-volume
 24:
 25: volume data-storage-read-ahead
 26:     type performance/read-ahead
 27:     subvolumes data-storage-write-behind
 28: end-volume
 29:
 30: volume data-storage-io-cache
 31:     type performance/io-cache
 32:     subvolumes data-storage-read-ahead
 33: end-volume
 34:
 35: volume data-storage-quick-read
 36:     type performance/quick-read
 37:     subvolumes data-storage-io-cache
 38: end-volume
 39:
 40: volume data-storage-md-cache
 41:     type performance/md-cache
 42:     subvolumes data-storage-quick-read
 43: end-volume
 44:
 45: volume data-storage
 46:     type debug/io-stats
 47:     option latency-measurement off
 48:     option count-fop-hits off
 49:     subvolumes data-storage-md-cache
 50: end-volume
+------------------------------------------------------------------------------+
[2012-06-07 20:47:45.642625] I [rpc-clnt.c:1660:rpc_clnt_reconfig] 0-data-storage-client-0: changing port to 24009 (from 0)
[2012-06-07 20:47:45.648604] I [rpc-clnt.c:1660:rpc_clnt_reconfig] 0-data-storage-client-1: changing port to 24009 (from 0)
[2012-06-07 20:47:49.592729] I [client-handshake.c:1636:select_server_supported_programs] 0-data-storage-client-0: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-07 20:47:49.595099] I [client-handshake.c:1636:select_server_supported_programs] 0-data-storage-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-07 20:47:49.608455] I [client-handshake.c:1433:client_setvolume_cbk] 0-data-storage-client-0: Connected to 10.1.80.81:24009, attached to remote volume '/data/storage'.
[2012-06-07 20:47:49.608489] I [client-handshake.c:1445:client_setvolume_cbk] 0-data-storage-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-07 20:47:49.608572] I [afr-common.c:3627:afr_notify] 0-data-storage-replicate-0: Subvolume 'data-storage-client-0' came back up; going online.
[2012-06-07 20:47:49.608837] I [client-handshake.c:453:client_set_lk_version_cbk] 0-data-storage-client-0: Server lk version = 1
[2012-06-07 20:47:49.616381] I [client-handshake.c:1433:client_setvolume_cbk] 0-data-storage-client-1: Connected to 10.1.80.82:24009, attached to remote volume '/data/storage'.
[2012-06-07 20:47:49.616434] I [client-handshake.c:1445:client_setvolume_cbk] 0-data-storage-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-07 20:47:49.621808] I [fuse-bridge.c:4193:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-06-07 20:47:49.622793] I [client-handshake.c:453:client_set_lk_version_cbk] 0-data-storage-client-1: Server lk version = 1
[2012-06-07 20:47:49.622873] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
[2012-06-07 20:47:49.623440] I [afr-common.c:1964:afr_set_root_inode_on_first_lookup] 0-data-storage-replicate-0: added root inode

-------------------- End storage.log -----------------------------------------------------

On Thu, Jun 7, 2012 at 9:46 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:

> hi Brian,
>     'stat' command comes as fop (File-operation) 'lookup' to the gluster
> mount which triggers self-heal. So the behavior is still same.
>     I was referring to the fop 'stat' which will be performed only on one of
> the bricks.
>     Unfortunately most of the commands and fops have same name.
> Following are some of the examples of read-fops:
> .access
> .stat
> .fstat
> .readlink
> .getxattr
> .fgetxattr
> .readv
>
> Pranith.
> ----- Original Message -----
> From: "Brian Candler" <B.Candler at pobox.com>
> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: "olav johansen" <luxis2012 at gmail.com>, gluster-users at gluster.org, "Fernando Frediani (Qube)" <fernando.frediani at qubenet.net>
> Sent: Thursday, June 7, 2012 7:06:26 PM
> Subject: Re: Performance optimization tips Gluster 3.3? (small files / directory listings)
>
> On Thu, Jun 07, 2012 at 08:34:56AM -0400, Pranith Kumar Karampuri wrote:
> > Brian,
> >     Small correction: 'sending queries to *both* servers to check they are
> > in sync - even read accesses.' Read fops like stat/getxattr etc are sent to
> > only one brick.
>
> Is that new behaviour for 3.3? My understanding was that stat() was a
> healing operation.
>
> http://gluster.org/community/documentation/index.php/Gluster_3.2:_Triggering_Self-Heal_on_Replicate
>
> If this is no longer true, then I'd like to understand what happens after a
> node has been down and comes up again. I understand there's a self-healing
> daemon in 3.3, but what if you try to access a file which has not yet been
> healed?
>
> I'm interested in understanding this, especially the split-brain scenarios
> (better to understand them *before* you're stuck in a problem :-)
>
> BTW I'm in the process of building a 2-node 3.3 test cluster right now.
>
> Cheers,
>
> Brian.
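P.S. On the self-heal discussion above: as far as I understand, 3.3 also exposes heal status through the CLI. Treat the commands below as a sketch -- I haven't verified the exact syntax on my own cluster yet:

gluster> volume heal data-storage info
gluster> volume heal data-storage
gluster> volume heal data-storage full

The first should list entries still pending heal, the second should trigger healing of just those entries, and 'full' should force a crawl of the whole volume.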