Re: Optimizing write performance to a few large files in a small cluster

Alexander:

Performance is quite a vague concept. Relative, even. I don't mean to get philosophical, but it is true.

To begin with, how are you connecting to the Gluster volumes? NFS? FUSE (native GlusterFS)?

What volume type are you using? Striped? Distributed?

How is your network set up? Jumbo frames?

From the details you provided, you are not a first-timer; it sounds like you've been doing a lot of research. Did you happen to test performance with other services, for instance native NFS, or even good old FTP?

Is network performance OK?
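
If you have not already, it might be worth sanity-checking the raw link between two of the nodes with something like iperf (just my suggestion for a quick test, not something you mentioned):

# on one node
iperf -s

# on another node ("node1" is a placeholder for the first node's hostname or IP)
iperf -c node1 -t 30

On a dedicated gigabit link you should see somewhere in the neighborhood of 940 Mbits/sec.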

I was fighting some read and write performance issues a couple of weeks ago on my test servers, and it turned out to be the buffer settings on my NFS client. After tweaking those, large-file copies saturated 1 Gbps.

But in the process I collected a number of interesting Gluster and sysctl tweaks that seemed to improve performance as well.

Use them at your own risk, as they affect memory usage on your servers:

For sysctl.conf:

net.core.wmem_max = 12582912
net.core.rmem_max = 12582912
net.ipv4.tcp_rmem = 10240 87380 12582912
net.ipv4.tcp_wmem = 10240 87380 12582912
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
vm.swappiness = 10
vm.dirty_background_ratio = 1
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_mem = 12582912 12582912 12582912
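
These all go in /etc/sysctl.conf; to load them without a reboot:

sysctl -p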



If using an NFS client, use the following mount options:


-o rw,async,vers=3,rsize=65536,wsize=65536
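
For reference, a full mount line using those options would look something like this (server name and mount point are placeholders, not from this thread):

mount -t nfs -o rw,async,vers=3,rsize=65536,wsize=65536 server1:/BigVol /mnt/BigVol

or the equivalent entry in /etc/fstab.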




Gluster options I am currently using:


network.remote-dio: on
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
network.ping-timeout: 20
nfs.nlm: off
nfs.addr-namelookup: off



Other Gluster options I found elsewhere that are worth a try:

gluster volume set BigVol diagnostics.brick-log-level WARNING
gluster volume set BigVol diagnostics.client-log-level WARNING
gluster volume set BigVol nfs.enable-ino32 on

gluster volume set BigVol performance.cache-max-file-size 2MB
gluster volume set BigVol performance.cache-refresh-timeout 4
gluster volume set BigVol performance.cache-size 256MB
gluster volume set BigVol performance.write-behind-window-size 4MB
gluster volume set BigVol performance.io-thread-count 32
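
After setting these, you can double-check which options actually took with:

gluster volume info BigVol

which lists the reconfigured options at the end of its output.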

Now, DO keep in mind: mine is a TEST environment, while yours is a real-life situation. 

Cheers, 

Carlos





On Mon, Mar 10, 2014 at 7:06 PM, Alexander Valys <avalys@xxxxxxxxxx> wrote:
A quick performance question.

I have a small cluster of 4 machines, 64 cores in total.  I am running a scientific simulation on them, which writes at between 0.1 and 10 MB/s (total) to roughly 64 HDF5 files.  Each HDF5 file is written by only one process.  The writes are not continuous, but consist of writing roughly 1 MB of data to each file every few seconds.

Writing to HDF5 involves a lot of reading of file metadata and random seeking within the file, since we are actually writing to about 30 datasets inside each file. I am hosting the output on a distributed Gluster volume (one brick local to each machine) to provide a unified namespace for the (very rare) case when each process needs to read the others' files.

I am seeing somewhat lower performance than I expected, i.e. a factor of approximately 4 less throughput than each node writing locally to the bare drives.  I expected the write-behind cache to buffer each write, but it seems that the writes are being quickly flushed across the network regardless of what write-behind cache size I use (32 MB currently), and the simulation stalls while waiting for the I/O operation to finish.  Anyone have any suggestions as to what to look at?  I am using gluster 3.4.2 on ubuntu 12.04.  I have flush-behind turned on, and have mounted the volume with direct-io-mode=disable, and have the cache size set to 256M.
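
For concreteness, the settings I mentioned are along these lines (the volume name here is a stand-in for the real one):

gluster volume set simvol performance.flush-behind on
gluster volume set simvol performance.write-behind-window-size 32MB
gluster volume set simvol performance.cache-size 256MB
mount -t glusterfs -o direct-io-mode=disable node1:/simvol /mnt/simvol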

The nodes are connected via a dedicated gigabit ethernet network, carrying only gluster traffic (no simulation traffic).

(sorry if this message comes through twice, I sent it yesterday but was not subscribed)
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users

