Given the discussion over the past few days, I did a quick-n-dirty test. Long-ish post, with data, pointers, etc.

Setup: gigabit connected server and client, 941 Mb/s between the two (according to iperf). The test is to untar the 2.6.37 kernel source (485 MB total untarred/uncompressed size), dropping caches before each run. All times are measured in seconds, and all mounts use default options (though we used -o intr,tcp for NFS).

    server   client   client   client        client
    local    local    NFS      Gluster-NFS   GlusterFS
    ---------------------------------------------------
    3.9      9        85.97    143.5         132.3

So the Gluster-NFS translator (using NFS on the client to mount the file system on the remote system) requires about 67% more time than straight NFS, and the Gluster mount on the client requires about 54% more time than NFS. Does this mean NFS is faster? In this simplified measurement, yes, but not by a huge amount, and we don't recommend extrapolating to the general case from this. Moreover, we haven't tuned the GlusterFS implementation at all.

For laughs, I turned up the caching a bit:

[root@jr5-lab local]# gluster volume set nfs-test performance.cache-size 1GB
Set volume successful
[root@jr5-lab local]# gluster volume set nfs-test performance.write-behind-window-size 512MB
Set volume successful
[root@jr5-lab local]# gluster volume set nfs-test performance.stat-prefetch 1
Set volume successful

(Note: even with 3.1.2, those last bits are still undocumented.)

With this, I was able to get the Gluster-NFS client to be about the same as the native Gluster client. There is still tuning that could be done, but there is a significant performance cost to doing many stat calls. There is little you can really do about that, other than to not do so many stat calls (which may not be an option in and of itself).

Our test looked like this, btw:

/usr/bin/time --verbose tar -xzf ~/kernel-2.6.37.scalable.tar.gz

which provides a great deal of information to the end user:

    Command being timed: "tar -xzf /root/kernel-2.6.37.scalable.tar.gz"
    User time (seconds): 11.08
    System time (seconds): 14.53
    Percent of CPU this job got: 29%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:26.53
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4240
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 3
    Minor (reclaiming a frame) page faults: 566
    Voluntary context switches: 297422
    Involuntary context switches: 691
    Swaps: 0
    File system inputs: 187120
    File system outputs: 951496
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

For laughs, I also wrote up a quick and dirty strace parsing tool, so you can run an experiment like this:

strace tar -xzf ~/kernel-2.6.37.scalable.tar.gz > q 2>&1
cat q | ~/iohist.pl

and then you'll see this:

[root@virtual nfs]# cat q | ~/iohist.pl
read operations: 42563
write operations: 73798
meta operations: 159777
Total operations: 276138
read size (Bytes): 434322885
write size (Bytes): 403743188
Total IO size (Bytes) : 838066073
Average read size (Bytes): 10204.2
Average write size (Bytes): 5470.9
bin resolution = 512

With a little work, I can have it pull out timing information from strace, and then we can construct an average time per operation. We get about the same data regardless of local versus remote writes, which should help elucidate why traversing two network and storage stacks is so much more costly than traversing one: same number of operations, just a higher cost per operation. That strongly suggests you want to amortize each operation over a larger read/write.
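The real iohist.pl is linked at the bottom of this note; I haven't pasted it inline. To show the flavor of the approach, though, here is a stripped-down sketch of an strace parser. Treat it purely as an illustration under assumptions: the syscall coverage, the 20-bins-plus-overflow histogram layout, and the exact output format are guesses on my part, and the name iohist-sketch.pl is just a placeholder.

#!/usr/bin/env perl
# iohist-sketch.pl -- simplified sketch of an strace I/O summarizer.
# NOT the real iohist.pl; binning and output details here are assumptions.
# Usage: strace tar -xzf file.tar.gz > q 2>&1 ; cat q | ./iohist-sketch.pl
use strict;
use warnings;

my $bin = 512;                     # histogram bin width in bytes
my ( $rops, $wops, $mops ) = ( 0, 0, 0 );
my ( $rbytes, $wbytes ) = ( 0, 0 );
my ( %rhist, %whist );

while ( my $line = <STDIN> ) {
    # read(fd, buf, count) = nbytes
    if ( $line =~ /^read\(\d+,.*\)\s*=\s*(\d+)/ ) {
        $rops++;
        $rbytes += $1;
        $rhist{ int( $1 / $bin ) }++;
    }
    # write(fd, buf, count) = nbytes
    elsif ( $line =~ /^write\(\d+,.*\)\s*=\s*(\d+)/ ) {
        $wops++;
        $wbytes += $1;
        $whist{ int( $1 / $bin ) }++;
    }
    # anything else that looks like a completed syscall counts as a meta op
    elsif ( $line =~ /^\w+\(.*\)\s*=/ ) {
        $mops++;
    }
}

printf "read operations: %d\n",             $rops;
printf "write operations: %d\n",            $wops;
printf "meta operations: %d\n",             $mops;
printf "Total operations: %d\n",            $rops + $wops + $mops;
printf "read size (Bytes): %d\n",           $rbytes;
printf "write size (Bytes): %d\n",          $wbytes;
printf "Total IO size (Bytes) : %d\n",      $rbytes + $wbytes;
printf "Average read size (Bytes): %.1f\n",  $rops ? $rbytes / $rops : 0;
printf "Average write size (Bytes): %.1f\n", $wops ? $wbytes / $wops : 0;
printf "bin resolution = %d\n",             $bin;

# dump the binned size distributions, one count per line, with the
# last bin acting as an overflow bucket for everything larger
my $maxbin = 20;
for ( [ 'read.hist', \%rhist ], [ 'write.hist', \%whist ] ) {
    my ( $file, $h ) = @$_;
    my @bins = (0) x ( $maxbin + 1 );
    while ( my ( $b, $n ) = each %$h ) {
        $bins[ $b < $maxbin ? $b : $maxbin ] += $n;
    }
    open my $fh, '>', $file or die "$file: $!";
    print {$fh} "$_\n" for @bins;
    close $fh;
}

The real script's classification and binning may differ, so don't expect byte-for-byte identical numbers from this sketch; the point is just how little machinery the measurement needs (the timing extraction mentioned above would come from running strace with -T and averaging the per-call times).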
iohist.pl also generates a read.hist and a write.hist, which contain the binned data (currently with 512 byte resolution). As you can see, there isn't too much in the way of reads (apart from the tarball at about 10 kB), and quite a bit in the way of writes (which actually follow a nice distribution, apart from the outlier bin at the end):

[root@virtual nfs]# cat read.hist
6
10
1
2
72
0
0
0
59
0
0
0
59
0
0
0
72
0
0
0
42282

[root@virtual nfs]# cat write.hist
5656
5596
4906
4387
3718
3187
2866
2588
2421
2197
1956
1830
1660
1600
1470
1336
1315
1198
1018
1046
21847

Basically, those reads, writes, and meta-ops are expensive over the wire, for NFS and GlusterFS and any other network/cluster file system. If you are deploying a web server scenario, you might want to set up a local cache (RAMdisk or local SSD based), fed from GlusterFS on start.

I should also point out that this is what we mean by large files (think VM images), read and written 1 MB at a time:

[root@virtual nfs]# dd if=/dev/zero of=big.file bs=1M count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.6762 seconds, 101 MB/s
[root@virtual nfs]# echo 3 > /proc/sys/vm/drop_caches
[root@virtual nfs]# dd if=big.file of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.12858 seconds, 118 MB/s

And this is what iohist has to say about it:

[root@virtual nfs]# echo 3 > /proc/sys/vm/drop_caches
[root@virtual nfs]# strace dd if=big.file of=/dev/null bs=1M > b 2>&1
[root@virtual nfs]# cat b | ~/iohist.pl
read operations: 1030
write operations: 1025
meta operations: 18
Total operations: 2073
read size (Bytes): 1073746848
write size (Bytes): 1073741856
Total IO size (Bytes) : 2147488704
Average read size (Bytes): 1042472.7
Average write size (Bytes): 1047553.0
bin resolution = 512

read.hist and write.hist show mostly the 1 MB reads/writes, apart from minor metadata bits.

You can grab iohist.pl here:

http://download.scalableinformatics.com/iohist/iohist.pl

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615