Given the discussion over the past few days, I did a quick-n-dirty test. Long-ish post, with data, pointers, etc.

Setup: gigabit connected server and client, 941 Mb/s between the two (according to iperf). The test is to untar the 2.6.37 kernel source (485 MB total untarred/uncompressed size), dropping caches before each run. All times are measured in seconds, and all mounts use default options (though we used -o intr,tcp for NFS).

    server   client   client   client        client
    local    local    NFS      Gluster-NFS   GlusterFS
    ---------------------------------------------------
    3.9      9        85.97    143.5         132.3

So the Gluster-NFS translator (using NFS on the client to mount the file system on the remote system) requires about 67% more time than straight NFS, and the Gluster mount on the client requires about 54% more time than NFS. Does this mean NFS is faster? In this simplified measurement, yes, but not by a huge amount, and we don't recommend extrapolating to the general case from this. Moreover, we haven't tuned the GlusterFS implementation at all.

For laughs, I turned up the caching a bit:

[root@jr5-lab local]# gluster volume set nfs-test performance.cache-size 1GB
Set volume successful
[root@jr5-lab local]# gluster volume set nfs-test performance.write-behind-window-size 512MB
Set volume successful
[root@jr5-lab local]# gluster volume set nfs-test performance.stat-prefetch 1
Set volume successful

(Note: even with 3.1.2, those last bits are still undocumented.)

With this, I was able to get the Gluster-NFS client to be about the same as the native Gluster client. There is still tuning that could be done, but there is a significant performance cost to doing many stat calls. There is little you can really do about that, other than to not do so many stat calls (which may not be an option in and of itself).

Our test looked like this, btw:

/usr/bin/time --verbose tar -xzf ~/kernel-2.6.37.scalable.tar.gz

which provides a great deal of information to the end user:

    Command being timed: "tar -xzf /root/kernel-2.6.37.scalable.tar.gz"
    User time (seconds): 11.08
    System time (seconds): 14.53
    Percent of CPU this job got: 29%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:26.53
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4240
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 3
    Minor (reclaiming a frame) page faults: 566
    Voluntary context switches: 297422
    Involuntary context switches: 691
    Swaps: 0
    File system inputs: 187120
    File system outputs: 951496
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

For laughs, I also wrote up a quick and dirty strace parsing tool, so you can run an experiment like this:

strace tar -xzf ~/kernel-2.6.37.scalable.tar.gz > q 2>&1
cat q | ~/iohist.pl

and then you'll see this:

[root@virtual nfs]# cat q | ~/iohist.pl
read operations: 42563
write operations: 73798
meta operations: 159777
Total operations: 276138
read size (Bytes): 434322885
write size (Bytes): 403743188
Total IO size (Bytes) : 838066073
Average read size (Bytes): 10204.2
Average write size (Bytes): 5470.9
bin resolution = 512

With a little work, I can have it pull out timing information from strace, and then we can construct an average time per operation. We get about the same data regardless of local versus remote writes, which should help elucidate why traversing two network and storage stacks is so much more costly than traversing one: same number of operations, just a higher cost per operation. That strongly suggests you want to amortize each operation over a larger read/write.
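The real iohist.pl is linked at the bottom of this note; I haven't pasted it inline. To show the flavor of the approach, though, here is a stripped-down sketch of an strace parser. Treat it purely as an illustration under assumptions: the syscall coverage, the 20-bins-plus-overflow histogram layout, and the exact output format are guesses on my part, and the name iohist-sketch.pl is just a placeholder.

#!/usr/bin/env perl
# iohist-sketch.pl -- simplified sketch of an strace I/O summarizer.
# NOT the real iohist.pl; binning and output details here are assumptions.
# Usage: strace tar -xzf file.tar.gz > q 2>&1 ; cat q | ./iohist-sketch.pl
use strict;
use warnings;

my $bin = 512;                     # histogram bin width in bytes
my ( $rops, $wops, $mops ) = ( 0, 0, 0 );
my ( $rbytes, $wbytes ) = ( 0, 0 );
my ( %rhist, %whist );

while ( my $line = <STDIN> ) {
    # read(fd, buf, count) = nbytes
    if ( $line =~ /^read\(\d+,.*\)\s*=\s*(\d+)/ ) {
        $rops++;
        $rbytes += $1;
        $rhist{ int( $1 / $bin ) }++;
    }
    # write(fd, buf, count) = nbytes
    elsif ( $line =~ /^write\(\d+,.*\)\s*=\s*(\d+)/ ) {
        $wops++;
        $wbytes += $1;
        $whist{ int( $1 / $bin ) }++;
    }
    # anything else that looks like a completed syscall counts as a meta op
    elsif ( $line =~ /^\w+\(.*\)\s*=/ ) {
        $mops++;
    }
}

printf "read operations: %d\n",             $rops;
printf "write operations: %d\n",            $wops;
printf "meta operations: %d\n",             $mops;
printf "Total operations: %d\n",            $rops + $wops + $mops;
printf "read size (Bytes): %d\n",           $rbytes;
printf "write size (Bytes): %d\n",          $wbytes;
printf "Total IO size (Bytes) : %d\n",      $rbytes + $wbytes;
printf "Average read size (Bytes): %.1f\n",  $rops ? $rbytes / $rops : 0;
printf "Average write size (Bytes): %.1f\n", $wops ? $wbytes / $wops : 0;
printf "bin resolution = %d\n",             $bin;

# dump the binned size distributions, one count per line, with the
# last bin acting as an overflow bucket for everything larger
my $maxbin = 20;
for ( [ 'read.hist', \%rhist ], [ 'write.hist', \%whist ] ) {
    my ( $file, $h ) = @$_;
    my @bins = (0) x ( $maxbin + 1 );
    while ( my ( $b, $n ) = each %$h ) {
        $bins[ $b < $maxbin ? $b : $maxbin ] += $n;
    }
    open my $fh, '>', $file or die "$file: $!";
    print {$fh} "$_\n" for @bins;
    close $fh;
}

The real script's classification and binning may differ, so don't expect byte-for-byte identical numbers from this sketch; the point is just how little machinery the measurement needs (the timing extraction mentioned above would come from running strace with -T and averaging the per-call times).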
iohist.pl also generates a read.hist and a write.hist, which contain the binned data (currently with 512 byte resolution). As you can see, there isn't too much in the way of reads (apart from the tarball at about 10 kB), and quite a bit in the way of writes (which actually follow a nice distribution, apart from the outlier bin at the end):

[root@virtual nfs]# cat read.hist
6
10
1
2
72
0
0
0
59
0
0
0
59
0
0
0
72
0
0
0
42282

[root@virtual nfs]# cat write.hist
5656
5596
4906
4387
3718
3187
2866
2588
2421
2197
1956
1830
1660
1600
1470
1336
1315
1198
1018
1046
21847

Basically, those reads, writes, and meta-ops are expensive over the wire, for NFS and GlusterFS and any other network/cluster file system. If you are deploying a web server scenario, you might want to set up a local cache (RAMdisk or local SSD based), fed from GlusterFS on start.

I should also point out that this is what we mean by large files (think VM images), read and written 1 MB at a time:

[root@virtual nfs]# dd if=/dev/zero of=big.file bs=1M count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.6762 seconds, 101 MB/s
[root@virtual nfs]# echo 3 > /proc/sys/vm/drop_caches
[root@virtual nfs]# dd if=big.file of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.12858 seconds, 118 MB/s

And this is what iohist has to say about it:

[root@virtual nfs]# echo 3 > /proc/sys/vm/drop_caches
[root@virtual nfs]# strace dd if=big.file of=/dev/null bs=1M > b 2>&1
[root@virtual nfs]# cat b | ~/iohist.pl
read operations: 1030
write operations: 1025
meta operations: 18
Total operations: 2073
read size (Bytes): 1073746848
write size (Bytes): 1073741856
Total IO size (Bytes) : 2147488704
Average read size (Bytes): 1042472.7
Average write size (Bytes): 1047553.0
bin resolution = 512

read.hist and write.hist show mostly the 1 MB reads/writes, apart from minor metadata bits.

You can grab iohist.pl here:

http://download.scalableinformatics.com/iohist/iohist.pl

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615