Greetings,

in contrast to some performance tips regarding small file *read*
performance, I want to share these results. The test is rather simple but
yields some remarkable results: a ~400% improvement in read performance,
achieved simply by dropping some of the so-called "performance
translators"!

Please note that this test resembles a simplified version of our workload,
which is more or less sequential, read-only small-file serving with an
average of 100 concurrent clients. (We use GlusterFS as a flat-file
backend to a cluster of webservers; it is hit only after a miss in a more
sophisticated caching infrastructure on top of it.)

The test setup is a 3-node AFR cluster with server+client on each node,
using the single-process model (one volfile; the local volume is attached
to within the same process to save overhead), connected via 1 Gbit
Ethernet. This way each node can continue to operate on its own, even if
the whole internal network for GlusterFS is down.

We used commodity hardware for the test. Each node is identical:

- Intel Core i7
- 12G RAM
- 500GB filesystem
- 1 Gbit NIC dedicated to GlusterFS

Software:

- Linux 2.6.32.8
- GlusterFS 3.0.2
- FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13

Filesystem / Storage Backend:

- LVM2 on top of software RAID 1
- ext4 with noatime

I will paste the configurations inline, so people can comment on them.

/etc/fstab:
-------------------------------------------------------------------------
/dev/data/test  /mnt/brick/test  ext4  noatime  0 2

/etc/glusterfs/test.vol  /mnt/glusterfs/test  glusterfs  noauto,noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log  0 0
-------------------------------------------------------------------------

*** Please note: this is the final configuration with the best results.
All translators are numbered to make the explanation easier later on.
Unused translators are commented out...
The volume spec is identical on all nodes, except that the bind-address
option in the server volume [*4*] is adjusted.

*** /etc/glusterfs/test.vol
-------------------------------------------------------------------------
# Sat Feb 27 16:53:00 CET 2010 John Feuerstein <john at feurix.com>
#
# Single Process Model with AFR (Automatic File Replication).

##
## Storage backend
##

#
# POSIX STORAGE [*1*]
#
volume posix
  type storage/posix
  option directory /mnt/brick/test/glusterfs
end-volume

#
# POSIX LOCKS [*2*]
#
#volume locks
volume brick
  type features/locks
  subvolumes posix
end-volume


##
## Performance translators (server side)
##

#
# IO-Threads [*3*]
#
#volume brick
#  type performance/io-threads
#  subvolumes locks
#  option thread-count 8
#end-volume

### End of performance translators


#
# TCP/IP server [*4*]
#
volume server
  type protocol/server
  subvolumes brick
  option transport-type tcp
  option transport.socket.bind-address 10.1.0.1  # FIXME
  option transport.socket.listen-port 820
  option transport.socket.nodelay on
  option auth.addr.brick.allow 127.0.0.1,10.1.0.1,10.1.0.2,10.1.0.3
end-volume

#
# TCP/IP clients [*5*]
#
volume node1
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.1
  option remote-port 820
  option transport.socket.nodelay on
end-volume

volume node2
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.2
  option remote-port 820
  option transport.socket.nodelay on
end-volume

volume node3
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.3
  option remote-port 820
  option transport.socket.nodelay on
end-volume

#
# Automatic File Replication Translator (AFR) [*6*]
#
# NOTE: "node3" is the primary metadata node, so this one *must*
# be listed first in all volume specs! Also, node3 is the global
# favorite-child with the definite file version if any conflict
# arises while self-healing...
#
volume afr
  type cluster/replicate
  subvolumes node3 node1 node2
  option read-subvolume node2
  option favorite-child node3
end-volume


##
## Performance translators (client side)
##

#
# IO-Threads [*7*]
#
#volume client-threads-1
#  type performance/io-threads
#  subvolumes afr
#  option thread-count 8
#end-volume

#
# Write-Behind [*8*]
#
volume wb
  type performance/write-behind
  subvolumes afr
  option cache-size 4MB
end-volume

#
# Read-Ahead [*9*]
#
#volume ra
#  type performance/read-ahead
#  subvolumes wb
#  option page-count 2
#end-volume

#
# IO-Cache [*10*]
#
volume cache
  type performance/io-cache
  subvolumes wb
  option cache-size 1024MB
  option cache-timeout 60
end-volume

#
# Quick-Read for small files [*11*]
#
#volume qr
#  type performance/quick-read
#  subvolumes cache
#  option cache-timeout 60
#end-volume

#
# Metadata prefetch [*12*]
#
#volume sp
#  type performance/stat-prefetch
#  subvolumes qr
#end-volume

#
# IO-Threads [*13*]
#
#volume client-threads-2
#  type performance/io-threads
#  subvolumes sp
#  option thread-count 16
#end-volume

### End of performance translators.
-------------------------------------------------------------------------

So let's start now. If not explicitly stated otherwise, perform on all
nodes:

# Prepare filesystem mountpoints
$ mkdir -p /mnt/brick/test

# Mount bricks
$ mount /mnt/brick/test

# Prepare brick roots (so lost+found won't end up in the volume)
$ mkdir -p /mnt/brick/test/glusterfs

# Load FUSE
$ modprobe fuse

# Prepare GlusterFS mountpoints
$ mkdir -p /mnt/glusterfs/test

# Mount GlusterFS
# (we start with node3, which should become the metadata master)
node3 $ mount /mnt/glusterfs/test
node1 $ mount /mnt/glusterfs/test
node2 $ mount /mnt/glusterfs/test

# While doing the tests, watch the logs on all nodes for errors:
$ tail -f /var/log/glusterfs/test.log

For each volume spec change, you have to unmount GlusterFS, change the
vol file, and mount GlusterFS again.
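Since that cycle is repeated for every single test run below, it is worth
scripting. A minimal sketch, assuming the paths from this setup (the
function name and the DRY_RUN switch are my additions, not part of
GlusterFS):

```shell
# Sketch of a remount helper for volume spec changes.
# With DRY_RUN=1 the commands are only printed, not executed.
MNT=/mnt/glusterfs/test
VOLFILE=/etc/glusterfs/test.vol

run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }

# usage: apply_volfile <new-volfile>
apply_volfile() {
    run umount "$MNT"
    run cp "$1" "$VOLFILE"
    run mount "$MNT"
}
```

Run it on each node with the candidate vol file as argument; DRY_RUN=1
lets you verify the three steps before touching a live mount.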
Before starting the tests, make sure everything is running and the
volumes on all nodes are attached (watch the log files!).

Now write the test data for the read-only tests: lots of 20K files,
which resemble most of our css/js/php/python files. You should adjust
this to match your workload...

-------------------------------------------------------------------------
#!/bin/bash
mkdir -p /mnt/glusterfs/test/data
cd /mnt/glusterfs/test/data

for topdir in x{1..100}
do
    mkdir -p $topdir
    cd $topdir
    for subdir in y{1..10}
    do
        mkdir $subdir
        cd $subdir
        for file in z{1..10}
        do
            # include $file in the name so repeated $RANDOM values
            # can't overwrite each other
            dd if=/dev/zero of=20K-$file-$RANDOM \
                bs=4K count=5 &> /dev/null && echo -n .
        done
        cd ..
    done
    cd ..
done
-------------------------------------------------------------------------

OK, in our case /mnt/glusterfs/test/data is now populated with around
~240M of data... enough for some simple tests.

Each test run consists of this simplified simulation of sequentially
reading all files, listing dirs, and probably doing a stat():

-------------------------------------------------------------------------
$ cd /mnt/glusterfs/test/data

# Always populate the io-cache first:
$ time tar cf - . > /dev/null

# Simulate and time 100 concurrent data consumers:
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait
-------------------------------------------------------------------------

OK, so here are the results. As stated, take them with a grain of salt,
and make sure your test resembles your real workload. For example,
read-ahead is, as we see, useless in this case, but it might improve
performance for files of a different size... :)

# All translators active except *7* (client io-threads after AFR)
real    2m27.555s
user    0m3.536s
sys     0m6.888s

# All translators active except *13* (client io-threads at the end)
real    2m23.779s
user    0m2.824s
sys     0m5.604s

# All translators active except *7* and *13* (no client io-threads!)
real    0m53.097s
user    0m3.512s
sys     0m6.436s

# All translators active except *7*, *13*, and only 8 io-threads in *3*
# instead of the default of 16 (server-side io-threads)
real    0m45.942s
user    0m3.472s
sys     0m6.612s

# All translators active except *3*, *7*, *13* (no io-threads at all!)
real    0m40.332s
user    0m3.776s
sys     0m6.424s

# All translators active except *3*, *7*, *12*, *13* (no stat-prefetch)
real    0m39.205s
user    0m3.672s
sys     0m6.084s

# All translators active except *3*, *7*, *11*, *12*, *13*
# (no quick-read)
real    0m39.116s
user    0m3.652s
sys     0m5.816s

# All translators active except *3*, *7*, *11*, *12*, *13* and
# with page-count = 2 in *9* instead of 4
real    0m38.851s
user    0m3.492s
sys     0m5.796s

# All translators active except *3*, *7*, *9*, *11*, *12*, *13*
# (no read-ahead)
real    0m38.576s
user    0m3.356s
sys     0m6.076s

OK, that's it. Compare the run with all performance translators to the
final basic setup without any of the magic:

with all performance translators:     real 2m27.555s
without most performance translators: real 0m38.576s

This is a _HUGE_ improvement! (Disregard user and sys; they were
practically the same in all tests.)

Some final words:

- don't add performance translators blindly (!)
- always test with a workload similar to the one you will run in production
- never copy+paste a volume spec and then moan about bad performance
- don't rely on "glusterfs-volgen" -- it only gives you a starting point!
- fewer translators == less overhead
- read the documentation for all options of all translators and get an
  idea of what they do:
  http://www.gluster.com/community/documentation/index.php/Translators
  (some stuff is still undocumented, but this is open source... so have
  a look)

Best regards,
John Feuerstein