Hi John,

when stdout is redirected to /dev/null, tar on my laptop does not do any
reads at all (tar cf - . > /dev/null). Can you confirm whether tar shows
the same behaviour on your test setup? When the output is redirected to
any file other than /dev/null, tar does perform the reads.

Could you attach an strace of tar?

regards,

On Sat, Feb 27, 2010 at 9:03 PM, John Feuerstein <john at feurix.com> wrote:
> Greetings,
>
> in contrast to some performance tips regarding small file *read*
> performance, I want to share these results. The test is rather simple
> but yields some very remarkable results: 400% improved read performance
> by simply dropping some of the so-called "performance translators"!
>
> Please note that this test resembles a simplified version of our
> workload, which is more or less sequential, read-only small file serving
> with an average of 100 concurrent clients. (We use GlusterFS as a
> flat-file backend to a cluster of webservers, which is hit only after
> missing some caches in a more sophisticated caching infrastructure on
> top of it.)
>
> The test setup is a 3-node AFR cluster, with server+client on each one,
> single process model (one volfile; the local volume is attached within
> the same process to save overhead), connected via 1 Gbit Ethernet. This
> way each node can continue to operate on its own, even if the whole
> internal network for GlusterFS is down.
>
> We used commodity hardware for the test. Each node is identical:
> - Intel Core i7
> - 12G RAM
> - 500GB filesystem
> - 1 Gbit NIC dedicated for GlusterFS
>
> Software:
> - Linux 2.6.32.8
> - GlusterFS 3.0.2
> - FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
> - Filesystem / Storage Backend:
>   - LVM2 on top of software RAID 1
>   - ext4 with noatime
>
> I will paste the configurations inline, so people can comment on them.
>
>
> /etc/fstab:
> -------------------------------------------------------------------------
> /dev/data/test /mnt/brick/test ext4 noatime 0 2
>
> /etc/glusterfs/test.vol /mnt/glusterfs/test glusterfs
>     noauto,noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log 0 0
> -------------------------------------------------------------------------
>
>
> ***
> Please note: this is the final configuration with the best results. All
> translators are numbered to make the explanation easier later on. Unused
> translators are commented out...
> The volume spec is identical on all nodes, except that the bind-address
> option in the server volume [*4*] is adjusted.
> ***
>
> /etc/glusterfs/test.vol
> -------------------------------------------------------------------------
> # Sat Feb 27 16:53:00 CET 2010 John Feuerstein <john at feurix.com>
> #
> # Single Process Model with AFR (Automatic File Replication).
>
>
> ##
> ## Storage backend
> ##
>
> #
> # POSIX STORAGE [*1*]
> #
> volume posix
>   type storage/posix
>   option directory /mnt/brick/test/glusterfs
> end-volume
>
> #
> # POSIX LOCKS [*2*]
> #
> #volume locks
> volume brick
>   type features/locks
>   subvolumes posix
> end-volume
>
>
> ##
> ## Performance translators (server side)
> ##
>
> #
> # IO-Threads [*3*]
> #
> #volume brick
> #  type performance/io-threads
> #  subvolumes locks
> #  option thread-count 8
> #end-volume
>
> ### End of performance translators
>
>
> #
> # TCP/IP server [*4*]
> #
> volume server
>   type protocol/server
>   subvolumes brick
>   option transport-type tcp
>   option transport.socket.bind-address 10.1.0.1   # FIXME
>   option transport.socket.listen-port 820
>   option transport.socket.nodelay on
>   option auth.addr.brick.allow 127.0.0.1,10.1.0.1,10.1.0.2,10.1.0.3
> end-volume
>
>
> #
> # TCP/IP clients [*5*]
> #
> volume node1
>   type protocol/client
>   option remote-subvolume brick
>   option transport-type tcp/client
>   option remote-host 10.1.0.1
>   option remote-port 820
>   option transport.socket.nodelay on
> end-volume
>
> volume node2
>   type protocol/client
>   option remote-subvolume brick
>   option transport-type tcp/client
>   option remote-host 10.1.0.2
>   option remote-port 820
>   option transport.socket.nodelay on
> end-volume
>
> volume node3
>   type protocol/client
>   option remote-subvolume brick
>   option transport-type tcp/client
>   option remote-host 10.1.0.3
>   option remote-port 820
>   option transport.socket.nodelay on
> end-volume
>
>
> #
> # Automatic File Replication Translator (AFR) [*6*]
> #
> # NOTE: "node3" is the primary metadata node, so this one *must*
> # be listed first in all volume specs! Also, node3 is the global
> # favorite-child with the definite file version if any conflict
> # arises while self-healing...
> #
> volume afr
>   type cluster/replicate
>   subvolumes node3 node1 node2
>   option read-subvolume node2
>   option favorite-child node3
> end-volume
>
>
>
> ##
> ## Performance translators (client side)
> ##
>
> #
> # IO-Threads [*7*]
> #
> #volume client-threads-1
> #  type performance/io-threads
> #  subvolumes afr
> #  option thread-count 8
> #end-volume
>
> #
> # Write-Behind [*8*]
> #
> volume wb
>   type performance/write-behind
>   subvolumes afr
>   option cache-size 4MB
> end-volume
>
>
> #
> # Read-Ahead [*9*]
> #
> #volume ra
> #  type performance/read-ahead
> #  subvolumes wb
> #  option page-count 2
> #end-volume
>
>
> #
> # IO-Cache [*10*]
> #
> volume cache
>   type performance/io-cache
>   subvolumes wb
>   option cache-size 1024MB
>   option cache-timeout 60
> end-volume
>
> #
> # Quick-Read for small files [*11*]
> #
> #volume qr
> #  type performance/quick-read
> #  subvolumes cache
> #  option cache-timeout 60
> #end-volume
>
> #
> # Metadata prefetch [*12*]
> #
> #volume sp
> #  type performance/stat-prefetch
> #  subvolumes qr
> #end-volume
>
> #
> # IO-Threads [*13*]
> #
> #volume client-threads-2
> #  type performance/io-threads
> #  subvolumes sp
> #  option thread-count 16
> #end-volume
>
> ### End of performance translators.
> -------------------------------------------------------------------------
>
>
>
> So let's start now.
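> (Side note: the two fstab entries above are only shortcuts. If you prefer
> explicit calls, the equivalent mounts should look roughly like this --
> same devices, volfile, options and mountpoints as in the fstab:)
>
> # Brick filesystem (explicit form of the first fstab entry)
> $ mount -t ext4 -o noatime /dev/data/test /mnt/brick/test
>
> # GlusterFS client (explicit form of the second fstab entry; noauto is
> # only meaningful inside fstab, so it is dropped here)
> $ mount -t glusterfs \
>     -o noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log \
>     /etc/glusterfs/test.vol /mnt/glusterfs/test
>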
> Unless explicitly stated otherwise, perform these steps on all nodes:
>
> # Prepare filesystem mountpoints
> $ mkdir -p /mnt/brick/test
>
> # Mount bricks
> $ mount /mnt/brick/test
>
> # Prepare brick roots (so lost+found won't end up in the volume)
> $ mkdir -p /mnt/brick/test/glusterfs
>
> # Load FUSE
> $ modprobe fuse
>
> # Prepare GlusterFS mountpoints
> $ mkdir -p /mnt/glusterfs/test
>
> # Mount GlusterFS
> # (we start with Node 3, which should become the metadata master)
> node3 $ mount /mnt/glusterfs/test
> node1 $ mount /mnt/glusterfs/test
> node2 $ mount /mnt/glusterfs/test
>
> # While doing the tests, we watch the logs on all nodes for errors:
> $ tail -f /var/log/glusterfs/test.log
>
> For each volume spec change, you have to unmount GlusterFS, change the
> vol file, and mount GlusterFS again. Before starting the tests, make sure
> everything is running and the volumes on all nodes are attached (watch
> the log files!).
>
>
> Write the test data for the read-only tests. These are lots of 20K
> files, which resemble most of our css/js/php/python files. You should
> adjust this to match your workload...
> -------------------------------------------------------------------------
> #!/bin/bash
> mkdir -p /mnt/glusterfs/test/data
> cd /mnt/glusterfs/test/data
> for topdir in x{1..100}
> do
>     mkdir -p $topdir
>     cd $topdir
>     for subdir in y{1..10}
>     do
>         mkdir $subdir
>         cd $subdir
>         for file in z{1..10}
>         do
>             dd if=/dev/zero of=20K-$RANDOM \
>                 bs=4K count=5 &> /dev/null && echo -n .
>         done
>         cd ..
>     done
>     cd ..
> done
> -------------------------------------------------------------------------
>
> OK, in our case /mnt/glusterfs/test/data is now populated with around
> 240M of data... enough for some simple tests.
>
> Each test run consists of this simplified simulation of sequentially
> reading all files, listing dirs and probably doing a stat():
>
> -------------------------------------------------------------------------
> $ cd /mnt/glusterfs/test/data
>
> # Always populate the io-cache first:
> $ time tar cf - . > /dev/null
>
> # Simulate and time 100 concurrent data consumers:
> $ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait
> -------------------------------------------------------------------------
>
>
> OK, so here are the results. As stated, take them with a grain of salt,
> and make sure the test resembles your workload. For example, read-ahead
> is, as we can see, useless in this case, but it might improve performance
> for files of a different size... :)
>
>
> # All translators active except *7* (client io-threads after AFR)
> real 2m27.555s
> user 0m3.536s
> sys 0m6.888s
>
> # All translators active except *13* (client io-threads at the end)
> real 2m23.779s
> user 0m2.824s
> sys 0m5.604s
>
> # All translators active except *7* and *13* (no client io-threads!)
> real 0m53.097s
> user 0m3.512s
> sys 0m6.436s
>
> # All translators active except *7*, *13*, and only 8 io-threads in *3*
> # instead of the default of 16 (server-side io-threads)
> real 0m45.942s
> user 0m3.472s
> sys 0m6.612s
>
> # All translators active except *3*, *7*, *13* (no io-threads at all!)
> real 0m40.332s
> user 0m3.776s
> sys 0m6.424s
>
> # All translators active except *3*, *7*, *12*, *13* (no stat prefetch)
> real 0m39.205s
> user 0m3.672s
> sys 0m6.084s
>
> # All translators active except *3*, *7*, *11*, *12*, *13*
> # (no quickread)
> real 0m39.116s
> user 0m3.652s
> sys 0m5.816s
>
> # All translators active except *3*, *7*, *11*, *12*, *13* and
> # with page-count = 2 in *9* instead of 4
> real 0m38.851s
> user 0m3.492s
> sys 0m5.796s
>
> # All translators active except *3*, *7*, *9*, *11*, *12*, *13*
> # (no read-ahead)
> real 0m38.576s
> user 0m3.356s
> sys 0m6.076s
>
>
> OK, that's it. Compare the results with all performance translators
> enabled against the final basic setup without any of the magic:
>
> with all performance translators:     real 2m27.555s
> without most performance translators: real 0m38.576s
>
> This is a _HUGE_ improvement!
>
> (disregard user and sys, they were practically the same in all tests)
>
>
> Some final words:
>
> - don't add performance translators blindly (!)
> - always test with a workload similar to the one you will use in production
> - never go and copy+paste a volume spec, then moan about bad performance
> - don't rely on "glusterfs-volgen", it just gives you a starting point!
> - fewer translators == less overhead
> - read the documentation for all options of all translators to get an idea:
>   http://www.gluster.com/community/documentation/index.php/Translators
>   (some stuff is still undocumented, but this is open source... so have a
>   look)
>
>
> Best regards,
> John Feuerstein
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>
--
Raghavendra G
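
P.S. To see whether GNU tar actually touches the file data when the archive
ends up on /dev/null (it looks like it special-cases that and skips the
reads, which would make the benchmark above mostly a metadata test),
something along these lines should be enough -- plain strace and GNU tar,
nothing GlusterFS-specific; /tmp/test.tar is just an arbitrary scratch file:

$ cd /mnt/glusterfs/test/data

# Archive to /dev/null: if the shortcut kicks in, the syscall summary
# (printed to stderr by -c) shows few or no read() calls for the data files.
$ strace -c -f -e trace=open,read tar cf - . > /dev/null

# Same run into a regular file: the read() calls show up as usual.
$ strace -c -f -e trace=open,read tar cf - . > /tmp/test.tar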