On Tue, Feb 07, 2012 at 09:59:44AM +0100, Carsten Aulbert wrote:
> (1) two servers with raid0 over all 12 disks, each serving as a single storage
> brick in simple replicated setup.

I am doing some similar tests at the moment.

1. What's your stripe size? If your files are typically 4MB, then using a
4MB or larger stripe size will mean that most requests are serviced from
a single disk. This gives higher latency for a single client but leaves
lots of spindles free for other concurrent clients, maximising your total
throughput.

If you have a stripe size of 1MB, then each 4MB file read will need to
seek on 4 disks. This gives you longer rotational latency (on average
close to a full rotation instead of half a rotation), but 1/4 of the
transfer time. This might be a good tradeoff for single clients, but
could reduce your total throughput with many concurrent clients.
Anything smaller is likely to suck.

2. Have you tried RAID10 in "far" mode? e.g.

mdadm --create /dev/md/raid10 -n 12 -c 4096 -l raid10 -p f2 -b internal /dev/sd{h..s}

The advantage here is that all the data can be read off the first half of
each disk, which means shorter seek times and also higher transfer rates
(the MB/sec at the outside of the disk is about twice the MB/sec at the
centre of the disk).

The downside is more seeking for writes, which may or may not pay off
with your 3:1 ratio. As long as there is write-behind going on, I think
it may.

Since each node would have RAID10 disk protection, you could use a simple
distributed setup on top of it (at the cost of losing the ability to take
a whole storage node out of service). Or you could have twice as many
disks.

3. When you mount your XFS filesystems, do you provide the 'inode64'
mount option? This can be critical for filesystems >1TB to get decent
performance, as I found out the hard way.

http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F

"noatime" and "nodiratime" can be helpful too.
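For what it's worth, a sketch of how those mount options might be applied
(the device name and mount point below are placeholders, not your actual
setup):

```shell
# Example /etc/fstab entry -- device and mount point are placeholders:
#
#   /dev/md/raid10  /export/brick1  xfs  inode64,noatime,nodiratime  0 0
#
# noatime/nodiratime can be applied to a live filesystem with a remount.
# Note that on older kernels inode64 only takes effect on a fresh
# umount/mount, not on a remount:
mount -o remount,noatime,nodiratime /export/brick1
```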
4. Have you tuned read_ahead_kb and max_sectors_kb? On my system the
defaults are 128 and 512 respectively.

for i in /sys/block/sd*/bdi/read_ahead_kb; do echo 1024 >"$i"; done
for i in /sys/block/sd*/queue/max_sectors_kb; do echo 1024 >"$i"; done

5. Have you tried apache or apache2 instead of nginx? Have you done any
testing directly on the mount point, not using a web server?

> Ideally, I'd like to have a set-up, where multiple relatively cheap computers
> with say 4 disks each run in raid0 or raid 10 or no raid and export this via
> glusterfs to our web server. Gluster's replication will serve as kind of fail-
> safe net and data redistribution will help, when we add more similar machines
> later on to counter increased usage.

I am currently building a similar test rig to yours, but with 24 disk
bays per 4U server. There are two LSI HBAs, one 16-port and one 8-port.
The HBAs are not the bottleneck (I can dd data to and from all the disks
at once no problem), and the CPUs are never very busy. One box has an
i3-2130 3.4GHz processor (dual core, hyperthreaded), and the other a Xeon
E3-1225 3.1GHz (quad core, no hyperthreading).

We're going this way because we need tons of storage packed into a rack
in a constrained power budget, but you might also find that fewer big
servers are better than lots of tiny ones. I'd consider at least 2U with
12 hot-swap bays.

I have yet to finish my testing, but here are two relevant results:

(1) with a single 12-disk RAID10 array with 1MB chunk size, shared using
glusterfs over 10GE to another machine, serving files between 500k and
800k, from the client I can read 180 random files per second (117MB/s)
with 20 concurrent processes, or 206 random files per second (134MB/s)
with 30 concurrent processes.

For comparison, direct local access to the filesystem on the RAID10 array
gives 291 files/sec (189MB/sec) and 337 files/sec (219MB/sec) with 20 or
30 concurrent readers.
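On the point about testing directly on the mount point: a minimal sketch
of a cold-cache read test is below. The directory argument (e.g. your
gluster mount point) and the use of dd are assumptions about your setup,
not a prescription:

```shell
#!/bin/sh
# Sketch: read every file under a directory once with dd, after dropping
# the page cache so the reads hit the disks rather than RAM.

cold_read_test() {
    dir=$1
    sync
    # Dropping caches needs root on a real run; suppress the error and
    # carry on so a dry run as an ordinary user still works.
    { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
    for f in "$dir"/*; do
        [ -f "$f" ] && dd if="$f" of=/dev/null bs=1M 2>/dev/null
    done
}
```

e.g. "cold_read_test /mnt/gluster" from the client, with no web server in
the path, tells you what the filesystem itself can do.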
However, the gluster performance at 1/2/5 concurrent readers tracks the
direct RAID10 closely, but falls off above that. So I think there may be
some gluster concurrency tuning required.

(2) in another configuration, I have 6 disks in one server and 6 in the
other, with twelve separate XFS filesystems, combined into a distributed
replicated array (much like yours but with half the spindles). The
gluster volume is mounted on one of the servers, which is where I run the
test, so 6 disks are local and 6 are remote.

Serving the same corpus of files I can read 177 random files per second
(115MB/s) with 20 concurrent readers, or 198 files/sec (129MB/s) with 30
concurrent readers.

The corpus is 100K files, so about 65GB in total, and the machines have
8GB RAM. Each test drops caches first:

http://linux-mm.org/Drop_Caches

I have no web server layer in front of this - I'm using a ruby script
which forks and fires off 'dd' processes to read the files from the
gluster mountpoint.

However, I am using low-performance 5940 RPM drives (Hitachi Deskstar
5K3000 HDS5C3030ALA630) because they are cheap, use little power, and are
reputedly very reliable. If you're using anything better than these you
should be able to improve on my numbers.

I haven't compared to NFS, which might be an option for you if you can
live without the node-to-node replication features of glusterfs.

Regards,

Brian.
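P.S. the fork-and-dd approach can be sketched in plain shell too, if you
want to reproduce the files/sec measurements without ruby. This assumes
GNU xargs (-P) and GNU sort (-R); the corpus path and counts are just
examples:

```shell
#!/bin/sh
# Sketch of the benchmark described above: read $3 randomly-chosen files
# from corpus directory $1 with $2 concurrent dd processes, then report
# files/sec. Drop caches first (see the Drop_Caches link above) for a
# cold-cache number.

bench() {
    dir=$1 readers=$2 nfiles=$3
    start=$(date +%s)
    find "$dir" -type f | sort -R | head -n "$nfiles" |
        xargs -P "$readers" -I{} dd if={} of=/dev/null bs=1M 2>/dev/null
    end=$(date +%s)
    elapsed=$((end - start))
    [ "$elapsed" -gt 0 ] || elapsed=1
    echo "$((nfiles / elapsed)) files/sec with $readers readers"
}
```

e.g. "bench /mnt/gluster 20 1000" approximates the 20-concurrent-reader
runs quoted above.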