Brian - thank you for sharing these configuration tips. I'd love to have
that in a blog post :) As a close second, perhaps you could post a mini
Q&A on community.gluster.org? This is the type of information that's very
useful for Google to index and make available.

Thanks, JM

----- Original Message -----
> On Tue, Feb 07, 2012 at 09:59:44AM +0100, Carsten Aulbert wrote:
> > (1) two servers with raid0 over all 12 disks, each serving as a single
> > storage brick in a simple replicated setup.
>
> I am doing some similar tests at the moment.
>
> 1. What's your stripe size? If your files are typically 4MB, then using
> a 4MB or larger stripe size will mean that most requests are serviced
> from a single disk. This will give higher latency for a single client
> but leave lots of spindles free for other concurrent clients, maximising
> your total throughput.
>
> If you have a stripe size of 1MB, then each file read will need to seek
> on 4 disks. This gives you longer rotational latency (on average close
> to a full rotation instead of 1/2 a rotation), but 1/4 of the transfer
> time. This might be a good tradeoff for single clients, but could reduce
> your total throughput with many concurrent clients.
>
> Anything smaller is likely to suck.
>
> 2. Have you tried RAID10 in "far" mode? e.g.
>
>   mdadm --create /dev/md/raid10 -n 12 -c 4096 -l raid10 -p f2 \
>     -b internal /dev/sd{h..s}
>
> The advantage here is that all the data can be read off the first half
> of each disk, which means shorter seek times and also higher transfer
> rates (the MB/sec at the outside of the disk is about twice the MB/sec
> at the centre of the disk).
>
> The downside is more seeking for writes, which may or may not pay off
> with your 3:1 ratio. As long as there is write-behind going on, I think
> it may.
>
> Since each node has RAID10 disk protection, you could use a simple
> distributed setup on top of it (at the cost of losing the ability to
> take a whole storage node out of service). Or you could have twice as
> many disks.
>
> 3. When you mount your XFS filesystems, do you provide the 'inode64'
> mount option? This can be critical for filesystems >1TB to get decent
> performance, as I found out the hard way.
> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F
>
> "noatime" and "nodiratime" can be helpful too.
>
> 4. Have you tuned read_ahead_kb and max_sectors_kb? On my system the
> defaults are 128 and 512 respectively.
>
>   for i in /sys/block/sd*/bdi/read_ahead_kb; do echo 1024 > "$i"; done
>   for i in /sys/block/sd*/queue/max_sectors_kb; do echo 1024 > "$i"; done
>
> 5. Have you tried apache or apache2 instead of nginx? Have you done any
> testing directly on the mount point, not using a web server?
>
> > Ideally, I'd like to have a set-up where multiple relatively cheap
> > computers with say 4 disks each run in raid0 or raid10 or no raid and
> > export this via glusterfs to our web server. Gluster's replication
> > will serve as a kind of fail-safe net, and data redistribution will
> > help when we add more similar machines later on to counter increased
> > usage.
>
> I am currently building a similar test rig to yours, but with 24 disk
> bays per 4U server. There are two LSI HBAs, one 16 port and one 8 port.
>
> The HBAs are not the bottleneck (I can dd data to and from all the disks
> at once no problem), and the CPUs are never very busy.
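[JM: interjecting inline for anyone who lands on this thread via a search
later. The settings Brian describes in points 3 and 4 above are easy to
lose across reboots, so here is a rough, untested sketch of one way to make
them persistent. The device name /dev/md0 and mount point /data are only
placeholders; substitute your own.

  # /etc/fstab: XFS with inode64 (noatime already implies nodiratime on
  # current kernels, but listing both does no harm)
  /dev/md0  /data  xfs  inode64,noatime,nodiratime  0  0

  # /etc/rc.local (or your distro's equivalent): reapply the read-ahead
  # and request-size tuning from point 4 at every boot
  for i in /sys/block/sd*/bdi/read_ahead_kb;    do echo 1024 > "$i"; done
  for i in /sys/block/sd*/queue/max_sectors_kb; do echo 1024 > "$i"; done
]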
> One box has an i3-2130 3.4GHz processor (dual core, hyperthreaded), and
> the other a Xeon E3-1225 3.1GHz (quad core, no hyperthreading).
>
> We're going this way because we need tons of storage packed into a rack
> within a constrained power budget, but you might also find that fewer
> big servers are better than lots of tiny ones. I'd consider at least 2U
> with 12 hot-swap bays.
>
> I have yet to finish my testing, but here are two relevant results:
>
> (1) With a single 12-disk RAID10 array with a 1MB chunk size, shared
> using glusterfs over 10GE to another machine and serving files between
> 500k and 800k, from the client I can read 180 random files per second
> (117MB/s) with 20 concurrent processes, or 206 random files per second
> (134MB/s) with 30 concurrent processes.
>
> For comparison, direct local access to the filesystem on the RAID10
> array gives 291 files/sec (189MB/sec) and 337 files/sec (219MB/sec) with
> 20 or 30 concurrent readers.
>
> However, the gluster performance at 1/2/5 concurrent readers tracks the
> direct RAID10 closely, but falls off above that. So I think there may be
> some gluster concurrency tuning required.
>
> (2) In another configuration, I have 6 disks in one server and 6 in the
> other, with twelve separate XFS filesystems combined into a distributed
> replicated array (much like yours but with half the spindles). The
> gluster volume is mounted on one of the servers, which is where I run
> the test, so 6 disks are local and 6 are remote. Serving the same corpus
> of files, I can read 177 random files per second (115MB/s) with 20
> concurrent readers, or 198 files/sec (129MB/s) with 30 concurrent
> readers.
>
> The corpus is 100K files, so about 65GB in total, and the machines have
> 8GB RAM. Each test drops caches first: http://linux-mm.org/Drop_Caches
>
> I have no web server layer in front of this - I'm using a ruby script
> which forks and fires off 'dd' processes to read the files from the
> gluster mountpoint.
>
> However, I am using low-performance 5940 RPM drives (Hitachi Deskstar
> 5K3000 HDS5C3030ALA630) because they are cheap, use little power, and
> are reputedly very reliable. If you're using anything better than these
> you should be able to improve on my numbers.
>
> I haven't compared to NFS, which might be an option for you if you can
> live without the node-to-node replication features of glusterfs.
>
> Regards,
>
> Brian.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
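P.S. For anyone who wants to reproduce Brian's read test without writing a
ruby harness first, a rough shell equivalent of "fork N readers that dd
random files from the gluster mount" might look like the sketch below. It
is untested; MNT and FILELIST are placeholders (FILELIST holds one path per
line, relative to the mount), and the files-per-second bookkeeping is left
out.

  # drop page, dentry and inode caches first, as in Brian's tests (needs root)
  sync; echo 3 > /proc/sys/vm/drop_caches

  MNT=/mnt/gluster       # placeholder: glusterfs mount point
  FILELIST=files.txt     # placeholder: list of relative file paths
  N=20                   # number of concurrent readers

  for i in $(seq 1 "$N"); do
      ( shuf "$FILELIST" | while read -r f; do
            dd if="$MNT/$f" of=/dev/null bs=1M 2>/dev/null
        done ) &
  done
  wait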