>>> On Wed, 12 Mar 2008 17:27:58 -0500, Bryan Mark Mesich
>>> <bmesich@xxxxxxxxxxxxxxxxxxxxxxxxxx> said:

bmesich> [ ... ] performance of our IMAP mail servers that have
bmesich> storage on-top RAID 5. [ ... ]

That may not be a good combination. I generally dislike RAID5, but
even without being prejudiced :-), RAID5 is suited to a mostly-read
load, and a mail store is usually not mostly-read, because it does
lots of appends; in particular it does lots of widely scattered
appends. As usual, I'd rather use RAID10 here.

Most importantly, the structure of the mail store mailboxes matters a
great deal, e.g. whether it is mbox-style, or else maildir-style, or
something else entirely like DBMS-style.

bmesich> During peek times of the day, a single IMAP box might
bmesich> have 500+ imapd processes running simultaneously.

The 'imapd's are not such a big deal; the delivery daemons may be
causing more trouble, as may the interference between the two and the
type of elevator. As to the elevator, in your case who knows which
would be best: a case could be made for 'anticipatory', another one
for 'deadline', and perhaps 'noop' is the safest.

As usual, flusher parameters are also probably quite important.
Setting the RHEL 'vm/max_queue_size' to a low value, something like
50-100 in your case, might be useful.

Now that it occurs to me, another factor is whether your users access
the mail store mostly as a download area (that is, mostly as they
would if using POP3) or whether they actually keep their mail
permanently on it and edit the mailboxes via IMAP4. In the latter
case the reliability of the mail store is even more important, and
the write rates even higher, so I would recommend RAID10 even more
strongly.

If you think that RAID10 costs too much in WASTED capacity, good
luck! :-) Or you could investigate whether your IMAP server can do
compressed mailboxes. You have plenty of CPU power, probably more
than enough relative to your network speed.

bmesich> I'm currently testing with the following:

bmesich> Intel SE7520BD2 motherboard
bmesich> (2) 3Ware PCI-E 9550SX 8 port SATA card

Pretty good.

bmesich> 1 GB of memory

Probably ridiculously small. Sad to say...

bmesich> (2) Core2Duo 3.0GHz
bmesich> (16) Segate 750GB Barracuda ES drives
bmesich> RHEL 5.1 server (stock 2.6.18)

Pretty good. Those 16x750GB look *perfect* for a nice sw RAID10, with
8 pairs, each member of a pair on a different 9550SX (a sketch of
such an array is further down).

bmesich> I've setup 3 RAID5 arrays arranged in a 3+1 layout. I
bmesich> created them with different chunk sizes (64k, 128k, and
bmesich> 256k) for testing purposes.

Chunk size in your situation is the least of your worries. Anyhow it
depends on the structure of your mail store.

bmesich> Write-caching has been disabled (no battery) on the
bmesich> 3Ware cards

That can be a very bad idea, if it also disables the built-in cache
of the disks; if the on-disk cache is enabled it probably matters
relatively little. Anyhow, for a system like yours doing what it
does, I would consider battery backup *for the whole server* pretty
important.

bmesich> and I'm using ext3 as my filesystem.

That's likely to be a very bad idea. Consider just this: your 3+1
arrays have one 3x750GB filesystem each (I guess). How long could an
'fsck' of one of those take? You really don't want to know.

Depending on mail store structure I'd be using ReiserFS, JFS or even
XFS. My usual suggestion is to use JFS by default unless one has
special reasons. There may well be special reasons!
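Going back to the array itself for a moment: purely as an
illustration, a minimal sketch of how such an 8-pair sw RAID10 could
be put together, assuming the sixteen drives show up as /dev/sda-sdh
on one 9550SX and /dev/sdi-sdp on the other (the device names and the
256KiB chunk are my assumptions, not a recommendation):

  # With the default 'n2' layout adjacent devices in the list are
  # mirrored, so alternating the two controllers puts each half of a
  # pair on a different 9550SX.
  mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=256 \
        --raid-devices=16 \
        /dev/sda /dev/sdi /dev/sdb /dev/sdj /dev/sdc /dev/sdk /dev/sdd /dev/sdl \
        /dev/sde /dev/sdm /dev/sdf /dev/sdn /dev/sdg /dev/sdo /dev/sdh /dev/sdp

  # The elevator is selected per disk; 'deadline' is just one of the
  # candidates mentioned above.
  for q in /sys/block/sd[a-p]/queue/scheduler; do echo deadline > "$q"; done

Whether to use whole disks or partitions, and which chunk size, are
separate discussions of course.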
As to the filesystem choice: in your case ReiserFS would be rather
better if the mail store is organized as a lot of small files, and
XFS if it is organized as large mail archive files, for example.

XFS also has the advantages that it supports write barriers (but I am
not sure whether the one in 2.6.18 already does), so you could
probably enable the host adapter cache, and that it handles highly
parallel access patterns well. It has the disadvantages that it can
require several GB of memory to 'fsck' (like 1GB per 1TB of
filesystem, or more), and that it does not work as well with lots of
small files (while ReiserFS is very good, and JFS not too bad).

bmesich> When creating the filesystems, I used sensible stride
bmesich> sizes and disabled directory indexing.

That's very wise, on both counts (an example of both is sketched
after the benchmark figures below).

bmesich> I ran bonnie 1.4 on 2 of the filesystems with the following results:

             ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
             -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-

bmesich> Chunk size = 64k
bmesich> ./Bonnie -d /mnt/64/ -s 1024 -y -u -o_direct
bmesich>       MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU  /sec %CPU
bmesich>   1*1024 59185 50.9 21849  7.5 14490  5.0 16377 24.1 212812 25.3 267.8  1.5

bmesich> Chunk size = 256k
bmesich> ./Bonnie -d /mnt/256/ -s 1024 -y -u -o_direct
bmesich>       MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU  /sec %CPU
bmesich>   1*1024 47650 40.6 22561  6.8 19019  6.9 16872 22.2 209770 23.7 267.2  1.5

So you are getting 45-60MB/s writing and 210-220MB/s reading. The
reading rate is roughly reasonable (each of those disks can do
70-80MB/s on the outer tracks), but the write speed is pretty
disastrous. Probably, like many others, you are using RAID5 without
realizing the pitfalls of parity RAID (amply explained in some recent
threads). Such pitfalls are particularly bad if the access patterns
involve lots of small writes.

This is what I get on a 4x(1+1) RAID10 (with 'f2' for better read
performance, though I would suggest the default 'n2' in your case)
with mixed 400GB and 1TB disks (and 'blockdev --setra 1024',
regrettably, as detailed in a recent message of mine):

  # Bonnie -y -u -o_direct -s 2000 -v 2 -d /tmp/a
  Bonnie 1.4: File '/tmp/a/Bonnie.27318', size: 2097152000, volumes: 2
  Using O_DIRECT for block based I/O
  Writing with putc_unlocked()...done:  176107 kB/s  79.0 %CPU
  Rewriting...                   done:   31797 kB/s   3.1 %CPU
  Writing intelligently...       done:  243844 kB/s   9.6 %CPU
  Reading with getc_unlocked()...done:   22424 kB/s  28.3 %CPU
  Reading intelligently...       done:  475166 kB/s  14.9 %CPU
  Seek numbers calculated on first volume only
  Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...

              ---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek-
              -CharUnlk- -DIOBlock- -DRewrite- -CharUnlk- -DIOBlock- --04k (03)-
  Machine       MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU  /sec %CPU
  serv02    2*2000 176107 79.0 243844 9.6 31797  3.1 22424 28.3 475166 14.9 382.1  1.0

Note however that the seek rates are not much higher than yours,
which is more or less to be expected.

bmesich> OK...so now I have some benchmarks, but I'm not sure if
bmesich> it remotely relates to normal IO on a busy IMAP server.

I think that's unlikely -- Bonnie is a good test of the limits of a
_storage system_, not particularly of any given application usage
pattern (unless that application looks a lot like Bonnie).
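As for the stride/dir_index settings quoted earlier and the readahead
tweak used for my figures above, a minimal sketch, assuming a 3+1
RAID5 with 64KiB chunks and 4KiB ext3 blocks (so stride = 64KiB/4KiB
= 16) that appears as the hypothetical device /dev/sdb:

  # ext3 with a stride matching the 64KiB chunk, and directory
  # indexing turned off
  mke2fs -j -b 4096 -E stride=16 -O ^dir_index /dev/sdb

  # larger readahead on the array device (the unit is 512-byte sectors)
  blockdev --setra 1024 /dev/sdb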
bmesich> I would expect an IMAP server to have many relatively
bmesich> small random reads and writes.

Perhaps -- but it all depends on the structure of the mail store and
whether the users download mail or keep their mailboxes on the
server, and how big those mailboxes tend to be.

bmesich> With this said, has anyone ever tried tuning a RAID5
bmesich> array to a busy mail server (or similar application)?

Note a small but important point of terminology: a mail server and a
mail store server are two very different things. They may be running
on the same hardware, but that's all.

bmesich> An ever better question would be how a person can go
bmesich> about benchmarking different storage configurations
bmesich> that can be applied to a specific application. [ ... ]
bmesich> Should I measure throughput or smiling email users :)

Your application is narrow enough. There are mail-specific
benchmarks, e.g. Postmark, but they tend to be for mail servers, not
mail store servers. A mail store server is, though, in effect a file
server, even if the protocol is IMAP4 rather than SMB or NFS or
WebDAV. But file sizes, numbers, and access patterns matter.

Thinking of file servers, the hundreds of IMAP daemons and the size
of the mail store point to a large concurrent user base. I would
dearly hope that you have several good (with a fair bit of
offloading) 1Gb/s interfaces with load balancing across them (either
bonding or ECMP), or at least one 10Gb/s interface, and a pretty good
switch/router/network, and that you have set the obvious TCP
parameters for high-speed transfers over high-bandwidth links. If
your users are typical contemporary ones and send each other
attachments dozens of megabytes long, a single 1Gb/s interface that
can do 110MB/s even with the best parameters is not going to be
enough.
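For completeness, the "obvious TCP parameters" would be something
along these lines; the buffer sizes and interface names below are
only an illustration, not a tested recommendation for your setup:

  # allow larger TCP socket buffers for high-bandwidth transfers
  # (persist them in /etc/sysctl.conf)
  sysctl -w net.core.rmem_max=16777216
  sysctl -w net.core.wmem_max=16777216
  sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

  # one (assumed) way to bond two 1Gb/s ports by hand; 802.3ad needs
  # switch support, and eth0/eth1 and the address are placeholders
  modprobe bonding mode=802.3ad miimon=100
  ifconfig bond0 192.168.0.10 netmask 255.255.255.0 up
  ifenslave bond0 eth0 eth1

On RHEL you would of course normally do the bonding part via the
ifcfg-* files rather than by hand; the above is just to show the
idea.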