As fastmail.fm seems to be a very big setup of Cyrus nodes, I would be
interested to know how you organize load balancing and manage disk
space.
Did you set up servers for a maximum of, let's say, 1000 mailboxes and
then move on to a new server? Or do you use a murder installation so you
can move mailboxes to another server once a certain one gets too much
load? Or do you have a big SAN storage with good mmap support behind an
arbitrary number of Cyrus nodes?
We don't use a murder setup. Two main reasons.
1) Murder wasn't very mature when we started
2) The main advantage murder gives you is a set of proxies (imap/pop/lmtp)
to connect users to the appropriate backends, which we ended up using other
software for, and a unified mailbox namespace if you want to do mailbox
sharing, which we didn't really need either. The unified namespace also
needs a global mailboxes.db somewhere. Because the skiplist backend mmaps
the entire mailboxes.db file into memory, and we already had multiple
machines with 100M+ mailboxes.db files, I didn't like the idea of dealing
with a 500M+ mailboxes.db file.
We don't use shared SAN storage. When we started out we didn't have that
much money, so purchasing an expensive SAN unit wasn't an option.
What we have has evolved over time to its current state. Basically we now
have a hardware set that is quite nicely balanced with regard to spool IO
vs metadata IO vs CPU, and a storage configuration that gives us
replication with good failure handling, but without wasting lots of
hardware on dedicated replica machines.
IMAP/POP frontend - We used to use perdition, but have now changed to nginx
(http://blog.fastmail.fm/?p=592). As you can read from the linked blog post,
nginx is great.
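nginx's mail proxy just needs a small HTTP auth service to tell it which
backend a user lives on. Here's a minimal sketch of that service in
Python (the BACKENDS map and addresses are hypothetical, and a real
service would also verify the password from the Auth-Pass header):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical user -> backend map; a real lookup hits a database.
    BACKENDS = {
        "alice@example.com": ("10.0.0.11", 143),
        "bob@example.com":   ("10.0.0.12", 143),
    }

    class AuthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # nginx passes the login in Auth-User/Auth-Pass headers and
            # expects Auth-Status/Auth-Server/Auth-Port in the reply.
            user = self.headers.get("Auth-User", "")
            backend = BACKENDS.get(user)
            self.send_response(200)
            if backend is None:
                self.send_header("Auth-Status", "Invalid login or password")
            else:
                self.send_header("Auth-Status", "OK")
                self.send_header("Auth-Server", backend[0])
                self.send_header("Auth-Port", str(backend[1]))
            self.end_headers()

    HTTPServer(("127.0.0.1", 8080), AuthHandler).serve_forever()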
LMTP delivery - We use a custom-written Perl daemon that forwards LMTP
deliveries from postfix to the appropriate backend server. It also does
the spam scanning, virus checking and a bunch of other in-house stuff.
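The forwarding half of that is conceptually simple. A minimal Python
sketch of the idea (our daemon is Perl and does much more; backend_for()
and the host/port here are hypothetical stand-ins for our user database
lookup):

    import smtplib

    def backend_for(recipient):
        # Hypothetical stand-in: the real lookup asks the user database
        # which backend server holds this user's mailbox.
        return ("backend1.internal.example.com", 24)

    def forward(sender, recipient, message_bytes):
        # Open an LMTP session to the chosen backend and hand the
        # message over; Python's smtplib speaks LMTP via the LMTP class.
        host, port = backend_for(recipient)
        with smtplib.LMTP(host, port) as lmtp:
            lmtp.sendmail(sender, [recipient], message_bytes)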
Servers - We use servers with attached SATA-to-SCSI RAID units with
battery-backed caches. We have a mix of large drives for the email spool,
and smaller, faster drives for metadata. That's the reason we sponsored
the metapartition config options
(http://cyrusimap.web.cmu.edu/imapd/changes.html).
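With those options, the relevant imapd.conf bits look something like this
(the paths are made up, and the metapartition_files list is just one
reasonable choice):

    # Large, slower drives hold the message spool.
    partition-default: /var/spool/imap
    # Smaller, faster drives hold the hot metadata files.
    metapartition-default: /var/imapmeta
    # Which metadata files live on the metapartition.
    metapartition_files: header index cache expunge squat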
Replication - We initially started with pairs of machines, half of each
machine holding masters and half replicas, replicating to each other, but
that meant that on a failure one machine became fully loaded with masters,
and masters take a much bigger IO hit than replicas. Instead we went with
a system we call "slots" and "stores". Each machine is divided into a set
of "slots". "Slots" from different machines are then paired as a
replicated "store" with a master and a replica. So say you have 20 slots
per machine (half master, half replica) and 10 machines; then if one
machine fails, on average you only have to distribute one more master slot
to each of the other machines. Much better on IO. There are some more
details in this blog post on our replication trials:
http://blog.fastmail.fm/?p=576
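To make the arithmetic concrete, here's a toy Python model of that layout
(the round-robin replica placement is an assumption; the real allocator is
our own):

    import itertools
    from collections import Counter

    MACHINES = [f"m{i}" for i in range(10)]
    MASTER_SLOTS = 10  # per machine; the other 10 slots hold replicas

    # Pair each master slot with a replica slot on another machine,
    # spreading the replicas round-robin across the other nine machines.
    stores = []
    for m in MACHINES:
        cyc = itertools.cycle([x for x in MACHINES if x != m])
        for _ in range(MASTER_SLOTS):
            stores.append({"master": m, "replica": next(cyc)})

    def masters_per_machine(failed=None):
        count = Counter()
        for s in stores:
            # If the master's machine died, its replica is promoted.
            owner = s["replica"] if s["master"] == failed else s["master"]
            if owner != failed:
                count[owner] += 1
        return count

    print(masters_per_machine())      # 10 masters on every machine
    print(masters_per_machine("m0"))  # survivors end up with 11 or 12

Instead of one unlucky machine doubling its master load, the failed
machine's 10 masters get spread roughly one per surviving machine.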
Yep, this means we need quite a bit more software to manage the setup,
but now that it's done, it's quite nice and works well. For maintenance,
we can safely fail all masters off a server in a few minutes, about 10-30
seconds per store. Then we can take the machine down, do whatever we want,
bring it back up, wait for replication to catch up again, then fail any
masters we want back onto the server.
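In outline, the per-store failover goes something like the sketch below.
Every helper here is a hypothetical stand-in for our internal tooling;
the stubs just print what the real versions would do:

    # Hypothetical stubs standing in for our internal tooling.
    def final_sync(store, src, dst): print(f"sync {store}: {src} -> {dst}")
    def demote(store, machine):      print(f"demote {store} on {machine}")
    def promote(store, machine):     print(f"promote {store} on {machine}")
    def repoint(store, machine):     print(f"route {store} via {machine}")

    def fail_master_off(store, old_master, new_master):
        final_sync(store, old_master, new_master)  # replication catch-up
        demote(store, old_master)                  # stop the old master
        promote(store, new_master)                 # replica takes over
        repoint(store, new_master)                 # frontends follow

    fail_master_off("store42", "imap3", "imap7")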
Unfortunately most of the software to manage and track it all is in-house
and quite specific to our setup (e.g. it assumes particular disk layouts
and sizes, machines, database tables, hostnames, etc.). It's not very
"generic", so it's not something we're going to release.
Rob