On Mon, Sep 28, 2009 at 08:59:43AM -0700, Vincent Fox wrote:
> Lucas Zinato Carraro wrote:
> >
> > - Is there a recommended size for a Backend server (e.g. 1 TB)?
> >
> Hardware-wise your setup is probably overkill.
> Nothing wrong with that.

Yeah, that's a fair few machines!  Nice to have space for it.

> Sizing of filesystems IMO should be based on your
> tolerance for long fsck during a disaster.  I run ZFS which
> has none of that and don't want to ever see it again on
> a mail-spool.  Linux journals IME reduce the probability of it,
> but you will still find yourself looking at the fsck prompt and
> having to decide:
>
> Y = hours of downtime while I make sure it's actually OK
> N = get it going, cross fingers.

Yeah, that's painful.  Thankfully with replication we don't have it too
bad.  Just run up a new replica and then blow away the old one (in that
order!  Nice to be able to recover _something_ if you somehow lose the
other side of the replica during the resync...)

> Most Linux admins don't turn on full data journalling anyhow;
> quoting "performance reasons", they leave the default, which is
> journalling metadata only.  So you don't really know how your data
> is doing until it goes kablooey and you do an fsck with the
> filesystem unmounted.  I wouldn't go over 500 megs per FS
> until Linux has production BTRFS or something similar.

I assume you mean 500 gigs!  We're switching from 300 to 500 on new
filesystems because we have one business customer that's over 150GB now
and we want to keep all their users on the one partition for folder
sharing.  We don't do any Murder though.  We run reiserfs
(rw,noatime,nodiratime,notail,data=ordered).

> In ZFS the backups are trivial.  A script does a
> snapshot at 23:55, which takes a few seconds to complete, then
> the backup is made from the most recent snapshot.  We keep
> 14 days of snapshots in the pool; almost all recovery operations
> are satisfied from that without hitting tape.  The overhead of
> our snapshots increases storage by about 50%, but we are
> still FAR below max usage at only about 20% filled pools, with
> LZJB compression on the meta dirs and gzip on the messages.

Yeah - that sounds pretty nice.

Our backups use a custom file streaming and locking daemon on each imap
server (it can fcntl lock all the meta files and then stream them
together, to guarantee consistency).  The backup server pulls a list of
users to back up from the database and then forks, I think, 12 daemons
at the moment, which grab users in batches of 50 on a single drive unit
and process them - meaning that we don't hammer any one drive unit too
hard, but spread the load around pretty randomly.

The backups are stored in a .tar.gz file (I think I've posted about the
internal format before; it's very UUID/UniqueID centred, so it handles
renames cheaply, and does single-instance storage automatically because
identical files have the same sha1), and there's an associated sqlite
database for easy lookup, but that can be re-created just by reading
through the tar file.  gzip is nice because you can concatenate multiple
gzip files and the result is decompressible with a single gzip read,
just possibly less efficiently packed.  Tar is nice because everything
is in 512 byte blocks.

We have a custom Perl module that can read and write tar files, and
also modules that can read and write cyrus.index files.  So far I've
only bothered with read access to cyrus.header files, but it shouldn't
be too hard to write them either!  I really should productise this
thing at some point!
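For what it's worth, the fork/batch scheme above looks roughly like the
minimal sketch below - this is NOT the real backupcyrusd.pl, and
claim_batch()/backup_user() are made-up names standing in for the
database claim and the per-user backup:

    #!/usr/bin/perl
    # Minimal sketch only - not the real backupcyrusd.pl.  claim_batch()
    # and backup_user() are hypothetical; in the real thing the claim
    # would be an atomic operation against the user database so the 12
    # workers never grab the same users twice.
    use strict;
    use warnings;

    my $WORKERS    = 12;   # parallel backup workers
    my $BATCH_SIZE = 50;   # users from a single drive unit per claim

    my @kids;
    for (1 .. $WORKERS) {
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {
            # each worker claims a batch of users that all live on one
            # drive unit, backs them up, then claims another batch
            while (my ($unit, $users) = claim_batch($BATCH_SIZE)) {
                backup_user($unit, $_) for @$users;
            }
            exit 0;
        }
        push @kids, $pid;
    }
    waitpid($_, 0) for @kids;

    # stubs so the sketch compiles; replace with real implementations
    sub claim_batch { return () }    # empty list means "no work left"
    sub backup_user { }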
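And the concatenated-gzip trick is easy to demonstrate with the core
IO::Compress modules - a toy example, nothing from our code, with a
made-up file name:

    #!/usr/bin/perl
    # Two gzip members written back to back into one file still read
    # back as a single stream - the property the backup format relies
    # on when it appends new chunks to an existing .tar.gz.
    use strict;
    use warnings;
    use IO::Compress::Gzip     qw(gzip   $GzipError);
    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

    open my $out, '>', 'demo.gz' or die "open: $!";
    gzip \"first chunk\n",  $out, AutoClose => 0 or die $GzipError;
    gzip \"second chunk\n", $out, AutoClose => 1 or die $GzipError;

    # MultiStream tells gunzip to keep reading across member boundaries
    my $data;
    gunzip 'demo.gz' => \$data, MultiStream => 1 or die $GunzipError;
    print $data;    # "first chunk\nsecond chunk\n"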
It's a very nice backup system, but it's quite hard-coded.  In
particular, I should rewrite backupcyrusd.pl as a C daemon that is
managed by Cyrus, instead of something standalone.

Bron.
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html