On Fri, Jan 09, 2009 at 05:20:02PM +0200, Janne Peltonen wrote:
> I've even been playing a little with userland ZFS, but it's far from
> usable in production (was a nice little toy, though, and a /lot/ faster
> than could be believed).

Yeah - zfs-on-fuse is not something I'd want to trust production data to.

> I think other points concerning why not to change to another OS
> completely for the benefits available in ZFS were already covered by
> Bron, so I'm not going to waste bandwidth any more with this matter. :)

I did get a bit worked up about it ;)

Thankfully, I don't get confronted with fsck prompts very often, because
my response to "fsck required" is pretty simple these days :)

a) If it's a system partition - reinstall. That takes 10 minutes from
   start to finish (OK, 15 on some of the bigger servers, POST being the
   extra) and doesn't blat the data partitions. Our machines are
   installed using FAI to bring up the base operating system and install
   the "fastmail-server" Debian package, which pulls in all the packages
   we use as dependencies. It then checks out the latest subversion
   repository and runs "make -C conf install", which sets up everything
   else. This is all configured per-role and per-machine in a config
   file containing lots of little micro-languages optimised for being
   easy to read in a 'diff -u', since that's what our subversion commit
   hook emails us.

b) If it's a Cyrus partition, nuke the data and meta partitions and
   re-sync all users from the replicated pair.

c) If it's a VFS partition, nuke it and let the automated balancing
   script fill it back up in its own time. (This is the nicest one -
   it's all key-value based with sha1. I know I'll probably have to
   migrate the whole thing to sha3 at some stage, but I'm happy to wait
   until that's finalised.)

d) Oh yeah, MySQL. That's replicated between two machines as well, and
   dumped with ibbackup every night. If we lose one of these we restore
   from the previous night's backup and let replication catch up.
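The "nuke it and refill" approach in (c) works because a content-addressed
store can verify every blob against its own key. A minimal sketch of the
idea (my illustration only - the class and method names are invented, not
the actual FastMail implementation):

```python
import hashlib

class ContentStore:
    """Toy content-addressed key-value store: keys are the SHA-1 of
    the stored bytes, so data can be verified without trusting the
    peer it came from."""

    def __init__(self):
        self._blobs = {}  # sha1 hex digest -> bytes

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself; storing the
        # same bytes twice is naturally idempotent.
        key = hashlib.sha1(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        # Re-hash on read: a corrupt or tampered blob is detected
        # locally, with no reference to the node it was fetched from.
        data = self._blobs[key]
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("corrupt blob: " + key)
        return data
```

Because every blob is self-verifying, a rebuilt node can pull its data
from any replica and check it as it arrives - which is what makes wiping
a failed partition safe. The flip side is that moving to sha3 means
re-keying every blob, since the hash *is* the address.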
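The replicated-queue behaviour I describe below (at-least-once delivery,
with duplicate delivery as the only possible failure mode) boils down to
one rule: record the message on both nodes before accepting it, and only
mark it delivered *after* the delivery attempt. A rough sketch, entirely
hypothetical - no such product exists as far as I know:

```python
class QueueNode:
    """Toy model of one node in a replicated outbound mail queue."""

    def __init__(self, name):
        self.name = name
        self.pending = {}  # msg_id -> message text
        self.peer = None   # the other replica

    def enqueue(self, msg_id, message):
        # Record on BOTH replicas before telling the sender "OK",
        # so a single machine failure can never lose the message.
        self.pending[msg_id] = message
        self.peer.pending[msg_id] = message  # stand-in for real replication
        return "250 OK"

    def deliver(self, msg_id, send, crash_before_ack=False):
        # Deliver first, acknowledge second. If we crash in between,
        # the peer still sees the message as pending and will deliver
        # it again: a duplicate, never a loss.
        send(self.pending[msg_id])
        if crash_before_ack:
            return  # simulated crash: the peer never hears about it
        del self.pending[msg_id]
        self.peer.pending.pop(msg_id, None)
```

The interesting case: node A delivers, then dies before acking. Node B
still holds the message as pending and delivers it a second time - the
mail goes out twice, but it always goes out at least once.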
It's never happened (yet) on the primary pair - I've had to rebuild a
few slaves though, so the process is well tested.

So - no filesystem is sacred. Except for bloody out1 with its 1000+
queued postfix emails and no replication. It's been annoying me for over
a year now, because EVERYTHING ELSE is replicated. We've got some new
hardware in place, so I'm investigating drbd as an option here. Not
convinced, though - it still puts us at the mercy of a filesystem crash.

I'd prefer a higher-level replication solution, but I don't know of any
product that replicates outbound mail queues nicely between multiple
machines in a way that guarantees every mail will be delivered at least
once, where the only possible failure mode after a machine failure is
that the second machine isn't aware the message has already been
delivered, so it delivers it again. That's what I want.

I'd also like a replication mode for our IMAP server that guaranteed the
message was actually committed to disk on both machines before returning
OK to the lmtpd or imapd. That's a whole lot of work though.

(We actually lost an entire external drive unit the other day, and had
to move replicas to new machines. ZFS wouldn't have helped here - the
failure was hardware. We would still have had perfectly good filesystems
that were offline, and you can't serve up emails while offline.)

Bron.

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html