Re: load balancing at fastmail.fm

FastMail don't use a SAN; as I understand it, they use external RAID arrays.
There are many ways to lose your data: filesystem errors, software bugs, and human error, among others. Block-level replication (typically used in SANs) is very fast and uses few resources, but it doesn't protect against filesystem errors (although it can offer instant recovery).

If it's using block-level replication, how does it offer instant recovery from filesystem corruption? Does it track every block written to disk, so it can roll back to effectively what was on disk at a particular instant in time? Then you'd just remount the filesystem, and the journal replay should restore it to a good state.
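The "track every block written" idea above can be modelled in a few lines. This is a toy sketch of copy-on-write snapshot semantics only, not how any particular SAN or LVM implements it; the class and method names are invented for illustration:

```python
# Toy model of block-level snapshot rollback: the first time a block is
# overwritten after a snapshot, its old contents are saved, so the whole
# device can be rolled back to its state at the snapshot instant.

class SnapshottedDevice:
    def __init__(self, nblocks):
        self.blocks = [b"\x00"] * nblocks
        self.undo = None  # block index -> pre-snapshot contents

    def snapshot(self):
        self.undo = {}

    def write(self, index, data):
        if self.undo is not None and index not in self.undo:
            self.undo[index] = self.blocks[index]  # save old data once
        self.blocks[index] = data

    def rollback(self):
        for index, old in self.undo.items():
            self.blocks[index] = old
        self.undo = None

dev = SnapshottedDevice(4)
dev.write(0, b"good")
dev.snapshot()
dev.write(0, b"corrupt")  # corruption arrives after the snapshot
dev.rollback()            # "instant recovery" to the snapshot point
print(dev.blocks[0])      # b'good'
```

After the rollback you would still need the filesystem's own journal replay to reach a consistent state, which is the point being asked about.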

File-level replication is somewhat more resilient and easier to monitor, but it is just as prone to human error, bugs, misconfiguration, etc.

Any replication system is prone to human error and bugs, the most common failure being "split brain", which is possible with pretty much any replication system, regardless of approach, if you stuff up. Which is why good tools and automation that ensure you can't stuff it up are really important! :)
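One common automation guard against split brain is to refuse to promote a replica unless a majority of independent witnesses agree the master is really down. A minimal sketch, assuming hypothetical witness votes (nothing here is from any real failover tool):

```python
# Sketch of a failover guard: only promote a replica when a strict majority
# of witness nodes report the master unreachable. A single partitioned
# witness can't trigger a second live master (split brain).

def can_promote(witness_votes, total_witnesses):
    """Return True only if a strict majority of witnesses voted 'master_down'."""
    down_votes = sum(1 for v in witness_votes if v == "master_down")
    return down_votes * 2 > total_witnesses

# Two of three witnesses agree the master is down: promotion is allowed.
print(can_promote(["master_down", "master_down", "master_up"], 3))  # True
# Only one witness (perhaps itself partitioned) says so: promotion refused.
print(can_promote(["master_down", "master_up", "master_up"], 3))  # False
```

The point is that the decision is mechanical, so a tired operator can't "stuff it up" by promoting both sides.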

There will be horror stories for every system in the world. Generally speaking, ext3 is very reliable, but naturally no filesystem is going to remove the need for replication, and no replication system is going to remove the need for backups.

Indeed. Which is what we have: a replicated setup with nightly incremental backups. And things like filesystem or LVM snapshots are NOT backups; they still rely on the integrity of your filesystem, rather than living on completely separate storage.
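The nightly-incremental idea is simple to sketch: copy to separate storage only the files modified since the last run. This is an illustrative stand-in (paths and the timestamp bookkeeping are hypothetical, not FastMail's actual setup, which per the thread uses Cyrus-aware tooling):

```python
# Minimal incremental-backup selection: walk the source tree and copy files
# whose mtime is newer than the previous run's timestamp into a separate
# destination tree, preserving relative paths.
import os
import shutil

def incremental_backup(src_root, dst_root, last_run_ts):
    """Copy files under src_root modified after last_run_ts into dst_root.
    Returns the relative paths that were copied."""
    copied = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            if os.path.getmtime(src) > last_run_ts:
                rel = os.path.relpath(src, src_root)
                dst = os.path.join(dst_root, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)
                copied.append(rel)
    return copied
```

A real setup would also record each run's timestamp and keep the destination on physically separate storage, which is the whole point being made above.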

The main thing we were trying to avoid was single points of failure.

With a SAN, you generally have a very reliable, though very expensive, central data store, but it's still a single point of failure. And even better, you're dealing with a closed system for which you have to rely on a vendor for support. That may or may not be a good thing depending on your point of view.

With block-based replication, you get the hardware redundancy, but you still have the filesystem as a single point of failure. If the master end gets corrupted (e.g. http://oss.sgi.com/projects/xfs/faq.html#dir2), the other end replicates the corruption.

With file-based replication, about the only remaining failure mode is the replication software going crazy and somehow blowing both sides away. Given that the protocol is strictly designed to be one-way, it seems extremely unlikely that anything will happen to the master side.
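The "strictly one-way" property can be made structural: if the code simply has no path that reads the replica back or deletes from the master, a confused replica cannot damage the master. A hedged sketch (function and layout are illustrative, not the Cyrus replication protocol):

```python
# Sketch of strictly one-way file replication: data flows master -> replica
# only. There is deliberately no code path that copies replica -> master,
# and this function never deletes anything on the master.
import os
import shutil

def replicate_one_way(master_root, replica_root):
    """Push every file under master_root to the same relative path under
    replica_root, overwriting the replica's copy if it differs."""
    for dirpath, _dirs, files in os.walk(master_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, master_root)
            dst = os.path.join(replica_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
```

Even if the replica is full of garbage, running this can only fix the replica; the master is opened read-only throughout.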

Rob

PS. As a separate observation: if you're looking to get performance out of Cyrus with a large number of users in a significantly busy environment, don't use ext3. We've been using reiserfs for years, but after the SUSE announcement we decided to try ext3 again on one machine. We had to switch it back to reiserfs; the difference in load, and the performance difference visible to our users, was quite large. And yes, we tried dir_index and various journal options. None of them came close to matching the load and response times of our standard reiserfs mount options (noatime,nodiratime,notail,data=journal), but read these first:

http://www.irbs.net/internet/info-cyrus/0412/0042.html
http://lists.andrew.cmu.edu/pipermail/info-cyrus/2006-October/024119.html
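For reference, the mount options above would appear in /etc/fstab along these lines (device and mount point are placeholders; adjust for your own spool location):

```
/dev/sdb1   /var/spool/imap   reiserfs   noatime,nodiratime,notail,data=journal   0 0
```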


----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
