Cyrus, clusters, GFS - HA yet again

Hi list.

Sorry for the long post. I hope someone has time to read it and shed
some light on my concerns. This all boils down to one question: those
of you who have succeeded in running active-active Cyrus cluster
configurations, how have you done it?

So. Some background:

I inherited a university IMAP system with about 40k users a few weeks
ago. The system uses an old version of Cyrus; scalability has been
achieved by splitting the user base by faculty, giving each faculty its
own hostname and adding servers as needed. Currently, we have three
more-or-less independent servers, two of which serve about 16k users
each and the third a couple of thousand. This config runs fairly well,
but it has its problems. For example, shared mailboxes with users from
faculties on different servers don't work - the users have to be
migrated by hand to the correct server. Also, if one server goes down,
all the people on that server see a break in their mail service. And if
the users ever need to be split further onto more servers - downtime
again. Moreover, it's confusing for the users to have to determine
their correct IMAP server name - we haven't really had trouble with
this, but it would be nice if the users saw a unified system image.

As a solution, my predecessor had considered creating an active-active
cluster system, with Cyrus mailspool and config on shared GFS, load
balancing through a 'magic box' that arbitrates incoming connections
among the nodes, and each node running its own instance of Cyrus serving
all of the mailboxes. Such a setup would present a unified system
image, each node would be fully redundant, and adding new nodes would
be simple - the nodes could be more or less identical... However, my
predecessor left to complete his studies before even creating a test
environment for these ideas.
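
In rough outline, the plan looked something like this (the hostname is
made up):

        users --> imap.example.university (one name for everyone)
                             |
                      [load balancer]
                      /      |      \
                 node1     node2     node3   (identical Cyrus instances)
                      \      |      /
             shared GFS on SAN (/var/lib/imap, /var/spool/imap)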

For the last few weeks, I've been reading through Cyrus documentation,
GFS documentation, Red Hat Cluster Suite documentation, mailing list
archives, whatnot, trying to find out whether my predecessor's solution
is achievable. To no avail: I found lots of warnings against ever
running independent Cyri on the same filesystem, and counterexamples of
people having successfully run precisely such a beast for years. The
issue seemed to boil down to having completely functional and efficient
file locking and mmapping semantics - which GFS should have. I also
found the test conducted by the Italian group showing that GFS doesn't
perform as well as its commercial counterparts in email-like workloads -
however, my system is about an order of magnitude smaller than theirs,
so I didn't consider that an issue.
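
As a first sanity check on the locking side, something like this ought
to block on node 2 until node 1 lets go (a rough sketch using flock(1)
on a scratch file on one of the GFS mounts; note that Cyrus may be
compiled to use fcntl() locks rather than flock(), so a passing test
here proves less than it seems):

  # node 1: take an exclusive lock and hold it for a while
  flock -x /var/spool/imap/locktest -c 'sleep 30' &

  # node 2, meanwhile: should block until node 1 releases the lock
  time flock -x /var/spool/imap/locktest -c 'echo got it'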

So I set up my test environment: two HP blade servers, each with its
own system disks and two SAN-shared block devices for Cyrus; CentOS
4.4; kernel-2.6.9-42.0.2.EL; GFS-6.1.6-1; dlm-1.0.1-1; cman-1.0.11-0;
cyrus-imapd-2.2.12-3.RHEL4.1 (the revisions are those of the rpm
packages from CentOS). /var/lib/imap (the config directory) and
/var/spool/imap (the spool directory) are two GFS filesystems on cLVM
on the SAN. To keep the testing simple, I didn't set the Cyri up as
cluster services; I simply brought the cluster up with no services,
mounted the GFS filesystems on both nodes and started Cyrus on both
nodes.
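
For concreteness, the mounts on both nodes look more or less like this
(the device names here are illustrative, not my actual volume names):

  mount -t gfs /dev/mapper/vg_cyrus-lv_config /var/lib/imap
  mount -t gfs /dev/mapper/vg_cyrus-lv_spool  /var/spool/imap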

Enter weirdness. The first Cyrus to be started comes up with no
complaints and ends up with the correct number (as specified in
/etc/cyrus.conf) of imapd, imapd -s, pop3d, lmtpd etc. processes, all
in state S, with only one process at a time holding a write lock on
/var/lib/imap/socket/xxx-N.lock. The Cyrus on the other node starts
with only two instances of each imapd etc. process (except non-secure
pop3d; all of those run), and all of them end up in state D. And each
of them holds a write lock on /var/lib/imap/socket/xxx-N.lock (or, more
probably, tries to acquire the lock from the equivalent process on the
other node, which doesn't seem to give it up). And if I log in to the
Cyrus on node 1 while the Cyrus on node 2 is running, the imapd on
node 1 complains about database corruption after I log out. I don't
know whether any database is really corrupted.
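
For reference, the observations above come from things along these
lines (exact flags approximate):

  # process states: S on node 1, D on node 2
  ps axo pid,stat,args | egrep 'master|imapd|pop3d|lmtpd'

  # who holds which locks on the socket lock files
  cat /proc/locks
  lsof +D /var/lib/imap/socket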

It seems to me that running two Cyri on different nodes with a shared
config directory doesn't work even if we have the required mmap and
locking semantics: it appears that someone (perhaps the cyrus master
process?) arbitrates the locks in the <configdir>/socket directory
using some means of communication other than the shared filesystem.
lsof'ing the master and the imapd's shows that there is a pipe between
the master process and each of its children. Might it be that they
communicate over it?
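
(That is, something like

  lsof -p <master-pid> | grep pipe

on each node shows the descriptors in question; <master-pid> filled in
from ps.)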

Now. How do you run two instances of Cyrus on the same filesystem? Is
there a config option I'm missing? Or should I just give up and start
considering Murder?
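
One untested idea would be to give each node its own configdirectory
on local disk and share only the spool, roughly:

  # /etc/imapd.conf on node 1 - just a sketch, I don't know if
  # this is safe
  configdirectory: /var/lib/imap.node1
  partition-default: /var/spool/imap

but then I don't see how the mailboxes database on each node would
stay in sync, so I'd rather hear how others have solved this.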

Greetings,


--Janne Peltonen
Univ of Helsinki
Imap admin
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
