Re: Clustering and replication

On Mon, 29 Jan 2007, Tom Samplonius wrote:

----- "Bron Gondwana" <brong@xxxxxxxxxxx> wrote:
On Fri, Jan 26, 2007 at 12:20:15PM -0800, Tom Samplonius wrote:

* the system monitoring scripts do a 'du -s' on the sync directory every
  2 minutes and store the value in a database so our status commands can
  see if any store is behind (the trigger for noticing is 10kb, that's a
  couple of minutes worth of log during the U.S. day).  This also emails
  us if it gets above 100kb (approx 20 mins behind)
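As a rough illustration of the 'du -s' check described above (not the poster's actual script), a minimal Python sketch might look like the following. The function names and the directory walk are assumptions made for illustration; only the 10kb and 100kb thresholds come from the post. Note that it sums file sizes, which only approximates the disk-block count that 'du -s' reports.

```python
import os

WARN_BYTES = 10 * 1024    # "behind" threshold mentioned in the post
ALERT_BYTES = 100 * 1024  # email threshold mentioned in the post


def sync_backlog_bytes(sync_dir):
    """Sum the sizes of all files under sync_dir (an approximation of 'du -s')."""
    total = 0
    for root, _dirs, files in os.walk(sync_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file may have been consumed between listing and stat.
                pass
    return total


def classify(backlog):
    """Map a backlog size in bytes to a status string (hypothetical labels)."""
    if backlog > ALERT_BYTES:
        return "alert"
    if backlog > WARN_BYTES:
        return "behind"
    return "ok"
```

A cron job could call classify(sync_backlog_bytes(path)) every 2 minutes and store or mail the result; where it is stored is left out here.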

And what do you do if it gets behind? I have three Cyrus groups right now that are never going to catch up. They log about 20KB in 20 minutes, so the update rate is not that high. The machines are dedicated, and the replicas aren't doing anything. tcpdump confirms that there is traffic to the replica, but sync_client is so opaque that it is hard to see what it is doing. So sync_client can't keep up at all, and since it also quits from time to time, the backlog gets even worse.

I'm planning to hack the log, and add some logging to sync_client, particularly to find the number of records per second it is able to process. And then maybe find some way to work out why it quits all the time.

Either that, or my only alternative is to switch to using DRBD to sync the filesystem to a standby server.

* a "monitorsync" process that runs from cron every 10 minutes and reads
  the contents of the sync directory, comparing any log-(\d+) file's PID
  with running processes to ensure it's actually being run and tries to
  sync_client -r -f the file if there isn't one.  It also checks that
  there is a running sync_client -r process (no -f) for the store.
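The PID check in that bullet could be sketched roughly as below. This is an assumption-laden illustration, not the actual "monitorsync" script: the function names are hypothetical, and it only reports log-(\d+) files whose PID has no live process, leaving the 'sync_client -r -f' replay to the caller.

```python
import os
import re

LOG_RE = re.compile(r"^log-(\d+)$")  # file name pattern quoted in the post


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX signal-0 probe)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def orphaned_logs(sync_dir):
    """List log-<pid> files in sync_dir whose owning process is gone."""
    orphans = []
    for name in sorted(os.listdir(sync_dir)):
        match = LOG_RE.match(name)
        if match and not pid_is_running(int(match.group(1))):
            orphans.append(os.path.join(sync_dir, name))
    return orphans
```

Each path returned would then be a candidate for replaying with sync_client -r -f, as the post describes. A PID can be reused by an unrelated process, so a real script would probably also check the process name.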

Wow, that's a lot of protection against sync_client just exiting. sync_client isn't very big, so shouldn't it be fairly easy to find the different places where it exits and fix them?

* a weekly "checkreplication" script which logs in as each user to both
  the master and replica via IMAP and does a bunch of lists, examines,
  even fetches and compares the output to ensure they are identical.

Between all that, I'm pretty comfortable that replication is correct and
we'll be told if it isn't.  It's certainly helped us find our share
of issues with the replication system!
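The comparison step of a check like "checkreplication" might be sketched as follows. The snapshot structure here is purely an assumption for illustration (folder name mapped to per-UID flags and a digest); the post does not say how the real script stores or compares IMAP output, and the IMAP login and fetch steps are omitted.

```python
def diff_snapshots(master, replica):
    """Return a list of differences between two mailbox snapshots.

    Each snapshot is a hypothetical dict: folder -> {uid: (flags, digest)}.
    """
    problems = []
    for folder in sorted(set(master) | set(replica)):
        if folder not in replica:
            problems.append(f"{folder}: missing on replica")
            continue
        if folder not in master:
            problems.append(f"{folder}: only on replica")
            continue
        m, r = master[folder], replica[folder]
        for uid in sorted(set(m) | set(r)):
            if m.get(uid) != r.get(uid):
                problems.append(
                    f"{folder} uid {uid}: master={m.get(uid)} replica={r.get(uid)}"
                )
    return problems
```

An empty list means the two sides matched for that user; anything else would be worth reporting.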

Well, I know our replicas are out of sync, so we just don't use them. I just hope the masters don't fail. Each pair has about 30,000 accounts, and about 300GB of online mail.

Tom,
in your situation you may want to seriously look at disabling fsync. Doing so could let your replicas keep up.

It's definitely not ideal, but if I were forced to choose between

1. single box with fsync and no replica

or

2. master without fsync and replicas without fsync, but up to date

I would choose 2, as it won't lose any data due to a master failing, no matter what happens on the master, and I'm only vulnerable to something that would take down both the primary and its replica at the same time (don't have them both on the same UPS!)

David Lang
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
