On Thu, May 10, 2007 at 08:30:11AM -0400, Nik Conwell wrote:
> On Apr 11, 2007, at 8:37 PM, Bron Gondwana wrote:
>
> > As for complexity? It's on the cusp. We've certainly had many more
> > users on a single instance before, but we prefer to keep under 10k
> > users per Cyrus instance these days for quicker recoverability. It
> > really
>
> Hi - just a clarification question - when you say 10k users per Cyrus
> instance, and you mentioned in an earlier message that each machine
> hosts "multiple (in the teens) of this size stores", does this include
> the replicas? So for example, one of your xSeries boxes might host 16
> instances, 8 master and 8 replica, so the box would master about 80k
> users and provide replica backups for another 80k users?

Yes, your assumption is correct. We have both masters and replicas,
though nothing as neatly organised as that! Each machine's replicas are
spread over as many different machines as possible (though for
historical reasons there are a couple of pairings that are busier than
the rest - I'm working on splitting those up as we get new machines).
That way we can fail all the masters off one machine without putting
too much load on any single other machine, though it does mean we can
only have one or two machines down at any one time, rather than up to
half of them.

We actually lost a controller chip in a RAID unit recently, and our
"hot spare" turned out to be broken as well, so we had a choice: leave
replication down, or expand into the spare slots we had sitting around.
We wound up expanding.

I have a script called sync_all_users which runs in tandem with
monitorsync. Monitorsync runs from cron every 10 minutes and checks
that a sync_client process is running correctly for each master slot on
the machine. It will also run sync_client on any log files left over
after a failure, email us about what's happening, restart rolling
replication, and so on. It's very nice. It has locking which integrates
with our failover script (which runs replication for any remaining log
files after taking Cyrus down).

sync_all_users runs a "sync_client -u" on every user who is in our
database as "should be active on this machine", cleans out any log
files written before it started, and then starts rolling replication on
the logs written since it started. (You could do something cleverer
with sortable timestamps, but it's a pointless optimisation - rolling
replication tends to catch up quickly when little has changed anyway.)

So it took maybe a day to be fully back up to date, which still isn't
ideal, but it was a day of no downtime - just a day without the safety
of up-to-date replicas.

Bron.
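
A rough sketch of the kind of cron watchdog described above might look
like the following. Everything here is illustrative, not FastMail's
actual script: the per-slot config files under /etc/cyrus, the sync
directories under /var/imap, the log-* names, and the mail recipient
are all assumptions. Only the sync_client flags (-C for an alternate
config file, -r for rolling replication, -f to process a named log
file) are standard Cyrus 2.3 replication usage - check the sync_client
man page for your version.

    #!/bin/sh
    # Hypothetical monitorsync sketch, run from cron every 10 minutes.
    # Assumes one Cyrus instance ("slot") per config file; all paths
    # and names are illustrative.
    # (The real thing also takes a lock shared with the failover
    # script, so the two can't race - omitted here.)
    for conf in /etc/cyrus/slot*.conf; do
        slot=$(basename "$conf" .conf)

        # Skip slots that already have a rolling sync_client running.
        if pgrep -f "sync_client -C $conf -r" >/dev/null; then
            continue
        fi

        # Replay any log files left over from a crash or failover
        # before restarting rolling mode (-r -f processes the named
        # log file and exits).
        for log in /var/imap/"$slot"/sync/log-*; do
            [ -e "$log" ] || continue
            sync_client -C "$conf" -r -f "$log" && rm -f "$log"
        done

        # Restart rolling replication and let the admins know.
        sync_client -C "$conf" -r &
        echo "rolling replication restarted for $slot" |
            mail -s "monitorsync: $slot restarted" postmaster
    done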
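
The sync_all_users side - a full per-user resync followed by a restart
of rolling replication - could be sketched the same way. The
users_for_slot helper (printing the userids the user database says
belong on the slot) is a hypothetical stand-in, as are the paths, and
stat -c %Y is GNU stat.

    #!/bin/sh
    # Hypothetical sync_all_users sketch for a single slot.
    conf=/etc/cyrus/slot1.conf
    syncdir=/var/imap/slot1/sync

    # Record when the full run starts: changes logged before this
    # point will be covered by the per-user syncs below.
    start=$(date +%s)

    # Full resync: push every user that should be active on this
    # master.  users_for_slot stands in for a database query.
    users_for_slot slot1 | while read -r user; do
        sync_client -C "$conf" -u "$user"
    done

    # Drop log files wholly written before the full run started ...
    for log in "$syncdir"/log-*; do
        [ -e "$log" ] || continue
        if [ "$(stat -c %Y "$log")" -lt "$start" ]; then
            rm -f "$log"
        fi
    done

    # ... then restart rolling replication for everything newer; it
    # tends to catch up quickly since little will have changed.
    sync_client -C "$conf" -r &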