I apologize for not responding sooner here. I've had my head down in the code and doing some tests, including playing with Bron's patch here. I haven't had the guts to roll the patched CVS version into production as our primary mupdate server, but I did put it on a test machine in replica mode. My measurement was on a clean server (no pre-existing mailboxes.db), and it didn't appear noticeably faster. I haven't measured hard numbers, but it was still well over 10 minutes to complete the sync and write it out to disk.

The odd thing is that we see major performance differences depending on what disk the client is living on. For instance, if we put the mailboxes.db (and the whole metapartition) on superfast Hitachi disks over a 4 Gb SAN connection, the sync will finish in just under three minutes. Still, even with that big a difference, we don't see any kind of I/O contention in the iostat output. The k/sec figures are well within what the drives should be able to handle, and the % blocking stays in the low single digits most of the time, peaking up into the 15-25 range now and then but not staying there. It does make me wonder if what we're seeing is related to I/O latency.

I haven't delved deep into the skiplist code, but I almost wonder if at least some of the slowness is the foreach iteration on the mupdate master in read mode. On all systems in the murder, we'll see instances where the mupdate process goes into a spin where, in truss, it's an endless repeat of fcntl, stat, fstat, fcntl, thousands of times over. Each of these executes extremely quickly, but I do wonder if we're assuming that something that takes very little time takes an insignificant amount of time, when the time involved becomes significant on an 800k-entry mailboxes.db.
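To put a rough number on that, I threw together a quick stand-alone test (it has nothing to do with the actual Cyrus code paths; the file name and iteration count below are just placeholders) to see what 800k lock/stat/unlock round trips cost on their own, before any real data moves:

/*
 * Rough stand-alone micro-benchmark, not Cyrus code.  The file name
 * and the 800k iteration count are placeholders; the point is just
 * to see what the per-record fcntl/stat/fstat overhead adds up to.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>

int main(void)
{
    const long iterations = 800000;     /* roughly one pass over our mailboxes.db */
    int fd = open("/tmp/locktest", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval start, end;
    gettimeofday(&start, NULL);

    for (long i = 0; i < iterations; i++) {
        struct flock fl = { .l_type = F_RDLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 0 };
        struct stat sb;

        fcntl(fd, F_SETLKW, &fl);       /* take a shared lock */
        stat("/tmp/locktest", &sb);     /* the stat ... */
        fstat(fd, &sb);                 /* ... and fstat pair truss keeps showing */
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLKW, &fl);       /* and release the lock again */
    }

    gettimeofday(&end, NULL);
    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%ld iterations in %.2f s (%.1f us each)\n",
           iterations, secs, secs / iterations * 1e6);

    close(fd);
    return 0;
}

Even at ten or twenty microseconds per round trip, 800k records works out to tens of seconds of pure overhead before a single byte of mailbox data moves, and the real path does several of these per record plus the network round trips to the master, so I don't think it's crazy to suspect per-record latency rather than raw throughput.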
Finally, as to how we get into this situation in the first place: it appears that the mupdate master, in our environment and configuration, can handle having up to three replicas connected to it before it goes into a bad state under high load. I've never caught it at the point of actually going downhill, but my impression is that so many processes start demanding responses from the mupdate server that the persistent connections the slave mupdates have to the master time out and disconnect, then reconnect and try to re-sync. (At least that's what it looks like in the logs.) Incoming IMAP connections won't do it, but lmtpproxy connections seem to have a knack for it, since for whatever reason they appear to generate "kicks" at a pretty high rate.

Still looking, but open to suggestions here.

Michael Bacon
UNC Chapel Hill

--On October 20, 2009 12:54:45 PM +1100 Bron Gondwana <brong@xxxxxxxxxxx> wrote:

>
> On Mon, 19 Oct 2009 16:38 -0400, "Michael Bacon" <baconm@xxxxxxxxxxxxx>
> wrote:
>> When we spec'ed out our servers, we didn't put much I/O capacity into
>> the front-end servers -- just a pair of mirrored 10k disks doing the
>> OS, the logging, the mailboxes.db, and all the webmail action going on
>> in another solaris zone on the same hardware. We thought this was
>> sufficient given the fact that no real permanent data lives on these
>> servers, but it turns out that while most of the time it's fine, if
>> the mupdate processes ever decide they need to re-sync with the master,
>> we've got 6 minutes of trouble ahead while it downloads and stores the
>> 800k entries in the mailboxes.db.
>
> Have you checked if it's actually IO limited? Reading the code, it
> appears to do the entire sync in a single transaction, which is bad
> because it locks the entire mailboxes.db for the entire time.
>
>> During these sync periods, we see two negative impacts. The first is
>> lockup on the mailboxes.db on the front-end servers, which slows down
>> both accepting new IMAP/POP connections and the reception of incoming
>> messages. (The front-ends also accept LMTP connections from a separate
>> pair of queueing hosts, then proxy those to the back-ends.) The second
>> is that, because the front-ends go into a
>
> Lost you there - I'm assuming it causes a nasty load spike when it
> finishes too. Makes sense.
>
>> I suppose this is why Fastmail and others ripped out the proxyds and
>> replaced them with nginx or perdition. Currently we still support
>> GSSAPI as an auth mechanism, which kept me from going that direction,
>> but given the problems we're seeing, I'd be open to architectural
>> suggestions on either how to tie perdition or nginx to the MUPDATE
>> master (because we don't have the back-ends split along any
>> discernible lines at this point), or suggestions on how to make the
>> master-to-frontend propagation faster or less painful.
>
> We didn't ever go with murder. All our backends are totally independent.
>
>> Sorry for the long message, but it's not a simple problem we're
>> fighting.
>
> No - it's not! I wonder if a better approach would be to batch the
> mailboxes.db updates into groups of no more than (say) 256.
>
> Arrgh - stupid, stupid, stupid. Layers of abstraction mean we have a
> nice fast "foreach" going on, and then throw away the data and dataptr
> fields, followed by which we fetch the data field again. It's very
> inefficient. I wonder what percentage of the time is just reading
> stuff from the mailboxes.db?
>
> Anyway - the bit that's actually going to be blocking you will be the
> mailboxes.db transactions. I've attached a patch. Advance warning - I
> don't use murder, so I haven't done more than compile test it! It
> SHOULD be safe though, it just commits to the mailboxes.db every 256
> changes and then closes the transaction, which means that things that
> were queued waiting for the lock should get a chance to run before you
> update the next 256 records.
>
> The patch is against current CVS (well, against my git clone of
> current CVS anyway).
>
> Bron.
> --
> Bron Gondwana
> brong@xxxxxxxxxxx
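One last thing, mostly to check my own understanding of the approach: the shape of what Bron describes above (commit every 256 stores, then let whoever is queued on the mailboxes.db lock run before starting the next batch) is roughly the following. It's only a toy sketch: the db_store/db_commit names, the struct layouts, and the numbers are stand-ins I made up for illustration, not the real cyrusdb calls, and it is not the actual patch.

/*
 * Toy sketch of the batch-and-commit idea only.  The db_* functions
 * and struct layouts are made-up stand-ins, not the real cyrusdb API,
 * and this is not the actual patch.
 */
#include <stdio.h>
#include <stddef.h>

#define BATCH_SIZE 256

struct txn { int pending; };            /* toy transaction handle */

/* stand-in: begins a transaction on first use (when *tid is NULL) */
static int db_store(const char *key, const char *data, struct txn **tid)
{
    static struct txn t;
    if (*tid == NULL) { t.pending = 0; *tid = &t; }
    (*tid)->pending++;
    (void)key; (void)data;
    return 0;
}

/* stand-in: commits and releases the mailboxes.db lock */
static int db_commit(struct txn *tid)
{
    printf("committed %d records, lock released\n", tid->pending);
    return 0;
}

struct update { const char *mailbox; const char *entry; };

static int apply_updates(const struct update *updates, size_t count)
{
    struct txn *tid = NULL;
    size_t in_batch = 0;
    int r = 0;

    for (size_t i = 0; i < count; i++) {
        r = db_store(updates[i].mailbox, updates[i].entry, &tid);
        if (r) break;

        if (++in_batch >= BATCH_SIZE) {
            r = db_commit(tid);         /* drop the lock so queued imapd and
                                           lmtpproxy processes get a turn */
            if (r) break;
            tid = NULL;                 /* next store starts a fresh transaction */
            in_batch = 0;
        }
    }

    if (!r && tid) r = db_commit(tid);  /* flush the final partial batch */
    return r;
}

int main(void)
{
    /* 1000 fake records: expect commits of 256, 256, 256, then 232 */
    static struct update fake[1000];
    for (size_t i = 0; i < 1000; i++) {
        fake[i].mailbox = "user.test";
        fake[i].entry = "(placeholder mailboxes.db value)";
    }
    return apply_updates(fake, 1000);
}

The appeal, as I read it, is that the worst-case lock hold drops from one transaction covering all 800k records to one covering 256, at the cost of the sync no longer being a single atomic unit, which for a replica re-sync seems like a reasonable trade.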