> I suspect that the problem is with mailbox renames, which are not atomic
> and can take some time to complete with very large mailboxes.

I think there are some other issues as well. For instance, we still see skiplist seen state databases get corrupted every now and then. It seems certain corruption can result in the skiplist code calling abort(), which terminates the sync_server and causes the sync_client to bail out. I had a backtrace on one of them the other day, but the stack frames were all wrong, so it didn't seem that useful.

> HERMES_FAST_RENAME:
>   Translates mailbox rename into filesystem rename() where possible.
>   Useful because sync_client chdir()s into the working directory.
>   Would be less useful in 2.3 with split metadata.

It would still be nice to do this to make renames faster anyway. If you did:

1. Add new mailboxes to mailboxes.db
2. Filesystem rename
3. Remove old mailboxes

you would still end up with a race condition, but it's far shorter than the mess you can end up with at the moment if a restart occurs during a rename.

> Together with my version of delayed expunge this pretty much guarantees
> that things aren't moving around under sync_client's feet. Its been an
> awful long time (about a year?) since I last had a sync_client bail out.
>
> We are moving to 2.3 over the summer (initially using my own original
> replication code), so this is something that I would like to sort out.
>
> Any suggestions?

I can try to keep an eye on bailouts some more, and see if I can get some more details. It would be nice if there was some more logging about why the bailout code path was actually called!

Rob

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
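
For illustration only, the three-step rename ordering suggested in the message above (add the new entry, do the filesystem rename, remove the old entry) might be sketched roughly as follows. This is not Cyrus code: the dict `db` is a stand-in for mailboxes.db, and `rename_mailbox`, `drop_dangling`, `demo` and the `spool` layout are hypothetical names made up for the sketch. The `drop_dangling` pass is also an assumption, showing one way a restart inside the short race window could be cleaned up.

```python
import os
import tempfile


def rename_mailbox(db, spool, old, new):
    """Rename mailbox `old` to `new` using the three-step ordering.

    `db` is a plain dict standing in for mailboxes.db (an assumption);
    it maps mailbox name -> directory path under `spool`.
    """
    old_path = os.path.join(spool, old)
    new_path = os.path.join(spool, new)
    # Step 1: add the new mailbox name to the mailbox list.
    db[new] = new_path
    # Step 2: one filesystem rename instead of copying messages.
    # POSIX rename() is atomic when both paths are on one filesystem.
    os.rename(old_path, new_path)
    # Step 3: remove the old mailbox name.
    del db[old]


def drop_dangling(db):
    """Hypothetical recovery pass: forget entries whose directory is missing.

    If a restart happens between steps 1 and 3, exactly one of the two
    entries points at a directory that does not exist.
    """
    for name in [n for n, path in db.items() if not os.path.isdir(path)]:
        del db[name]


def demo():
    """Create one mailbox in a temporary spool, rename it, report the result."""
    with tempfile.TemporaryDirectory() as spool:
        os.mkdir(os.path.join(spool, "user.rob"))
        db = {"user.rob": os.path.join(spool, "user.rob")}
        rename_mailbox(db, spool, "user.rob", "user.robm")
        return sorted(db), os.path.isdir(os.path.join(spool, "user.robm"))
```

The window in which both names are listed lasts only for the duration of a single rename() call, which is the point of the ordering: a crash leaves at most one dangling entry rather than a half-copied mailbox.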