So here's the story of the UC Davis (no, not Berkeley) Cyrus conversion.....

We have about 60K active accounts, and another 10K that are forwards, etc. We had 10 UWash servers that were struggling to keep up with a load that in 2006 was running around 2 million incoming emails a day, before spam dumpage, etc. The UWash setup was pretty grody, with each chunk of 4K-5K users knowing which "color" server was theirs. For example, if you had an account jsmith1 you could look up on our web site that you connected to yellow.ucdavis.edu or whatever. Moving accounts was a giant PITA because if we had to move people they had to be notified they were moving to a new "color". We knew we had to do SOMETHING, and during an academic year you politically CANNOT just plonk in a Cyrus Murder and say go, so we started putting new infrastructure up to ease our eventual move.

1st STEP: Perdition mail-proxies
================================

We set up 4 Sun X2100s with RHEL4 running Perdition to answer to mail.ucdavis.edu and redirect users to the right backend. These are in a load-balanced pool and 2 can handle the load most days. Initially we were pulling the redirects from a NIS map, then later from LDAP.

2nd STEP: LDAP infrastructure
=============================

So we want all delivery lookups and Perdition redirects to come from LDAP. We created another load-balanced pool with 4 little Sun V210s running Sun Directory Server 5.2, with a pair as hubs and a pair as consumers. So far they have held up admirably to our needs. The Sun course for DS is great, by the way, for performance fine-tuning tips.

3rd STEP: MX pulling from LDAP
==============================

We modified our MX pool systems to pull from LDAP instead of NIS. This went without incident, although we did see occasional lookup errors before we started tuning the LDAP servers to increase threads.

4th STEP: Cyrus infrastructure creation
=======================================

After much consultation with other universities we decided on Sun systems for the backend mail stores. We went with Sun T2000s, each with 2 HBAs wired to separate SAN switches, and dual 3510FC arrays. The intent was to have a Sun Cluster 3.2 setup in failover mode so that if any single major component failed there would be no service interruption. We already had background in Sun HA and similar, so it seemed like fewer man-hours than starting over with a Linux-based solution and trying to cobble something similar together.

We went with ZFS as our filesystem format for Cyrus storage and this has worked out well. The snapshots, and the ability to survive minor disk write errors in a mirrored setup like ours, let us all sleep a lot better. I recall 6 times in the prior year where UFS errors gave us grief. We ran a truckload of benchmarks against our hardware using SLAMD and it seemed to indicate we were in great shape for 30K-40K users per cluster.

5th STEP: Cyrus migration
=========================

The politics of an educational environment are that you MUST do massive changeouts like this during summer quarter. So for the last couple of months of summer we were busily migrating all the UWash users to Cyrus. About 29K users to ms1, and 23K users to ms2. Everything worked great. Typically about 500 Cyrus processes were running.

6th STEP: The excrement hits the rotating blades
================================================

About a week before classes actually start is when all the kids start moving back into town and mailing all their buds. We saw process numbers go from 500-ish to as high as 5,000.
Load would climb radically after passing 2,000 processes and the systems became slow to respond. This persisted for 4 days, with us on the phone with Ken & Jeff and anyone else who would talk to us, trying to find the right tweaks on the Cyrus software. We tried moving to quota-legacy, using BDB for the delivery database, and a few other suggested tweaks, but none brought us substantial relief.

We are running as high as 4 million emails arriving a day now, which is about double last year, with say 1-1.5 million being dumped by the virus/spam filtering. So about 2-2.5 million arrive at the backend mailstores.

Meanwhile I was scavenging the bones of the UWash infrastructure and rebuilding them as Cyrus systems. So we migrated some users BACK off our big new boxes onto smaller ones. The magic point seemed to be 15K users on a T2000; below that we were fine. We are all Cyrus now, so the migrations are considerably less difficult.

7th STEP: Post-Mortem
=====================

I don't know precisely what goes wrong, although we have lots of speculation. I do know that the processes were piling up. We have various candidates for this in both the OS and the application. There is some bottleneck on a resource that, once it reaches a certain busyness level, makes everything start backing up. No amount of dtrace or truss fiddling pinned it down further than possible locking issues on the many databases. Frankly, after 4 days we ran out of time, due to user pressure, to dig completely to the bottom of it.

I would caution large sites in the future that more than 10K users per backend with a high email volume is heading into unknown territory. We have talked to a few sites carrying 30K and up users per system, but so far all with much lower email activity. We are actually having a Sun engineer on-site in a few days and will try to see if we can pinpoint some issues, or at least find usable workarounds on our hardware, such as Zones. The theory being that 2 or 3 Zones on a T2000 with say 10K users each would still let us accommodate the same number of users on the new hardware that we had originally targeted.

I append the comments of one of our people with a local theory:

------------
Omen Wild (University of California Davis)

The root problem seems to be an interaction between Solaris' concept of global memory consistency and the fact that Cyrus spawns many processes that all memory map (mmap) the same file. Whenever any process updates any part of a memory-mapped file, Solaris freezes all of the processes that have that file mmapped, updates their memory tables, and then re-schedules the processes to run. When we have problems we see the load average go extremely high and no useful work gets done by Cyrus. Logins get processed by saslauthd, but listing an inbox either takes a long time or completely times out.

Apparently AIX also runs into this issue. I talked to one email administrator who had this exact issue under AIX. That admin talked to the kernel engineers at IBM, who explained that this is a feature, not a bug. They eventually switched to Linux, which solved their issues, although they did move to more Linux boxes with fewer users per box.
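To make that access pattern concrete, here is a toy Python sketch (not Cyrus code; the file path and counts are made up) of the same shape of workload: a batch of processes all mmap one file while a single writer keeps dirtying a page in it. The real systems have thousands of mappers rather than fifty, but the pattern is the same.

#!/usr/bin/env python3
# Toy sketch only -- NOT Cyrus code.  It reproduces the shape of the workload
# described above: many processes mmap the same file, one process updates it.
import mmap
import multiprocessing
import os
import time

PATH = "/tmp/shared-map.db"   # hypothetical stand-in for a shared Cyrus database file
SIZE = 4096                   # one page is enough to show the pattern

def reader(path):
    """Map the shared file and poll it, like an imapd consulting a shared db."""
    with open(path, "r+b") as f:
        m = mmap.mmap(f.fileno(), SIZE)
        for _ in range(2000):
            _ = m[:16]                # read the shared page
            time.sleep(0.001)
        m.close()

def writer(path):
    """Map the same file and keep updating it, like lmtpd appending records."""
    with open(path, "r+b") as f:
        m = mmap.mmap(f.fileno(), SIZE)
        for i in range(2000):
            m[:16] = b"%015d\n" % i   # dirty the shared page
            time.sleep(0.001)
        m.close()

if __name__ == "__main__":
    with open(PATH, "wb") as f:
        f.write(b"\0" * SIZE)         # pre-size the file so it can be mapped
    procs = [multiprocessing.Process(target=reader, args=(PATH,)) for _ in range(50)]
    procs.append(multiprocessing.Process(target=writer, args=(PATH,)))
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    os.unlink(PATH)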
We have supporting evidence in the fact that sar shows the %sys CPU time running 2-3x the %usr CPU time:

00:02:01    %usr    %sys    %wio   %idle
[ snip ]
11:17:01       3       7       0      90
11:32:02       3       7       0      90
11:47:01       3       7       0      90
12:02:01       3       7       0      91
12:17:01       3       7       0      90
12:32:01       3       6       0      91
12:47:01       3       6       0      91
13:02:01       3       6       0      92
13:17:01       3       6       0      91
13:32:01       3       6       0      92
13:47:02       3       6       0      92
14:02:01       3       6       0      91
14:17:01       3       7       0      90
14:33:54       2       4       0      94
14:47:01       4      10       0      86

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html