--On Tuesday, October 16, 2007 3:39 PM -0700 Vincent Fox <vbfox@xxxxxxxxxxx> wrote:
------------ Omen Wild (University of California Davis)

The root problem seems to be an interaction between Solaris' concept of global memory consistency and the fact that Cyrus spawns many processes that all memory map (mmap) the same file. Whenever any process updates any part of a memory mapped file, Solaris freezes all of the processes that have that file mmapped, updates their memory tables, and then re-schedules the processes to run.

When we have problems we see the load average go extremely high and no useful work gets done by Cyrus. Logins get processed by saslauthd, but listing an inbox either takes a long time or completely times out.

Apparently AIX also runs into this issue. I talked to one email administrator who had this exact issue under AIX. That admin talked to the kernel engineers at IBM, who explained that this is a feature, not a bug. They eventually switched to Linux, which solved their issues, although they did move to more Linux boxes with fewer users per box.
Oh man... Horrible memories just flood right back... Wow. I was reading your e-mail and thinking to myself that this sounded like the same problem we had. Then I got to the above section and *bam*, there it was...

We had significant problems with our e-mail last year (this year was a perfect start!) a week before students came back. We didn't resolve the problems until the end of September and we were dismayed at our final solution.

We run Tru64 5.1b on a 4 member cluster. Tru64's kernel suffers from the same exact issue as described above. We regularly have 12,000 cyrus procs running at any one time during the day, and that cluster also receives on average 300k-500k e-mails each day (that is after spam/virus work).

What was finally identified was that the number of "processes" that were mapped to that single physical "executable" (/usr/cyrus/imapd) was causing a lot of lock contention in the kernel. The executable would have a linked list of all the processes running off of it in kernel memory. When one of the processes would go away, the kernel would start at the beginning of the list and search for the process in order to clean up its resources. During that time, the kernel would lock everything and execution would essentially stop for everything (basically, the whole system appeared to simply freeze on us).

The kernel would reach a time threshold and stop in order to let other things happen (unfreeze). This time was very short, but if we had a lot of processes going away in a very short period of time, we would noticeably see the freeze, since the kernel was going into this lock-down mode a lot in a very short period of time.

That is a simplified view of what really happened. HP recommends that we keep the linked list down to only a few hundred processes at most.
They were working on a kernel patch to make it a hash instead of a linked list in the kernel, but as they got deeper into the making of this patch, they found that it impacts a lot more than they initially realized. The last I heard, this might make it into the PK7 patch release, which is likely sometime next year.

Meanwhile, we hacked around this in a very cool way. We copied the imapd executable 60 times (assuming an average of 12,000 processes and shooting for 200 processes per executable, that is 60 individual executables). These were named /usr/cyrus/bin/imapd_001 through /usr/cyrus/bin/imapd_060. We then symlinked the "imapd" binary to imapd_001. We then wrote a cron job that ran once a minute and relinked the imapd symlink to the next numbered executable, rotating around to imapd_001 when the end was reached.

This worked like a charm and *all* of our problems went away... In fact, our system has continued to get busier and we are still running pretty well. I don't think the hack is ideal, but man, does it work!

Scott
--
 +-----------------------------------------------------------------------+
  Scott W. Adkins              Work (740)593-9478    Fax (740)593-1944
  UNIX Systems Engineer        <mailto:adkinss@xxxxxxxx>
 +-----------------------------------------------------------------------+
  PGP Public Key <http://edirectory.ohio.edu/?$search?uid=adkinss>
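
For anyone wanting to try the same trick, the rotation described above could be sketched roughly as follows. This is not the script that was actually run; it is a minimal Python sketch under stated assumptions: the copies sit next to the symlink and follow the imapd_NNN naming from the message, the link uses a relative target, and the helper names (executables_needed, next_target, rotate) and the temporary link name are made up for illustration.

```python
import os
import re


def executables_needed(total_procs: int, procs_per_executable: int) -> int:
    """Ceiling division: how many copies keep each list near the target size."""
    return -(-total_procs // procs_per_executable)


def next_target(current: str, count: int = 60) -> str:
    """Return the name that follows `current`, wrapping back to imapd_001."""
    match = re.fullmatch(r"imapd_(\d{3})", os.path.basename(current))
    index = int(match.group(1)) if match else 0  # unknown target: start at 001
    return f"imapd_{index % count + 1:03d}"


def rotate(link_path: str, count: int = 60) -> str:
    """Point the symlink at the next numbered copy and return the new target."""
    directory = os.path.dirname(link_path) or "."
    current = os.readlink(link_path) if os.path.islink(link_path) else ""
    target = next_target(current, count)

    # Hypothetical temporary name; build the new link beside the old one.
    tmp_path = os.path.join(directory, ".imapd.rotate.tmp")
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target, tmp_path)

    # Renaming over the old link swaps it in one step, so a process spawned
    # mid-rotation always finds some valid "imapd" to execute.
    os.replace(tmp_path, link_path)
    return target
```

With the numbers from the message, executables_needed(12000, 200) gives 60. A crontab entry on a `* * * * *` schedule calling rotate("/usr/cyrus/bin/imapd") would move the link once a minute. Already-running processes stay attached to whichever copy they were started from; only newly spawned ones pick up the new target, which is what spreads the processes across the copies.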
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html