Since you mention it... I took a look at a random frontend, and found
27 of 33 pop processes from two days ago. I used gdb to get stack
traces from 3 samples; all looked like this:

    (gdb) where
    #0  0x008007a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
    #1  0x008d6ff3 in __read_nocancel () from /lib/tls/libc.so.6
    #2  0x0806cb77 in prot_fill (s=0x8ecc148) at prot.c:470
    #3  0x0806d924 in prot_fgets (buf=0xbff48160 "", size=2047,
        s=0x8ecc148) at prot.c:1186
    #4  0x0804f57e in backend_connect (ret_backend=0x0,
        server=0x81045a0 "some.server", prot=0x80fad20,
        userid=0xbff49cb0 "someuser", cb=0x0,
        auth_status=0xbff48a40) at backend.c:477
    #5  0x0804c8df in openinbox () at pop3d.c:1635
    #6  0x0804d6d9 in cmdloop () at pop3d.c:1227
    #7  0x0804e6ad in service_main (argc=2, argv=0x8e6e008,
        envp=0xbff4ebf8) at pop3d.c:579
    #8  0x08052374 in main (argc=4, argv=0xbff4ebe4,
        envp=0xbff4ebf8) at service.c:540
    #9  0x0082ee93 in __libc_start_main () from /lib/tls/libc.so.6
    #10 0x0804ba81 in ?? ()
    (gdb)

In other words, they were all waiting in backend_connect() for the
backend server. That's not what's going on in your case, tho.

Looking at the code in backend_connect(), it's pretty clear that no
timeout is set when retrieving the banner. That's a bug, and it
impacts *every* tool that uses backend_connect() to communicate within
the cluster. It may not be your problem, but it's definitely *a*
problem. A simple:

    prot_settimeout( ret->in, 360 );

right after:

    ret->in = prot_new(sock, 0);

would probably do the trick (totally untested, to be sure; a sketch
follows below).

For your problem, pop3d calls:

    prot_settimeout(popd_in, popd_timeout);

just below where you've inserted the KEEPALIVE. What do you have
poptimeout set to?

I wouldn't be surprised by a bug in prot, BTW. I'm pretty sure I've
seen a case where select() is used to implement the timeout, but once
there's *some* input, read() is called with blocking (wrong!).
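To illustrate the shape of that bug, here's a minimal sketch against a
bare file descriptor rather than prot's internals (illustrative only;
the function name and structure are mine, not prot.c's):

    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/time.h>
    #include <unistd.h>
    #include <errno.h>

    /*
     * Buggy shape: select() once up front, then loop over blocking
     * read() calls.  Once *one* byte has arrived, a stalled peer parks
     * the process in read() with no timeout at all.
     *
     * Correct shape: re-arm select() before *every* read(), so the
     * timeout covers the whole fill, not just the first byte.
     */
    static ssize_t read_with_timeout(int fd, void *buf, size_t len,
                                     int secs)
    {
        fd_set rfds;
        struct timeval tv;
        int n;

        for (;;) {
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            tv.tv_sec = secs;   /* reset each pass; select() may modify tv */
            tv.tv_usec = 0;

            n = select(fd + 1, &rfds, NULL, NULL, &tv);
            if (n == 0) {
                errno = ETIMEDOUT;          /* timed out with no data */
                return -1;
            }
            if (n < 0) {
                if (errno == EINTR) continue;  /* interrupted; retry */
                return -1;
            }
            return read(fd, buf, len);      /* readable; won't park */
        }
    }

A pop3d parked in read() on the client socket, like the pstack output
in your message below, is exactly what the buggy shape would produce.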
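And to make the backend_connect() suggestion above concrete, as a
patch sketch (untested; the context lines are from memory and may not
match backend.c exactly):

    --- backend.c
    +++ backend.c
    @@ ... @@
         ret->in = prot_new(sock, 0);
    +    /* untested: bound the banner read so a wedged backend can't
    +     * hang every frontend tool that goes through backend_connect() */
    +    prot_settimeout(ret->in, 360);
         ret->out = prot_new(sock, 1);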
In any case, if you can get a traceback with gdb for some hung
pop3d's, I'm sure we can pinpoint the issue.

:wes

On 27 May 2010, at 17:52, Gary Mills wrote:

> Ever since I can remember, our Cyrus installation had a problem with
> pop3d processes accumulating on the murder front end server. This
> didn't happen with imapd processes or with pop3d on the back end. A
> couple of weeks ago, I counted 423 pop3d processes on the front end
> but only 37 on the back end. Some of them were months old. All had
> an established TCP connection from a client. Here's a typical stack
> trace:
>
> # pstack 12708
> 12708:  pop3d -s
>  feb1a5c5 read     (0, 817faf0, b)
>  fec2dfaf sock_read () + 3f
>
> POP3 timeouts were enabled on both front and back ends, but the
> timeout seemed not to work on the front end. We're still running
> cyrus-imapd-2.3.8. It's possible that this problem is fixed in the
> current version, cyrus-imapd-2.3.16.
>
> In any case, I wanted to try enabling TCP keepalive to see if it had
> any effect on the problem. This only required a few lines of code:
>
> --- pop3d.c-nokeep    Wed Apr 11 10:49:59 2007
> +++ pop3d.c   Mon May 17 18:17:22 2010
> @@ -494,6 +494,12 @@
>      if (getsockname(0, (struct sockaddr *)&popd_localaddr, &salen) == 0) {
>          popd_haveaddr = 1;
>      }
> +    /* Set keepalive option */
> +    {
> +        int oval = 1;
> +        (void)setsockopt(0, SOL_SOCKET, SO_KEEPALIVE, (const void *)&oval,
> +                         sizeof(oval));
> +    }
>  }
>
>  /* other params should be filled in */
>
> A complete installation would include a configuration setting to
> enable or disable TCP keepalive, along with ways to set the keepalive
> values that exist in many operating systems. This was just a test,
> but it was quite impressive. `pop3d' processes no longer accumulated
> on the front end, but were similar in number to the ones on the back
> end. The cause must have been clients that disappeared without
> closing their TCP connections. The TCP keepalive mechanism now does
> this for them, after about half an hour of idleness.
>
> Does anyone know if this problem has been solved by a timeout in
> later Cyrus versions? That's actually a better solution. It only
> seems to happen when pop3d runs on a murder front end, relaying
> connections to a back end. If it hasn't been solved, I'll proceed
> with the keepalive solution. Otherwise, I'll plan for an upgrade.
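As a footnote to the configuration point in Gary's message: on Linux,
the OS-specific keepalive values he mentions are per-socket options,
so a configurable version of his patch might look something like this
(TCP_KEEPIDLE and friends are the Linux spellings; the parameters, and
wiring them to imapd.conf, are hypothetical):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Sketch: enable keepalive and, where the OS supports per-socket
     * timers, shorten them from the kernel defaults (typically two
     * hours of idleness before the first probe). */
    static void set_keepalive(int fd, int idle, int intvl, int cnt)
    {
        int on = 1;

        (void)setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
    #ifdef TCP_KEEPIDLE   /* seconds of idleness before the first probe */
        (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    #endif
    #ifdef TCP_KEEPINTVL  /* seconds between probes */
        (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    #endif
    #ifdef TCP_KEEPCNT    /* unanswered probes before the kernel drops it */
        (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    #endif
    }

Something like set_keepalive(0, 1800, 75, 9) would start probing after
30 minutes of idleness, roughly the behaviour Gary describes; systems
without these options just keep the kernel defaults.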