Re: Cyrus 2.5.9 imapd children and tcp_keepalive & idle socket questions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 09/28/2016 09:59 AM, Andy Dorman wrote:
On 09/27/2016 07:08 PM, ellie timoney via Info-cyrus wrote:
I wonder if we can see what the old processes are actually doing...

Is there any chance you're able to obtain a backtrace of these?

You can use a command something like:

      gdb -p pid /path/to/imapd

where pid is the one of the imapd process ids (which is the last column
in that ls output from /run/cyrus/proc/) and where the correct path to
wherever your imapd executable actually lives (I think this might be
"/usr/lib/cyrus/bin/imapd" on Debian).  You should run this command as
the same user that your cyrus server runs as (probably "cyrus").

Once you're in gdb, look for output like "Reading symbols from
/path/to/imapd...".  If this line also contains "no debugging symbols
found", then we can't get anything useful like this, just "quit" -> "y"
to exit gdb.  But if it does find debugging symbols, great: run "bt" to
obtain a backtrace, then "quit" -> "y" to exit, and send through the
output from the session (probably best to paste the output into a text
file and then attach that to your email).

Another possibility is to trace the system calls.  You can use a command
something like this (also as the "cyrus" user):

    strace -p pid

Do not blindly email the output from this command, it may contain
sensitive/private content!

Let this run for 10 minutes or so, then press ctrl-c to stop it, and
then I guess kind of try to describe the output, and we can go from
there.  If it's very noisy, you've caught it in the middle of doing
something. Wait a little while and try again, or try a different imapd
pid: we're trying to see what the inactive ones are waiting for.

Cheers,

ellie

Ellie, thanks for doing the above "extended debug" version of a reply.
While not a total "nube" to debugging compiled processes, I have not
done it for a long time and never with Linux, so I am unfamiliar with
the tools.  Your words are a great help.

I installed and ran gdb on one of our "problem" servers against an
"idled" pid, 16168, and as you suspected, I found this.

Reading symbols from /usr/lib/cyrus/bin/imapd...(no debugging symbols
found)...done.

FYI, I confirmed this pid is indeed Idle according to this line in
/run/cyrus/proc/16168

imap    gertrude.ironicdesign.com [192.168.0.16]    donations@xxxxxxxxx
cogift.co!user.donations    Idle

So I have installed and started strace against idled pid 16168 and am
collecting the output to a file.

sudo -u cyrus strace -p 16168 -t -o cyrusIdled

16168 is a process that was started about two hours ago (before I saw
your email) after I restarted cyrus to clear the old processes before we
reached the max. I added -t to get a time stamp on each line in case I
need to correlate anything weird with what is going on with the
Blackberry (like perhaps it is unable to connect as the owner drives
around?)

I am tailing the log and so far I am seeing a 60-second cycle with
opening three files (donations.lock, cyrus.index, and cyrus.header) and
then ending with closing all three.

Below is the strace output of a typical cycle I am seeing.

Please note the first and last lines in this output snippet.  As you can
see by the time stamps, the first line is written at the end of the
previous 1 minute cycle, but what you can't see here is that it is
written without the "= 0 (Timeout)".  The next 1 min cycle begins by
adding the "= 0 (Timeout)" to the end of that line and finishes with the
same line again without the "= 0 (Timeout)".

So it looks like the pselect call is timing out after 60 seconds and the
cycle starts over.  I have no idea if this is normal.  I may install
strace on our dev server and see what the strace output looks like with
a normal IMAP client like Thunderbird.

09:39:08 pselect6(16, [0 15], NULL, NULL, {60, 0}, {[], 8}) = 0 (Timeout)
09:40:08 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:40:08 gettimeofday({1475073608, 554369}, NULL) = 0
09:40:08
open("/run/cyrus/lock/domain/W/cogift.co/U/user/donations.lock",
O_RDWR|O_CREAT|O_TRUNC, 0666) = 19
09:40:08 fcntl(19, F_SETLKW, {l_type=F_RDLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08 fcntl(12, F_SETLKW, {l_type=F_RDLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08 fstat(12, {st_mode=S_IFREG|0600, st_size=218344, ...}) = 0
09:40:08 stat("/var/lib/cyrus/mailboxes.db", {st_mode=S_IFREG|0600,
st_size=218344, ...}) = 0
09:40:08 fcntl(12, F_SETLKW, {l_type=F_UNLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08
open("/var/spool/cyrus/mail/domain/W/cogift.co/U/user/donations/cyrus.index",
O_RDWR) = 20
09:40:08 fstat(20, {st_mode=S_IFREG|0600, st_size=1688, ...}) = 0
09:40:08 mmap(NULL, 16384, PROT_READ, MAP_SHARED, 20, 0) = 0x7f05ab7a9000
09:40:08 fcntl(20, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08 gettimeofday({1475073608, 555550}, NULL) = 0
09:40:08
stat("/var/spool/cyrus/mail/domain/W/cogift.co/U/user/donations/cyrus.index",
{st_mode=S_IFREG|0600, st_size=1688, ...}) = 0
09:40:08
open("/var/spool/cyrus/mail/domain/W/cogift.co/U/user/donations/cyrus.header",
O_RDONLY) = 21
09:40:08 fstat(21, {st_mode=S_IFREG|0600, st_size=207, ...}) = 0
09:40:08 mmap(NULL, 207, PROT_READ, MAP_SHARED, 21, 0) = 0x7f05ab7a8000
09:40:08 munmap(0x7f05ab7a8000, 207)    = 0
09:40:08 fcntl(20, F_SETLKW, {l_type=F_UNLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08 gettimeofday({1475073608, 556060}, NULL) = 0
09:40:08 open("/var/lib/cyrus/msg/shutdown", O_RDONLY) = -1 ENOENT (No
such file or directory)
09:40:08 fcntl(14, F_SETLKW, {l_type=F_RDLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08 fstat(14, {st_mode=S_IFREG|0600, st_size=336, ...}) = 0
09:40:08 stat("/var/lib/cyrus/user_deny.db", {st_mode=S_IFREG|0600,
st_size=336, ...}) = 0
09:40:08 fcntl(14, F_SETLKW, {l_type=F_UNLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08 fcntl(20, F_SETLKW, {l_type=F_RDLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08 gettimeofday({1475073608, 556577}, NULL) = 0
09:40:08
stat("/var/spool/cyrus/mail/domain/W/cogift.co/U/user/donations/cyrus.index",
{st_mode=S_IFREG|0600, st_size=1688, ...}) = 0
09:40:08 close(21)                      = 0
09:40:08
open("/var/spool/cyrus/mail/domain/W/cogift.co/U/user/donations/cyrus.header",
O_RDONLY) = 21
09:40:08 fstat(21, {st_mode=S_IFREG|0600, st_size=207, ...}) = 0
09:40:08 mmap(NULL, 207, PROT_READ, MAP_SHARED, 21, 0) = 0x7f05ab7a8000
09:40:08 munmap(0x7f05ab7a8000, 207)    = 0
09:40:08 fcntl(20, F_SETLKW, {l_type=F_UNLCK, l_whence=SEEK_SET,
l_start=0, l_len=0}) = 0
09:40:08 gettimeofday({1475073608, 557191}, NULL) = 0
09:40:08 close(21)                      = 0
09:40:08 close(20)                      = 0
09:40:08 munmap(0x7f05ab7a9000, 16384)  = 0
09:40:08 close(19)                      = 0
09:40:08 rt_sigprocmask(SIG_BLOCK, [INT QUIT ALRM TERM CHLD], [], 8) = 0
09:40:08 pselect6(16, [0 15], NULL, NULL, {60, 0}, {[], 8}

I have no idea if the pselect timeout is important, but I wanted to
point out that behavior since it is not apparent from just looking at
the output after the fact.

I am going to let this run all day and will scan it for anything
different from this 60-second cycle and will post results back to the
list around this time tomorrow.

Thanks again for all your help.


OK, after almost 24 hours, donations@xxxxxxxxxx had 56 Idle imap processes. The original pid #16168 I watched above continued to do this 60-second cycle the entire 24 hours. I picked some of the other 56 pids at random and watched them for a bit and they were all doing the same 60-second cycle.

It is probably obvious to others on this list, but for a comparison I watched a pid for a different low volume account (maybe 5 emails/day) but with a Thunderbird client.

When I started the trace the proc file showed the imap process as Idle just like the problem one above, but the trace looked completely different...no timeout on the pselect6 call (until the end), and no fixed 60-second loop. And perhaps more importantly, after about two hours the first pselect6 call got a timeout and the pid turned itself off like this.

09:24:09 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:24:09 rt_sigprocmask(SIG_BLOCK, [INT QUIT ALRM TERM CHLD], [], 8) = 0
09:24:09 pselect6(1, [0], NULL, NULL, {0, 0}, {[], 8}) = 0 (Timeout)
09:24:09 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
09:24:09 sendto(6, "<20>Sep 29 09:24:09 cyrus/imap[7"..., 75, MSG_NOSIGNAL, NULL, 0) = 75 09:24:09 write(1, "\322Rp\212t\205T\254i\371E\n%\371\371\300Z8/\235\227\v\0\0\0\377\377", 27) = 27
09:24:09 unlink("/run/cyrus/proc/7426") = 0
09:24:09 munmap(0x7f56c3c5e000, 135168) = 0
09:24:09 munmap(0x7f56c3c7f000, 135168) = 0
09:24:09 open("/dev/null", O_RDWR)      = 16
09:24:09 shutdown(0, SHUT_RD)           = 0
09:24:09 dup2(16, 0)                    = 0
09:24:09 shutdown(1, SHUT_RD)           = 0
09:24:09 dup2(16, 1)                    = 1
09:24:09 shutdown(2, SHUT_RD)           = 0
09:24:09 dup2(16, 2)                    = 2
09:24:09 close(16)                      = 0
09:24:09 write(3, "\1\0\0\0\2\35\0\0", 8) = 8
09:24:09 rt_sigaction(SIGALRM, {0x7f56c284db80, [], SA_RESTORER|SA_RESETHAND|SA_SIGINFO, 0x7f56c307eeb0}, NULL, 8) = 0 09:24:09 rt_sigaction(SIGQUIT, {0x7f56c284db80, [], SA_RESTORER|SA_RESETHAND|SA_SIGINFO, 0x7f56c307eeb0}, NULL, 8) = 0 09:24:09 rt_sigaction(SIGINT, {0x7f56c284db80, [], SA_RESTORER|SA_RESETHAND|SA_SIGINFO, 0x7f56c307eeb0}, NULL, 8) = 0 09:24:09 rt_sigaction(SIGTERM, {0x7f56c284db80, [], SA_RESTORER|SA_RESETHAND|SA_SIGINFO, 0x7f56c307eeb0}, NULL, 8) = 0 09:24:09 rt_sigaction(SIGHUP, {0x7f56c284db80, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7f56c307eeb0}, NULL, 8) = 0
09:24:09 alarm(109)                     = 0
09:24:09 fcntl(15, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
09:25:58 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
09:25:58 rt_sigreturn({mask=[]})        = -1 EINTR (Interrupted system call)
09:25:58 close(10)                      = 0
09:25:58 munmap(0x7f56c3ca8000, 40960)  = 0
09:25:58 close(11)                      = 0
09:25:58 munmap(0x7f56c3cc8000, 24576)  = 0
09:25:58 close(12)                      = 0
09:25:58 munmap(0x7f56c3ca4000, 16384)  = 0
09:25:58 close(14)                      = 0
09:25:58 munmap(0x7f56c3ca0000, 16384)  = 0
09:25:58 close(13)                      = 0
09:25:58 unlink("/var/lib/cyrus/socket/idle.7426") = 0
09:25:58 shutdown(0, SHUT_RD) = -1 ENOTSOCK (Socket operation on non-socket)
09:25:58 close(0)                       = 0
09:25:58 shutdown(1, SHUT_RD) = -1 ENOTSOCK (Socket operation on non-socket)
09:25:58 close(1)                       = 0
09:25:58 shutdown(2, SHUT_RD) = -1 ENOTSOCK (Socket operation on non-socket)
09:25:58 close(2)                       = 0
09:25:58 exit_group(0)                  = ?
09:25:58 +++ exited with 0 +++

Just FYI, my tcp_keepalive settings are here:

tcp_keepalive: 1
tcp_keepalive_idle: 900
tcp_keepalive_cnt: 5
tcp_keepalive_intvl: 75

So I expect at this point my options are:

- Tell this user to NOT use their BlackBerry
OR
- plan to restart cyrus-imap every 24 hours on the 3 servers that the Blackberry is checking until this user's Blackberry dies or Cyrus 2.5.x is updated to deal with it (whatever "it" is).

If it will help I am still willing to set up wireshark to collect port 143 traffic from the BB to see if we can see what the BB is doing (or not doing) to cause this.

--
Andy Dorman
Ironic Design, Inc.
AnteSpam.com

----
Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
To Unsubscribe:
https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus



[Index of Archives]     [Cyrus SASL]     [Squirrel Mail]     [Asterisk PBX]     [Video For Linux]     [Photo]     [Yosemite News]     [gtk]     [KDE]     [Gimp on Windows]     [Steve's Art]

  Powered by Linux