Follow-up:
The botnet is still hammering away, checking those old accounts, but
the bottleneck appears to have been the saslauthd threads. Doubling the
thread count from 5 to 10 has resolved the problem for now. (It might
even explain the occasional slow IMAP responses I've observed.) I
will run more experiments to see just how high the thread count should
be, and I have a list of other optimizations to try. I'm posting this
note in case somebody else sees the same problem.
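
For anyone sizing their own pool, here is a rough back-of-the-envelope
sketch in Python of why a small saslauthd pool can become the choke
point. The only figures taken from this thread are the thread counts
(5 and 10); the per-authentication latency and the arrival rate are
made-up illustrative values, and the function names are hypothetical.

```python
def max_auth_rate(threads: int, seconds_per_auth: float) -> float:
    """Upper bound on authentications per second for a fixed worker pool.

    Each worker handles one request at a time, so the pool as a whole
    cannot complete more than threads / seconds_per_auth requests/second.
    """
    if threads <= 0 or seconds_per_auth <= 0:
        raise ValueError("threads and seconds_per_auth must be positive")
    return threads / seconds_per_auth


def backlog_after(arrivals_per_sec: float, threads: int,
                  seconds_per_auth: float, elapsed_sec: float) -> float:
    """Requests still waiting after elapsed_sec of steady arrivals.

    Simplified model: constant arrival rate, constant service time,
    nothing times out. Real traffic is burstier than this.
    """
    capacity = max_auth_rate(threads, seconds_per_auth)
    return max(0.0, (arrivals_per_sec - capacity) * elapsed_sec)


if __name__ == "__main__":
    # ASSUMPTIONS (not measured): 0.5 s per PAM/krb5 round trip,
    # 15 login attempts per second arriving at one front-end.
    for threads in (5, 10):
        waiting = backlog_after(15, threads, 0.5, elapsed_sec=15 * 60)
        print(f"{threads} threads: {waiting:.0f} requests queued after 15 min")
```

The point is only that once arrivals exceed threads / latency, the
queue grows without bound, and adding threads moves that threshold.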
Mike
On 09/16/2017 07:41 AM, Michael D. Sofka wrote:
I'm seeking help from the collective wisdom of the Cyrus world.
In the past two days we have seen first a doubling, and then more than
a quadrupling, of badlogins to Cyrus. These appear to be coming from a
botnet, in that the source IPs are spread out in a way that evades
fail2ban.
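
One way to check how widely the attempts are spread is to count
badlogin lines per source IP. The sketch below is Python and makes
assumptions: the log line layout matched by the regular expression
("badlogin: [host] [ip] mechanism user ...") is what I would expect
from Cyrus but should be checked against your own syslog, and the
function names are hypothetical. It also ignores fail2ban's time
window and only compares totals against a maxretry value.

```python
import re
from collections import Counter
from typing import Iterable

# ASSUMED log layout, e.g.
#   ... imaps[2211]: badlogin: host.example.net [192.0.2.10] plaintext user SASL(-13): ...
# The hostname before the bracketed IP is treated as optional.
BADLOGIN_RE = re.compile(
    r"badlogin: (?:\S+ )?\[(?P<ip>[^\]]+)\] (?P<mech>\S+) (?P<user>\S+)"
)


def count_badlogins_by_ip(lines: Iterable[str]) -> Counter:
    """Return a Counter mapping source IP -> number of badlogin lines."""
    counts: Counter = Counter()
    for line in lines:
        match = BADLOGIN_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


def share_under_maxretry(counts: Counter, maxretry: int) -> float:
    """Fraction of all failures that come from IPs with fewer than
    maxretry failures, i.e. IPs a per-address ban would never catch."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    under = sum(n for n in counts.values() if n < maxretry)
    return under / total


if __name__ == "__main__":
    import sys
    per_ip = count_badlogins_by_ip(sys.stdin)
    print(f"{len(per_ip)} distinct IPs, {sum(per_ip.values())} badlogins")
    print(f"share from IPs under maxretry=5: {share_under_maxretry(per_ip, 5):.1%}")
```

Run it by piping the relevant syslog file into the script on stdin. A
share close to 100% would be consistent with a distributed attack that
per-address banning cannot stop.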
It got so bad Friday afternoon that we took the extraordinary step of
blocking off-campus connections to IMAP (email can still be read via
Webmail and the VPN).
The symptoms are that connections grow and grow until authentication
slows, which holds connections open longer and longer. It takes about
15 minutes for the connection count to reach the point at which
service is interrupted. Friday night an attempt was made to re-enable
off-campus IMAP; the bots were still at it, and service was again
disrupted.
But the number of connections does not appear to be close to the
maximum permitted by Cyrus.
We have a Murder cluster: three front-end servers, two back-end
servers, and two replication servers.
The front-end servers are Ubuntu 14.04, Cyrus 2.4.17. The back-end and
replication servers are Ubuntu 16.04, Cyrus 2.4.18. (Upgrading the
front-ends is on the short list.)
Authentication is via saslauthd, configured to use PAM, which in turn
uses krb5. Kerberos runs on three different Kerberos servers. Load on
the Kerberos servers is light, and the kerb-admin says they are nowhere
close to saturated. In fact, Kerberos handled much higher numbers of
authentications before imapproxy was deployed on the Webmail service.
(That was years ago, on the previous kerb servers, so there is still
the possibility that the Kerberos servers are somehow slowed....)
Each front-end server is configured for 5000 imapd processes on port
143 and 5000 on port 993. Netstat shows about 4,000-5,000 imap
connections per front-end server when authentication slows. There are
well under 5000 imapd processes of either type. And after the Friday
evening test re-allowing off-campus IMAP, the network admin reported
about 1600 connections to port 993 in total as IMAP authentication
slowed to a crawl.
We are not close to file-max on any of the servers.
imapd.conf has a 10 second delay for a badlogin.
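
As a rough illustration of how such a delay interacts with a flood of
failures, Little's law says the number of connections held open is
about the arrival rate times the time each one is held. The Python
sketch below uses the 10 second delay from our configuration; the
badlogin rates and the authentication time are invented for
illustration, and the function name is hypothetical.

```python
def held_connections(bad_logins_per_sec: float,
                     delay_sec: float,
                     auth_sec: float = 0.0) -> float:
    """Little's law estimate: connections in flight = rate * hold time.

    Hold time here is the time spent waiting on authentication plus the
    configured pause after a failed login. Steady state is assumed.
    """
    if bad_logins_per_sec < 0 or delay_sec < 0 or auth_sec < 0:
        raise ValueError("rates and times must be non-negative")
    return bad_logins_per_sec * (delay_sec + auth_sec)


if __name__ == "__main__":
    # ASSUMPTION: these rates are illustrative, not measured.
    for rate in (10, 50, 100):
        fast = held_connections(rate, delay_sec=10, auth_sec=0.5)
        slow = held_connections(rate, delay_sec=10, auth_sec=30)
        print(f"{rate}/s: {fast:.0f} held with fast auth, "
              f"{slow:.0f} held when auth takes 30 s")
```

If authentication itself backs up, the hold time grows and so does the
number of imapd processes tied up, which is at least consistent with
the connection growth described above.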
There are some mupdate log entries:
    Thread timed out waiting for listener_lock
    Worker thread finished, for a total of 3 (2 spare)
These appeared around the time of the Friday afternoon problems, when I
was restarting front-end servers to recover. There have been no mupdate
log entries since. What do they mean? There are also entries in syslog
when mupdate is restarted, stating that it could not reset the file
limit to 5k. mupdate_connections_max is 1024, so the failure to reset
should have no effect, unless that is itself the limitation. But I see
no log entries indicating that.
Are there any other resources or limits, in either Cyrus or Linux
(Debian), that I should look at?
Thank you in advance for any help.
Mike
--
Michael D. Sofka sofkam@xxxxxxx
ITI Sr. Systems Programmer, Email, TeX, Epistemology
Rensselaer Polytechnic Institute, Troy, NY. http://www.rpi.edu/~sofkam/