Re: Bad logins bogging down server

Michael Sofka <sofkam@xxxxxxx> · Tue, 19 Sep 2017 09:52:06 -0400

Follow up....

The botnet is still hammering away, checking those old accounts.  But 
the bottleneck appears to have been saslauthd threads.  Doubling the 
thread count from 5 to 10 has resolved the problem for now.  (It might 
even explain the occasional slow IMAP responses I've observed.)  I 
will run more experiments to see just how high the thread count should 
be, and I've got a list of other optimizations to try.  I'm posting this 
note in case somebody else sees the same problem.
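
To illustrate why a small worker pool can become the bottleneck, here is
a minimal sketch of a fluid queue model. It assumes every authentication
takes a fixed time and that requests arrive at a steady rate; both
numbers below are hypothetical, not measurements. The thread counts 5
and 10 are the ones from this note (the pool size is what the saslauthd
-n option controls, as far as I know).

```python
def backlog_over_time(arrivals_per_sec, auth_seconds, threads, duration_sec):
    """Return the number of auth requests waiting at the end of each second.

    Fluid model: each worker finishes 1/auth_seconds requests per second,
    so the pool as a whole can serve threads/auth_seconds per second.
    """
    capacity = threads / auth_seconds
    backlog = 0.0
    history = []
    for _ in range(duration_sec):
        # Backlog grows by the excess of arrivals over capacity, never below 0.
        backlog = max(0.0, backlog + arrivals_per_sec - capacity)
        history.append(backlog)
    return history


# Hypothetical load: assumptions for illustration only.
RATE = 30.0      # auth requests per second, good and bad logins combined
LATENCY = 0.25   # seconds per PAM/krb5 check

for threads in (5, 10):
    waiting = backlog_over_time(RATE, LATENCY, threads, 15 * 60)
    print(f"{threads} threads: capacity {threads / LATENCY:.0f}/s, "
          f"{waiting[-1]:.0f} requests waiting after 15 minutes")
```

With these made-up numbers the 5-thread pool falls behind steadily while
the 10-thread pool keeps up, which matches the shape of the symptom: the
pool does not fail outright, it just queues until clients time out.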

Mike

On 09/16/2017 07:41 AM, Michael D. Sofka wrote:
I'm seeking help from the collective wisdom of the Cyrus world.

In the past two days we have seen first a doubling, and then a 
quadrupling+ of badlogins to Cyrus.  These appear to be coming from a 
botnet, in that the IPs are spread around in a way that evades fail2ban. 
  It got so bad Friday afternoon, that we took the extraordinary step of 
blocking off-campus connections to IMAP (email can still be read via 
Webmail and the VPN).
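
A rough way to confirm that the attempts are spread too thinly for a
per-IP threshold (the kind fail2ban applies) is to count badlogins per
source address. The sketch below assumes the log lines look like
"badlogin: host [ip] plaintext user SASL(-13): ..."; that shape, the
sample lines, and the helper names are assumptions for illustration, and
the addresses come from the documentation ranges.

```python
import re
from collections import Counter

# Assumed log shape: "... badlogin: host [203.0.113.7] plaintext user ..."
BADLOGIN_IP = re.compile(r"badlogin: .*?\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def badlogins_by_ip(lines):
    """Count badlogin lines per source IPv4 address."""
    counts = Counter()
    for line in lines:
        match = BADLOGIN_IP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize(counts, threshold):
    """Report how many attempts came from IPs at or above the threshold."""
    return {
        "total": sum(counts.values()),
        "distinct_ips": len(counts),
        "blocked_attempts": sum(n for n in counts.values() if n >= threshold),
    }


# Made-up sample lines, not taken from real logs.
SAMPLE_LOG = [
    "imaps[2101]: badlogin: a.example [192.0.2.10] plaintext old1 SASL(-13): authentication failure",
    "imaps[2102]: badlogin: b.example [192.0.2.77] plaintext old2 SASL(-13): authentication failure",
    "imaps[2103]: badlogin: [198.51.100.4] plaintext old3 SASL(-13): authentication failure",
    "imaps[2104]: badlogin: c.example [203.0.113.9] plaintext old4 SASL(-13): authentication failure",
    "imaps[2105]: badlogin: d.example [203.0.113.200] plaintext old5 SASL(-13): authentication failure",
    "imaps[2106]: badlogin: a.example [192.0.2.10] plaintext old6 SASL(-13): authentication failure",
]

counts = badlogins_by_ip(SAMPLE_LOG)
print(summarize(counts, threshold=3))
```

If "blocked_attempts" stays near zero while "total" is large, the load
is coming from many addresses that each stay under the threshold.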

The symptoms are that connections grow, and grow and grow until 
authentication slows, holding open connections longer and longer.  It 
takes about 15 minutes for the connection count to reach the point at 
which service is interrupted.  Friday night an attempt was made to 
re-enable off-campus IMAP, but the bots were still at it and service was 
again disrupted.

But the number of connections does not appear close to max permitted by 
Cyrus.

We have a Murder cluster:  Three front-end servers, Two back-end 
servers, Two replication servers.
The front-end servers are Ubuntu 14.04, Cyrus 2.4.17.  The back-end and 
replication servers are Ubuntu 16.04, Cyrus 2.4.18.  (Upgrading 
front-ends on the short list.)

Authentication is via saslauthd, configured to use PAM, which is using 
krb5.  Kerberos is running on three different kerberos servers. Load on 
the kerberos servers is light, and the kerb-admin says nowhere close to 
saturated.  In fact, it handled much higher numbers of authentications 
before imapproxy on the Webmail service. (That was years ago, previous 
kerb servers, so there is still the possibility the kerberos servers are 
somehow slowed....)

Each Front-end server is configured for 5000 imapd on 143, and 5000 on 
port 993.  Netstat shows about 4-5,000 imap connections per front-end 
server when authentication slows.  There are well under 5000 imapd 
processes of either type.  And after the Friday evening test re-allowing 
off-campus IMAP, the network admin reported about 1600 connections to 
port 993 total as IMAP authentication slowed to a crawl.
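
For anyone wanting to reproduce the connection counts, here is a small
sketch that tallies connections per local port and TCP state from the
text of "netstat -tn". The sample output is made up, and the function
name is hypothetical; feed it the real command output to get live
numbers.

```python
from collections import Counter


def connections_by_port(netstat_output, ports=(143, 993)):
    """Count TCP connections per (local port, state) from `netstat -tn` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: tcp 0 0 local:port remote:port STATE
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        port_text = local.rsplit(":", 1)[-1]
        if port_text.isdigit() and int(port_text) in ports:
            counts[(int(port_text), state)] += 1
    return counts


# Made-up sample in the usual `netstat -tn` layout.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.1:993           198.51.100.7:51514      ESTABLISHED
tcp        0      0 192.0.2.1:993           203.0.113.9:40022       ESTABLISHED
tcp        0      0 192.0.2.1:143           198.51.100.8:33100      TIME_WAIT
tcp        0      0 192.0.2.1:22            198.51.100.9:50000      ESTABLISHED
tcp6       0      0 2001:db8::1:993         2001:db8::99:41000      ESTABLISHED
"""

for (port, state), n in sorted(connections_by_port(SAMPLE).items()):
    print(port, state, n)
```

Splitting by state helps separate connections that are actually held
open from ones that are merely lingering in TIME_WAIT.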

We are not close to file-max on any of the servers.
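
A minimal sketch of that check, assuming the usual Linux layout of
/proc/sys/fs/file-nr (allocated handles, unused handles, maximum). The
sample values are hypothetical. Note that this is the system-wide
limit; the per-process open-file limit is separate and is not covered
here.

```python
def parse_file_nr(text):
    """Parse the three fields of /proc/sys/fs/file-nr into a dict."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return {
        "allocated": allocated,
        "unused": unused,
        "max": maximum,
        "used_fraction": allocated / maximum,
    }


def read_file_nr(path="/proc/sys/fs/file-nr"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_file_nr(handle.read())


# Hypothetical contents, not taken from our servers.
sample = parse_file_nr("12288\t0\t1048576\n")
print(f"{sample['allocated']} of {sample['max']} handles "
      f"({sample['used_fraction']:.1%})")
```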

imapd.conf has a 10 second delay for a badlogin.
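
That delay matters for the connection count: each failed login keeps its
connection open for the authentication time plus the pause. A
back-of-the-envelope sketch using Little's law (connections held =
arrival rate x time held) follows. The 10 second pause is from our
configuration (presumably the failedloginpause option); the arrival rate
and authentication times are assumptions for illustration.

```python
def held_connections(bad_logins_per_sec, auth_seconds, pause_seconds):
    """Little's law: average connections held = rate * time each is held."""
    return bad_logins_per_sec * (auth_seconds + pause_seconds)


PAUSE = 10.0   # seconds, the badlogin delay from imapd.conf
RATE = 100.0   # hypothetical bad logins per second per front-end

# Hypothetical authentication times, from healthy to badly queued.
for auth_seconds in (0.5, 5.0, 30.0):
    held = held_connections(RATE, auth_seconds, PAUSE)
    print(f"auth {auth_seconds:>4}s -> about {held:.0f} connections held")
```

The point of the sketch is only that slower authentication multiplies
the number of open connections at the same attack rate, so the growth
in connections can be a symptom of the slowdown rather than its cause.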

There are some mupdate log entries

    Thread timed out waiting for listener_lock
    Worker thread finished, for a total of 3 (2 spare)

These appeared around the time of the Friday afternoon problems, when I 
was restarting front-end servers to recover, and there have been no 
mupdate log entries since.  What do they mean?  There are entries in 
syslog when mupdate is restarted, stating that it could not reset the 
file limit to 5k.  mupdate_connections_max is 1024, so the failure to 
reset has no effect, unless that is the limitation.  But I see no log 
entries indicating that.

Any other resources or limits in either Cyrus or Linux (Debian) that I 
should look at?

Thank you in advance for any help.

Mike

--
Michael D. Sofka               sofkam@xxxxxxx
ITI Sr. Systems Programmer,   Email, TeX, Epistemology
Rensselaer Polytechnic Institute, Troy, NY.  http://www.rpi.edu/~sofkam/