long (~2.5 minute) delay in TLS handshake

Michael.Wojcik@xxxxxxxxxxxxxx (Michael Wojcik) · Mon, 30 Nov 2015 22:46:45 +0000

I'm curious if anyone has seen anything like this before.

We have a situation at one customer site. They see it happen every few days. No one else has reported it, and we can't reproduce it.

There's a Linux server, listening on multiple ports, handling lots of conversations (multiplexed with poll). Various protocols, some TLS, others not. Clients from many remote systems connect to this server. Some conversations are short-lived, others long-lived.

Four of the ports are handling Telnet (TN3270) traffic, over TLS.

Sometimes one of the ports stops responding to new conversations, from the client's point of view. Other clients continue to connect to other ports owned by the same server process; established conversations continue to work. After a while (maybe 15 minutes or so), the problem goes away. Note, again, that this only applies to new conversations on this one port. Everything else in the same process is happy.

A wire trace taken while the problem is occurring shows:

1. Client sends ClientHello; server stack ACKs it immediately.
2. A minute passes with no activity on the conversation.
3. Client gives up - we get a FIN from it. Server stack ACKs the FIN immediately.
4. Almost a minute and a half later (89 seconds in the case I'm looking at), the server happily sends the ServerHello. Well, that's a bit too late, and there's the usual crying and recriminations (RSTs from the client stack).

So nearly 2.5 minutes between ClientHello being received by the server machine's stack, and the ServerHello appearing on the wire. We know there's nothing generally wrong with the network or machines, and the processes in question are otherwise behaving normally.

ServerHello shows the server chose TLS_RSA_WITH_AES_256_CBC_SHA (TLS/1.0), so there's nothing screwy like computing DH parameters happening behind the covers. It's too early in the process for certificate validation callbacks to be invoked. Or for nearly anything else to be happening. All the server has is the ClientHello.

One thing I don't have at this point is a set of tracepoints I can have the customer enable to see if, say, SSL_accept is failing repeatedly with SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE (as reported by SSL_get_error). The socket should be in blocking mode, though it's possible there's some bug there.
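For the blocking-mode question, a tracepoint along these lines would settle it. This is only a sketch: fd_is_nonblocking and trace_blocking_mode are hypothetical names, not functions from our code, and the log destination is assumed to be an ordinary FILE stream.

```c
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical helper (not from our code base): returns 1 if the
 * descriptor has O_NONBLOCK set, 0 if it is blocking, -1 if fcntl fails. */
int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}

/* Hypothetical tracepoint: record the blocking mode of a descriptor,
 * e.g. the one just returned by accept(), before SSL_accept is called. */
void trace_blocking_mode(FILE *log, int fd)
{
    int nb = fd_is_nonblocking(fd);
    fprintf(log, "fd %d: %s\n", fd,
            nb < 0 ? "fcntl(F_GETFL) failed"
                   : nb ? "non-blocking" : "blocking");
}
```

Calling trace_blocking_mode(log, desc) right after accept would show whether the descriptor is in the mode we expect at the point SSL_accept runs.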

The logic here is not exotic. It's along the lines of:
        desc = accept(master, ...);
        ssl = SSL_new(ctx);
        SSL_set_fd(ssl, desc);
        SSL_accept(ssl);

There's some setting of socket options like SO_KEEPALIVE and ex_data so we can recover our info in the callbacks, but really it's all pretty standard.
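To narrow down where the 2.5 minutes go (before accept, between accept and SSL_accept, or inside SSL_accept), timing tracepoints around those calls might help. The following is a rough sketch under assumptions: elapsed_ms and trace_phase are hypothetical helpers, the log is a plain FILE stream, and the placement shown in the trailing comment is how I imagine it fitting the logic above, not code we have today.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: whole milliseconds elapsed from *start to *end. */
long long elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    long long ns = (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
                 + (long long)(end->tv_nsec - start->tv_nsec);
    return ns / 1000000LL;
}

/* Hypothetical tracepoint: log how long after *start a phase completed.
 * Uses the monotonic clock so wall-clock adjustments don't skew it. */
void trace_phase(FILE *log, const char *phase, int fd,
                 const struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    fprintf(log, "fd %d: %s after %lld ms\n",
            fd, phase, elapsed_ms(start, &now));
}

/*
 * Assumed placement (left as a comment because it needs the server's
 * own context and OpenSSL):
 *
 *     clock_gettime(CLOCK_MONOTONIC, &t0);    // poll reports master readable
 *     desc = accept(master, ...);
 *     trace_phase(log, "accept returned", desc, &t0);
 *     ssl = SSL_new(ctx);
 *     SSL_set_fd(ssl, desc);
 *     rc = SSL_accept(ssl);
 *     trace_phase(log, "SSL_accept returned", desc, &t0);
 */
```

If rc is not positive, logging SSL_get_error(ssl, rc) at the same point would also answer the WANT_READ / WANT_WRITE question.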

Any ideas?

--
Michael Wojcik
Technology Specialist, Micro Focus