----- Original Message -----
> From: "Wiebe Cazemier" <wiebe@xxxxxxxxxxxx>
> To: openssl-users@xxxxxxxxxxx
> Sent: Thursday, 23 May, 2024 12:22:31
> Subject: Blocking on a non-blocking socket?
>
> Hi List,
>
> I have a very obscure problem with an application using O_NONBLOCK still
> blocking. Over the course of a year of running with hundreds of thousands of
> clients, it has happened twice over the last month that a worker thread froze.
> It's a long story, but I'm pretty sure it's not a deadlock or spinning event
> loop or something, primarily because the application recovers after about 20
> minutes with a client errorring out with ETIMEDOUT. Coincidentally, that 20
> minutes matches the timeout description of the tcp man page [1].
>
> It really looks like a non-blocking socket is still blocking. I found something
> with a similar problem ([2]), but what they think of SSL_MODE_AUTO_RETRY does
> not match the documentation.
>
> So, is there indeed any way an application that has SSL_MODE_AUTO_RETRY on
> (which is default since 1.1.1) can block? Looking at the source code, I don't
> see any calls to fcntl() that removes the O_NONBLOCK.
>
> My IO method is SSL_read() and SSL_write() with an SSL object given to
> SSL_set_fd().
>
> The only SSL modes I change from the default is that I set
> SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER.
>
> There are two primary deployments of this application, one with OpenSSL 1.1.1
> and one with 3.0.0. Only 1.1.1 has shown this problem, but it may be a
> coincidence.
>
> Side question, is it a problem to set SSL_set_fd() before using fcntl to set the
> fd to O_NONBLOCK? I ask, because the docs say "The BIO and hence the SSL engine
> inherit the behaviour of fd. If fd is non-blocking, the ssl will also have
> non-blocking behaviour.". The 'inherit' may be a key word here; not sure when
> it's done.
>
> Regards,
>
> Wiebe Cazemier

As a follow-up, the fault did turn out to be my own... As I imagine [1] is.
They describe SSL_MODE_AUTO_RETRY as something that 'attempts to renegotiate a broken SSL connection', but all SSL_MODE_AUTO_RETRY really does is keep reading records within a single read call, without returning to the caller in between.

Despite what I thought before, my code did have an unfortunate edge case: a while loop that kept spinning on SSL_write() when there was no room in the socket's send buffer. That would eventually fail with ETIMEDOUT.

Well, it was educational at least...

[1] https://github.com/alanxz/rabbitmq-c/issues/586
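
For anyone hitting the same thing, here is a minimal sketch of the pattern that avoids the spin. It is not the code from my application, and to stay self-contained it uses plain write() on a non-blocking fd instead of SSL_write(). The names try_flush(), set_nonblocking() and enum flush_result are hypothetical, made up for this example. With OpenSSL the analogous "would block" condition is SSL_write() returning <= 0 while SSL_get_error() reports SSL_ERROR_WANT_WRITE (or SSL_ERROR_WANT_READ); the point is the same in both cases: remember the offset, go back to the event loop, and retry only when the fd is reported writable.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum flush_result { FLUSH_DONE, FLUSH_WOULD_BLOCK, FLUSH_ERROR };

/* Put fd into non-blocking mode; returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Write as much of buf[*off .. len) as the non-blocking fd accepts right now.
 * On FLUSH_WOULD_BLOCK the caller must NOT call again immediately; it should
 * register interest in writability (poll/epoll) and retry from *off later.
 */
enum flush_result try_flush(int fd, const char *buf, size_t len, size_t *off)
{
    assert(buf != NULL && off != NULL && *off <= len);

    while (*off < len) {
        ssize_t n = write(fd, buf + *off, len - *off);
        if (n > 0) {
            *off += (size_t)n;          /* progress: keep going */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted by a signal: retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return FLUSH_WOULD_BLOCK;   /* no room: hand control back */
        return FLUSH_ERROR;             /* real error, or unexpected 0 */
    }
    return FLUSH_DONE;
}
```

The broken variant is the one that treats FLUSH_WOULD_BLOCK as "try again right now" inside a while loop. That pins the worker thread until the peer drains data or the connection times out, which from the outside looks exactly like a blocking socket.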