Restarting ocserv doesn't clean up all workers

nmav at gnutls.org (Nikos Mavrogiannopoulos) · Sun, 05 Oct 2014 22:34:40 +0200

On Mon, 2014-10-06 at 00:43 +0800, Niels Peen wrote:
> > On 05 Oct 2014, at 03:17, Nikos Mavrogiannopoulos <nmav at gnutls.org> wrote:
> > 
> > So, if I understand correctly, there was a user connection at some
> > point, which got stuck?
> 
> Yes. As far as I can tell these are worker processes that handle a user's connection. At some point the user disconnects (or loses signal - many of the disconnects are unintentional) and the worker doesn't get killed. Looking at today's log, it happens to about 1 in 400 workers.
> 
> > There are numerous places where this could occur. Would it be possible
> > to run:
> > $ gdb /usr/sbin/ocserv 21306
> > (gdb) bt full
> Hope this helps:

It does, thank you. It seems we are in the case described in the
select(2) man page:
'Under Linux, select() may report a socket file descriptor as "ready for
reading", while nevertheless a subsequent read blocks.  This
could for example happen when data has arrived but upon examination
has wrong checksum and is discarded.  There may be other
circumstances in which a file descriptor is spuriously reported as
ready.  Thus it may be safer to use O_NONBLOCK on sockets that should
not block.'

So if the client disconnected and a packet with a wrong checksum is
received, the subsequent read blocks, because ocserv relied on select()
alone to check for data. I've modified ocserv in master to use
non-blocking sockets to avoid that. It seems to work fine in my setup,
but I'd like to have more testing prior to a release.
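
For anyone who wants to see the idea in isolation, here is a minimal
sketch of the non-blocking approach. It is not the actual ocserv change;
the function names (set_nonblocking, read_if_ready, spurious_wakeup_demo)
and the -2 return code are made up for illustration. The demo reads from
an empty socketpair, which stands in for the case where select() reported
the descriptor as ready but no data is actually there.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode.
 * Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: read from fd after select() reported it ready.
 * Returns bytes read, 0 on EOF, -1 on a real error, and -2 when there
 * was nothing to read after all (the caller goes back to select()). */
static ssize_t read_if_ready(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = recv(fd, buf, len, 0);
    } while (n == -1 && errno == EINTR);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2; /* spurious readiness: no data, but we did not block */
    return n;
}

/* Demo: read from a non-blocking socket that has no pending data.
 * Returns 1 if the read came back immediately with "no data",
 * 0 if it returned something else, -1 on setup failure. */
int spurious_wakeup_demo(void)
{
    int sv[2];
    char buf[16];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (set_nonblocking(sv[0]) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Nothing was written to sv[1], so a blocking recv() would hang here. */
    n = read_if_ready(sv[0], buf, sizeof(buf));

    close(sv[0]);
    close(sv[1]);
    return n == -2 ? 1 : 0;
}
```

With a blocking socket the recv() call in the demo would never return,
which is the stuck-worker situation described above. With O_NONBLOCK set
the call fails with EAGAIN/EWOULDBLOCK and the worker can return to its
select() loop and notice timeouts or a disconnect.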

regards,
Nikos