On Fri, 2014-09-26 at 11:30 +0100, David Woodhouse wrote:
> On Sun, 2014-09-21 at 02:00 +0200, Nikos Mavrogiannopoulos wrote:
> > On Sat, 2014-09-20 at 13:05 +0200, Niels Peen wrote:
> > > Another possible clue: I upgraded from ocserv 0.3 to 0.8 on September 15th. 0.3 has never caused this problem.
> >
> > I don't see much changes related to that issue from 0.3 to 0.8, but that
> > looks like a race issue in the kernel and could be caused by different
> > timings between calls. I've applied the suggested fix anyway as it looks
> > correct.
>
> It looks very very wrong to me. Linus had it right at
> https://lkml.org/lkml/2002/7/17/165
>
> The close() system call should *never* fail to close the file
> descriptor. And as Linus points out, your force_close() hack is very
> broken in a threaded environment.

That doesn't matter much for ocserv, as it does not use multiple
threads. It was added because it looked reasonable for other OSes, which
may not behave as Linux does.

> Niels seemed to suggest that the client had gone away, which implies
> that there's no ocserv thread servicing the tun device at all. Is that
> the case? Is the device in question still *up*? It probably shouldn't
> be. (You do have a separate tundev per client?)
>
> I'd like to see more information about the state of the system when this
> failure happens. Reproducing with dnsmasq seems like it would be hard
> because to hit the race condition you have to disconnect the VPN after
> sending a query but before receiving the reply. I suppose you could hack
> openconnect to disconnect after sending a DNS request :) Or use a
> different UDP sender on the server side.

I'd really love to solve that issue, as I also don't believe that
force_close() is responsible for the solution.

> (Has anyone been running VoIP over ocserv connections, btw? This talk of
> buffers and EAGAIN reminds me that we need to make sure we avoid
> excessive buffering. cf.
> http://git.infradead.org/users/dwmw2/openconnect.git/commitdiff/3444f811
> )

I am. One can tune it using the output-buffer parameter. Since you
brought that up, I suspect that this particular commit was responsible
for the asymmetry in upload/download (it was on an old thread, with
openconnect upload being 4 times slower than download). Unfortunately I
no longer have the hardware to verify that theory.

regards,
Nikos
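
For reference, the close() point quoted above can be sketched roughly as follows. This is only an illustration and not ocserv's actual force_close(); the name close_once() is made up here. It assumes the Linux behaviour, where the descriptor is released even if close() returns EINTR. Other OSes may leave the descriptor state unspecified in that case, which is presumably why a retry loop looked reasonable.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from ocserv: close fd exactly once.
 *
 * Assumption: Linux semantics, where the descriptor is already released
 * when close() returns -1 with errno == EINTR. Retrying in that case
 * could close an unrelated descriptor that another thread has just
 * opened under the same number.
 */
static int close_once(int fd)
{
    int ret = close(fd);

    if (ret == -1 && errno == EINTR) {
        /* The fd is gone on Linux; report success and do not retry. */
        return 0;
    }

    /* 0 on success, or -1 with errno set (e.g. EBADF, EIO). */
    return ret;
}
```

In a single-threaded server the retry is harmless in practice, but a wrapper like this avoids relying on that.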