On Sun, 2014-09-21 at 02:00 +0200, Nikos Mavrogiannopoulos wrote:
> On Sat, 2014-09-20 at 13:05 +0200, Niels Peen wrote:
> > Another possible clue: I upgraded from ocserv 0.3 to 0.8 on September 15th.
> > 0.3 has never caused this problem.
>
> I don't see much changes related to that issue from 0.3 to 0.8, but that
> looks like a race issue in the kernel and could be caused by different
> timings between calls. I've applied the suggested fix anyway as it looks
> correct.

It looks very very wrong to me. Linus had it right at
https://lkml.org/lkml/2002/7/17/165

The close() system call should *never* fail to close the file
descriptor. And as Linus points out, your force_close() hack is very
broken in a threaded environment.

On Linux, if you get an error from close() then it's only a warning that
the flush failed; the fd *is* still closed correctly. If that happens
and you loop in your force_close() function and try closing it again,
you might not get EBADF. You might just close a fd which was validly
opened by another thread and happened to have the same number as your
old one.

I think we should investigate this problem better, starting from the
reported symptoms.

When dnsmasq is receiving that EAGAIN error, it's almost certainly
because a network buffer is full, probably because ocserv is not
servicing the tun device quickly enough.

(Let's ignore for the moment that dnsmasq ought to be letting packets
drop when this happens; that's the whole point of using UDP. I'm
assuming that's the "work-around" which Simon provided?)

Niels seemed to suggest that the client had gone away, which implies
that there's no ocserv thread servicing the tun device at all. Is that
the case? Is the device in question still *up*? It probably shouldn't
be. (You do have a separate tundev per client?)

I'd like to see more information about the state of the system when this
failure happens.

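To make the close() point above concrete, here is a minimal sketch of
what I'd expect instead of a retry loop. The name close_once() is made
up for illustration; it is not code from ocserv or dnsmasq, and it
assumes the Linux behaviour described above, where the descriptor is
released even if close() reports an error.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from ocserv or dnsmasq).
 * Calls close() exactly once and never retries. On Linux the fd is
 * already released when close() returns an error, so a second close()
 * on the same number could hit a descriptor that another thread has
 * opened in the meantime. */
static int close_once(int fd)
{
    if (close(fd) == -1) {
        int saved = errno;  /* fprintf() may overwrite errno */
        fprintf(stderr, "close(%d): %s\n", fd, strerror(saved));
        errno = saved;      /* hand the original error back to the caller */
        return -1;
    }
    return 0;
}
```

The caller treats a failure as a warning to log and carries on; it does
not call close() on that number again.
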
Reproducing with dnsmasq seems like it would be hard because to hit the
race condition you have to disconnect the VPN after sending a query but
before receiving the reply. I suppose you could hack openconnect to
disconnect after sending a DNS request :)

Or use a different UDP sender on the server side.

(Has anyone been running VoIP over ocserv connections, btw? This talk of
buffers and EAGAIN reminds me that we need to make sure we avoid
excessive buffering. cf.
http://git.infradead.org/users/dwmw2/openconnect.git/commitdiff/3444f811 )

-- 
dwmw2