On 18/11/2011 19:14, Trond Myklebust wrote:
> On Fri, 2011-11-18 at 19:04 +0000, Andrew Cooper wrote:
>> On 18/11/11 18:52, Trond Myklebust wrote:
>>> On Fri, 2011-11-18 at 18:40 +0000, Andrew Cooper wrote:
>>>> Hello,
>>>>
>>>> As described originally in
>>>> http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
>>>> encountering a bug whereby the NFS session was unexpectedly timing
>>>> out.
>>>>
>>>> I believe I have found the source of the race condition causing the
>>>> timeout.
>>>>
>>>> Brief overview of the setup:
>>>> 10GiB network, NFS mounted using TCP. The problem reproduces with
>>>> multiple different NICs, with synchronous or asynchronous mounts,
>>>> and with soft and hard mounts. It reproduces on 2.6.32, and I am
>>>> currently trying to reproduce it with mainline. (I don't have
>>>> physical access to the servers, so installing things is not
>>>> fantastically easy.)
>>>>
>>>> In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write
>>>> data to the sock buffer using xs_sendpages().
>>>>
>>>> When the sock buffer is nearly full, we get an EAGAIN from
>>>> xs_sendpages(), which causes a break out of the loop. Lower down the
>>>> function, we switch on status, which causes us to call xs_nospace()
>>>> with the task.
>>>>
>>>> In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit of the socket,
>>>> and in the rare case where that bit is clear, we return 0 instead of
>>>> EAGAIN. This promptly overwrites status in xs_tcp_send_request().
>>>>
>>>> The result is that xs_tcp_release_xprt() finds a request which has
>>>> no error, but has not sent all of the bytes in its send buffer. It
>>>> cleans up by setting XPRT_CLOSE_WAIT, which causes
>>>> xprt_clear_locked() to queue xprt->task_cleanup, which closes the
>>>> TCP connection.
>>>>
>>>> Under normal operation, the TCP connection goes down and back up
>>>> without interruption to the NFS layer.
>>>> However, when the NFS server hangs in a half-closed state, the
>>>> client forces a RST of the TCP connection, leading to the timeout.
>>>>
>>>> I have tried a few naive fixes, such as changing the default return
>>>> value in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never
>>>> be returned), but this causes a kernel memory leak. Can someone with
>>>> a better understanding of these interactions than me have a look?
>>>> It seems that the if (test_bit()) test in xs_nospace() should have
>>>> an else clause.
>>> I fully agree with your analysis. The correct thing to do here is to
>>> always return either EAGAIN or ENOTCONN. Thank you very much for
>>> working this one out!
>>>
>>> Trond
>> Returning EAGAIN seems to cause a kernel memory leak, as the OOM
>> killer starts going after processes holding large amounts of LowMem.
> The EAGAIN should trigger a retry of the send.
>
>> Returning ENOTCONN causes the NFS session to complain about a timeout
>> in the logs, and in the case of a soft mount, gives an EIO to the
>> calling process.
> Correct. ENOTCONN means that the connection was lost.
>
>> From the looks of the TCP stream, and from the looks of some targeted
>> debugging, nothing is actually wrong, so the client should not be
>> trying to FIN the TCP connection. Is it possible that there is a more
>> sinister reason for SOCK_ASYNC_NOSPACE being clear?
> Normally, it means that we're out of the out-of-write-buffer condition
> that caused the socket to fail (i.e. the socket has made progress
> sending more data, so that we can now resume sending). Returning
> EAGAIN in that condition is correct.

Following my latest set of tests, I would have to say that we are in the
abnormal case. Removing the test_bit() check and always requeuing causes
the next call to xs_sendpages() to return with an ENOTCONN. I guess I
have some more debugging to do.
>
>> I can attempt to find which of the many calls that clear that bit is
>> actually causing the problem, but I have a feeling that is going to be
>> a little more tricky to narrow down.
>>
>> ~Andrew