Hi, We have a customer who reports occasional but reproducible hangs on our 3.0 based kernel. I managed to deduce that commit a9a6b52ee1baa865283a91eb8d443ee91adfca56 Author: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> Date: Fri Feb 22 14:57:57 2013 -0500 SUNRPC: Don't start the retransmission timer when out of socket space was to blame (it got into our kernel through -stable ... not sure why it deserved to be in -stable). Reverting that patch fixes the problem. However I don't fully understand why. Analysing the logs suggests that the NFS client had tried to write a request to the socket, failed as the transmit queue was temporarily full, and never got the call-back when the queue emptied. All other requests piled up behind this one and all traffic stopped. With the above commit reverted, the client will timeout and retry the transmit even if the call-back hasn't arrived. With the patch, it waits indefinitely. This suggests that there is a race in the SOCK_ASYNC_NOSPACE handling. The most obvious place for a race would be in xs_nospace() between if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) { and before xprt_wait_for_buffer_space(task, xs_nospace_callback); However xprt->transport_lock is held here, and xprt_write_space() will wait for the lock before calling rpc_wake_up_queued_task(), so the wakeup cannot happen before the task is put on the pending list. My only remaining theory is that the space is freed before SOCK_NOSPACE is set in xs_nospace(). I think this would mean that the callback never happens. In net/ipv4/tcp.c, in tcp_poll() there is code: if (sk_stream_is_writeable(sk)) { mask |= POLLOUT | POLLWRNORM; } else { /* send SIGIO later */ set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); /* Race breaker. If space is freed after * wspace test but before the flags are set, * IO signal will be lost. */ if (sk_stream_is_writeable(sk)) mask |= POLLOUT | POLLWRNORM; } which suggests there is room for a race, and seems to suggest that it is safest to test sk_stream_is_writeable() after setting SOCK_NOSPACE. This is what the sunrpc server side does for udp (svc_udp_has_wspace). For tcp it doesn't something which I don't really understand, so maybe I'm missing something. Any thoughts? I could try asking the customer to test with an extra check for sk_stream_is_writeable before calling xprt_wait_for_buffer_space(), but I wouldn't be surprised if they'd rather not given they have a working solution. Does someone understand this code enough to be confident that such a test would be correct? Thanks, NeilBrown
Attachment:
signature.asc
Description: PGP signature