xprt_wait_for_buffer_space changes causes a hang.

NeilBrown <neilb@xxxxxxx> · Mon, 10 Feb 2014 17:03:15 +1100

Hi,
 We have a customer who reports occasional but reproducible hangs on our 3.0
 based kernel.
 I managed to deduce that 

commit a9a6b52ee1baa865283a91eb8d443ee91adfca56
Author: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
Date:   Fri Feb 22 14:57:57 2013 -0500

    SUNRPC: Don't start the retransmission timer when out of socket space

was to blame (it got into our kernel through -stable ... not sure why it
deserved to be in -stable). Reverting that patch fixes the problem.  However I
don't fully understand why.

Analysing the logs suggests that the NFS client had tried to write a request
to the socket, failed as the transmit queue was temporarily full, and never
got the call-back when the queue emptied.  All other requests piled up behind
this one and all traffic stopped.

With the above commit reverted, the client will timeout and retry the
transmit even if the call-back hasn't arrived.  With the patch, it waits
indefinitely.

This suggests that there is a race in the SOCK_ASYNC_NOSPACE handling.

The most obvious place for a race would be in xs_nospace() between

		if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {

and before
			xprt_wait_for_buffer_space(task, xs_nospace_callback);

However xprt->transport_lock is held here, and xprt_write_space() will wait
for the lock before calling rpc_wake_up_queued_task(), so the wakeup cannot
happen before the task is put on the pending list.

My only remaining theory is that the space is freed before SOCK_NOSPACE is
set in xs_nospace().  I think this would mean that the callback never happens.

In net/ipv4/tcp.c, in tcp_poll() there is code:

			if (sk_stream_is_writeable(sk)) {
				mask |= POLLOUT | POLLWRNORM;
			} else {  /* send SIGIO later */
				set_bit(SOCK_ASYNC_NOSPACE,
					&sk->sk_socket->flags);
				set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

				/* Race breaker. If space is freed after
				 * wspace test but before the flags are set,
				 * IO signal will be lost.
				 */
				if (sk_stream_is_writeable(sk))
					mask |= POLLOUT | POLLWRNORM;
			}

which suggests there is room for a race, and seems to suggest that it is
safest to test sk_stream_is_writeable() after setting SOCK_NOSPACE.
This is what the sunrpc server side does for udp (svc_udp_has_wspace).
For tcp it doesn't something which I don't really understand, so maybe  I'm
missing something.

Any thoughts?  I could try asking the customer to test with an extra check
for sk_stream_is_writeable before calling xprt_wait_for_buffer_space(), but I
wouldn't be surprised if they'd rather not given they have a working solution.
Does someone understand this code enough to be confident that such a test
would be correct?

Thanks,
NeilBrown
Attachment:
signature.asc

Description: PGP signature