I found a fix for a race that sounds very similar to what I am seeing. I am not 100% sure yet but I am reporting progress here in case anyone is looking into this. commit d48f9ce73c997573e1b512893fa6eddf353a6f69 Author: David Vrabel <david.vrabel@xxxxxxxxxx> Date: Mon Sep 19 13:58:30 2016 +0100 sunrpc: fix write space race causing stalls Write space becoming available may race with putting the task to sleep in xprt_wait_for_buffer_space(). The existing mechanism to avoid the race does not work. This (edited) partial trace illustrates the problem: [1] rpc_task_run_action: task:43546@5 ... action=call_transmit [2] xs_write_space <-xs_tcp_write_space [3] xprt_write_space <-xs_write_space [4] rpc_task_sleep: task:43546@5 ... [5] xs_write_space <-xs_tcp_write_space [1] Task 43546 runs but is out of write space. [2] Space becomes available, xs_write_space() clears the SOCKWQ_ASYNC_NOSPACE bit. [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but this has not yet been queued and the wake up is lost. [4] xs_nospace() is called which calls xprt_wait_for_buffer_space() which queues task 43546. [5] The call to sk->sk_write_space() at the end of xs_nospace() (which is supposed to handle the above race) does not call xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and thus the task is not woken. Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace() so the second call to sk->sk_write_space() calls xprt_write_space(). Suggested-by: Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> Signed-off-by: David Vrabel <david.vrabel@xxxxxxxxxx> cc: stable@xxxxxxxxxxxxxxx # 4.4 Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx> Igor On Thu, May 4, 2017 at 9:22 AM, Igor O. <igoros@xxxxxxxxx> wrote: > Hi all, > > I am trying to understand an unresponsive NFS client issue that I can > reproduce in a particular workload running for a few hours. The VM > running the NFS client is Ubuntu 14.04.5, so this is using the > 4.4.0-34 kernel. > > Based on rpc_debug traces, I have a possible hypothesis on how this > can happen as a result of a race on the xprt_pending queue. I was > wondering whether anyone here can comment on whether the hypothesis is > plausible or not. > > The client is stuck in the following state: > > * There is one pending RPC (61009) that began transmitting, but > the transmission was incomplete due to EAGAIN, and will never complete > + RPC 61009 is forever reported as pending by > sunrpc/rpc_clnt/*/tasks in debugfs > + Obviously no other RPC can transmit, since the transport is > locked by RPC 61009 > > * But, there is space in the output TCP buffer: > + Output TCP socket buffer is empty, according to netstat > + Server is reporting non-zero TCP window to the client, > according to tcpdump > > This is the rpc_debug trace of how RPC 61009 got into this state: > > May 4 06:12:02 ub14045-gold kernel: [155440.176714] RPC: 61009 > xprt_transmit(524448) > May 4 06:12:02 ub14045-gold kernel: [155440.176734] RPC: > xs_tcp_send_request(524448) = 0 > May 4 06:12:02 ub14045-gold kernel: [155440.176735] RPC: > xs_tcp_send_request(261456) = -11 > May 4 06:12:02 ub14045-gold kernel: [155440.176738] RPC: 61009 > xmit incomplete (261456 left of 524448) > May 4 06:12:02 ub14045-gold kernel: [155440.176771] RPC: > write space: waking waiting task on xprt ffff88003d1e7000 > May 4 06:12:02 ub14045-gold kernel: [155440.176778] RPC: 61009 > sleep_on(queue "xprt_pending" time 4333753084) > May 4 06:12:02 ub14045-gold kernel: [155440.176778] RPC: 61009 > added to queue ffff88003d1e7258 "xprt_pending" > > In other words: > > 1. 61009 gets EAGAIN, reports xmit incomplete > 2. In race, write space becomes available, the xprt_pending queue > tasks get woken. The queue is empty, though, because 61009 hasn't > queued itself yet. > 3. 61009 queues itself on xprt_pending queue, but it just missed > the notification. > 4. 61009 is forever stuck on xprt_pending queue. There won't be > any future "write space" notification and it will never time out. > > At this point, there isn't any notification that will wake 61009 on > the xprt_pending queue. Since I have a hard mount, there will be no > timeout on the wait, as implemented in xprt_wait_for_buffer_space(). > > Can anyone provide insight on whether the xprt_pending queue handles > the above race condition? Is there another mechanism apart from "write > space" notification (that can be missed) and timeout (not scheduled > due to hard mount) that should still wake the RPC from the > xprt_pending queue? > > Apart from that, any suggested next steps to investigate this? > > Thank you for reading, > Igor Ostrovsky -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html