On 16 Jul 2024, at 9:23, Trond Myklebust wrote: > On Fri, 2024-07-12 at 09:55 -0400, Benjamin Coddington wrote: >> We've observed NFS clients with sync tasks sleeping in __rpc_execute >> waiting on RPC_TASK_QUEUED that have not responded to a wake-up from >> rpc_make_runnable(). I suspect this problem usually goes unnoticed, >> because on a busy client the task will eventually be re-awoken by >> another >> task completion or xprt event. However, if the state manager is >> draining >> the slot table, a sync task missing a wake-up can result in a hung >> client. >> >> We've been able to prove that the waker in rpc_make_runnable() >> successfully >> calls wake_up_bit() (ie- there's no race to tk_runstate), but the >> wake_up_bit() call fails to wake the waiter. I suspect the waker is >> missing the load of the bit's wait_queue_head, so waitqueue_active() >> is >> false. There are some very helpful comments about this problem above >> wake_up_bit(), prepare_to_wait(), and waitqueue_active(). >> >> Fix this by inserting smp_mb() before the wake_up_bit(), which pairs >> with >> prepare_to_wait() calling set_current_state(). >> >> Signed-off-by: Benjamin Coddington <bcodding@xxxxxxxxxx> >> --- >> net/sunrpc/sched.c | 5 ++++- >> 1 file changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c >> index 6debf4fd42d4..34b31be75497 100644 >> --- a/net/sunrpc/sched.c >> +++ b/net/sunrpc/sched.c >> @@ -369,8 +369,11 @@ static void rpc_make_runnable(struct >> workqueue_struct *wq, >> if (RPC_IS_ASYNC(task)) { >> INIT_WORK(&task->u.tk_work, rpc_async_schedule); >> queue_work(wq, &task->u.tk_work); >> - } else >> + } else { >> + /* paired with set_current_state() in >> prepare_to_wait */ >> + smp_mb(); > > Hmm... Why isn't it sufficient to use smp_mb__after_atomic() here? > That's what clear_and_wake_up_bit() uses in this case. Great question, one that I can't answer with confidence (which is why I solicited advice under the RFC posting). I ended up following the guidance in the comment above wake_up_bit(), since we clear RPC_TASK_RUNNING non-atomically within the queue's spinlock. It states that we'd probably need smp_mb(). However - this race isn't to tk_runstate, its actually to waitqueue_active() being true/false. So - I believe unless we're checking the bit in the waiter's wait_bit_action function, we really only want to look at the race in the terms of making sure the waiter's setting of the wait_queue_head is visible after the waker tests_and_sets RPC_TASK_RUNNING. Ben