On Fri, 13 Aug 2021, J. Bruce Fields wrote:
> On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> >
> > The problem here appears to be that a signalled task is being retried
> > without clearing the SIGNALLED flag. That is causing the infinite loop
> > and the soft lockup.
> >
> > This bug appears to have been introduced in Linux 5.2 by
> > Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")
>
> I wonder how we arrived here. Does it require that an rpc task returns
> from one of those rpc_delay() calls just as rpc_shutdown_client() is
> signalling it? That's the only way async tasks get signalled, I think.

I don't think "just as" is needed.
I think it could only happen if rpc_shutdown_client() were called when
there were active tasks - presumably from nfsd4_process_cb_update(), but
I don't know the callback code well.
If any of those active tasks has a ->done handler which might try to
reschedule the task when tk_status == -ERESTARTSYS, then you get into
the infinite loop.

>
> > Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
> > cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
> > After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
> > cleared.
> >
> > A fix might be to clear RPC_TASK_SIGNALLED in
> > rpc_reset_task_statistics(), but I'll leave that decision to someone
> > else.
>
> Might be worth testing with that change just to verify that this is
> what's happening.
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index c045f63d11fa..caa931888747 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -813,7 +813,8 @@ static void
>  rpc_reset_task_statistics(struct rpc_task *task)
>  {
>  	task->tk_timeouts = 0;
> -	task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> +	task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
> +							RPC_TASK_SENT);

NONONONONO.
RPC_TASK_SIGNALLED is a flag in tk_runstate.
So you need
    clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);

NeilBrown
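To spell out why the mask version cannot work, here is a small user-space
sketch. It is not kernel code: every demo_* name is hypothetical, the
constants are illustrative stand-ins (check include/linux/sunrpc/sched.h
for the real values), and the bit helpers are plain non-atomic imitations
of set_bit()/clear_bit()/test_bit(). The point is only that a bit *number*
meant for tk_runstate, when OR-ed into a tk_flags *mask*, clears unrelated
tk_flags bits and leaves tk_runstate untouched.

```c
#include <assert.h>

/* Illustrative values only, assumed for this sketch. */
#define DEMO_CALL_MAJORSEEN 0x0020UL /* bit MASK, lives in tk_flags */
#define DEMO_TASK_SENT      0x0800UL /* bit MASK, lives in tk_flags */
#define DEMO_TASK_SIGNALLED 6        /* bit NUMBER, lives in tk_runstate */

/* Hypothetical cut-down stand-in for struct rpc_task. */
struct demo_task {
	unsigned long tk_flags;
	unsigned long tk_runstate;
};

/* Non-atomic imitations of the kernel bit operations. */
static void demo_set_bit(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static void demo_clear_bit(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

static int demo_test_bit(int nr, const unsigned long *addr)
{
	return (int)((*addr >> nr) & 1UL);
}

/* Mask-style reset: treats the bit number 6 as the mask 0x6. */
static void demo_reset_wrong(struct demo_task *t)
{
	t->tk_flags &= ~(DEMO_CALL_MAJORSEEN | DEMO_TASK_SIGNALLED |
			 DEMO_TASK_SENT);
}

/* Reset that clears the signalled bit in the word that holds it. */
static void demo_reset_right(struct demo_task *t)
{
	t->tk_flags &= ~(DEMO_CALL_MAJORSEEN | DEMO_TASK_SENT);
	demo_clear_bit(DEMO_TASK_SIGNALLED, &t->tk_runstate);
}
```

With a task whose tk_runstate has the signalled bit set, demo_reset_wrong()
leaves that bit set and also wipes bits 1 and 2 of tk_flags, whereas
demo_reset_right() clears the signalled bit and leaves the other tk_flags
bits alone.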