Re: PROBLEM: NFS client IO fails with ERESTARTSYS when another mount point with the same export is unmounted with force [NFS] [SUNRPC]

"NeilBrown" <neilb@xxxxxxx> · Wed, 28 Feb 2024 10:55:43 +1100

On Fri, 23 Feb 2024, Zhitao Li wrote:
> Thanks for Jeff's reply.
> 
> I did see  ERESTARTSYS in userland. As described in the above
> "Reproduction" chapter, "dd" fails with "dd: error writing
> '/mnt/test1/1G': Unknown error 512".
> 
> After strace "dd", it turns out that syscall WRITE fails with:
> write(1, "4\303\31\211\316\237\333\r-\275g\370\233\374X\277\374Tb\202\24\365\220\320\16\27o3\331q\344\364"...,
> 1048576) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> 
> In fact, other syscalls related to file systems can also fail with
> ERESTARTSYS in our cases, for example: mount, open, read, write and so
> on.
> 
> Maybe the reason is that on forced unmount, rpc_killall_tasks() in
> net/sunrpc/clnt.c will set all inflight IO with ERESTARTSYS, while no
> signal gets involved. So ERESTARTSYS is not handled before entering
> userspace.
> 
> Best regards,
> Zhitao Li at SmartX.
> 
> On Thu, Feb 22, 2024 at 7:05 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > On Wed, 2024-02-21 at 13:48 +0000, Trond Myklebust wrote:
> > > On Wed, 2024-02-21 at 16:20 +0800, Zhitao Li wrote:
> > > > [You don't often get email from zhitao.li@xxxxxxxxxx. Learn why this
> > > > is important at https://aka.ms/LearnAboutSenderIdentification ]
> > > >
> > > > Hi, everyone,
> > > >
> > > > - Facts:
> > > > I have a remote NFS export and I mount the same export on two
> > > > different directories in my OS with the same options. There is an
> > > > inflight IO under one mounted directory. And then I unmount another
> > > > mounted directory with force. The inflight IO ends up with "Unknown
> > > > error 512", which is ERESTARTSYS.
> > > >
> > >
> > > All of the above is well known. That's because forced umount affects
> > > the entire filesystem. Why are you using it here in the first place? It
> > > is not intended for casual use.
> > >
> >
> > While I agree Trond's above statement, the kernel is not supposed to
> > leak error codes that high into userland. Are you seeing ERESTARTSYS
> > being returned to system calls? If so, which ones?
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> 

I think this bug was introduced by 
Commit ae67bd3821bb ("SUNRPC: Fix up task signalling")
in Linux v5.2.

Prior to that commit, rpc_killall_tasks set the error to -EIO.
After that commit it calls rpc_signal_task which always uses
-ERESTARTSYS.

This might be an appropriate fix.  Can you please test and confirm?

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 2d61987b3545..ed3a116efd5d 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -222,7 +222,7 @@ void		rpc_put_task(struct rpc_task *);
 void		rpc_put_task_async(struct rpc_task *);
 bool		rpc_task_set_rpc_status(struct rpc_task *task, int rpc_status);
 void		rpc_task_try_cancel(struct rpc_task *task, int error);
-void		rpc_signal_task(struct rpc_task *);
+void		rpc_signal_task(struct rpc_task *, int);
 void		rpc_exit_task(struct rpc_task *);
 void		rpc_exit(struct rpc_task *, int);
 void		rpc_release_calldata(const struct rpc_call_ops *, void *);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index cda0935a68c9..cdbdfae13030 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -895,7 +895,7 @@ void rpc_killall_tasks(struct rpc_clnt *clnt)
 	trace_rpc_clnt_killall(clnt);
 	spin_lock(&clnt->cl_lock);
 	list_for_each_entry(rovr, &clnt->cl_tasks, tk_task)
-		rpc_signal_task(rovr);
+		rpc_signal_task(rovr, -EIO);
 	spin_unlock(&clnt->cl_lock);
 }
 EXPORT_SYMBOL_GPL(rpc_killall_tasks);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 6debf4fd42d4..e88621881036 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -852,14 +852,14 @@ void rpc_exit_task(struct rpc_task *task)
 	}
 }
 
-void rpc_signal_task(struct rpc_task *task)
+void rpc_signal_task(struct rpc_task *task, int sig)
 {
 	struct rpc_wait_queue *queue;
 
 	if (!RPC_IS_ACTIVATED(task))
 		return;
 
-	if (!rpc_task_set_rpc_status(task, -ERESTARTSYS))
+	if (!rpc_task_set_rpc_status(task, sig))
 		return;
 	trace_rpc_task_signalled(task, task->tk_action);
 	set_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
@@ -992,7 +992,7 @@ static void __rpc_execute(struct rpc_task *task)
 			 * clean up after sleeping on some queue, we don't
 			 * break the loop here, but go around once more.
 			 */
-			rpc_signal_task(task);
+			rpc_signal_task(task, -ERESTARTSYS);
 		}
 		trace_rpc_task_sync_wake(task, task->tk_action);
 	}