On Mon, 2013-01-07 at 13:58 -0500, Chris Perl wrote:
> (Apologies up front for the long email, just trying to be thorough)
>
> I have come across a hang in the Linux NFS client that is reproducible. I've
> been testing with a Fedora 17 system (3.6.10-2.fc17.x86_64), but I've also
> seen the same issues on CentOS 6.3 systems
> (2.6.32-279.5.2.el6.centos.plus.x86_64). I have not had a chance to test the
> latest 3.8 rc kernels yet, but quickly browsing their code, I didn't see
> major changes in the key areas to make me think it would make a difference
> (but I could certainly be wrong).
>
> Unfortunately, the problem is not easily reproducible. So far, the only way
> that I've been able to trigger it is with the attached OCaml program [1].
>
> Even more unfortunate, for you to compile and use that program, you'll need
> to have a functioning OCaml environment with Jane Street's Core and Async
> libraries available. I can give details about the easiest way to set that up
> if you want, but will hold off on that for now. Please let me know if you
> want these details.
>
> The idea of the program is that it spawns many threads that start stat'ing
> and reading files over NFS (the paths to each file are command line args)
> and then abruptly exits.
>
> Originally, I came across a system that hung and forced a crash dump of the
> kernel that I then inspected with crash [2]. With crash I could see that the
> rpc_task referenced by the snd_task member in the relevant rpc_xprt didn't
> exist.
>
> Then after talking with some coworkers (and inspecting several more hung
> systems), I eventually came up with a way to reproduce the problem.
> To reproduce the problem, you actually need to take the compiled OCaml
> program (which purposely exits very quickly after starting) and wrap it in a
> loop that will immediately restart it when it exits:
>
> $ while :; do ./hang.opt /mnt/nfs/file{01..10} >/dev/null 2>&1; done
>
> In this case, file01 through 10 are 10k text files filled entirely with the
> character "A". /mnt/nfs is the mount point with the following options:
>
> $ nfsstat -m
> /mnt/nfs from 127.0.0.1:/var/tmp
>  Flags: rw,relatime,vers=3,rsize=524288,wsize=524288,namelen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=20048,mountproto=tcp,local_lock=none,addr=127.0.0.1
>
> And the hang should happen within a few minutes (although sometimes it did
> take quite a while for it to hang).
>
> I believe the problem is a race condition where one thread that currently
> has the `rpc_xprt' for a given NFS mount locked tries to give it to another
> thread on the `sending' wait queue for that `rpc_xprt' (via
> __xprt_lock_write_next), but the thread that is presumably sleeping on that
> wait queue never made it to sleep because it received a SIGKILL before it
> could do so.
>
> Here is the interaction that I believe is causing the problem:
>
>   thread 1                                 thread 2
>   --------                                 --------
>   __rpc_execute
>   ...
>   call_reserve
>   xprt_reserve
>   xprt_lock_and_alloc_slot
>   xprt_lock_write
>   xprt_reserve_xprt (locks rpc_xprt)
>   xprt_alloc_slot
>                                            __rpc_execute
>                                            ...
>                                            call_reserve
>                                            xprt_reserve
>                                            xprt_lock_and_alloc_slot
>                                            xprt_lock_write
>                                            xprt_reserve_xprt
>                                            rpc_sleep_on ("sending" wait queue)
>                                            out_of_line_wait_on_bit
>   xprt_release_write
>   xprt_release_xprt
>   xprt_clear_locked
>   __xprt_lock_write_next
>                                            (receives SIGKILL from a 3rd thread)
>                                            rpc_wait_bit_killable
>                                            (will return ERESTARTSYS b/c of SIGNAL)
>   rpc_wake_up_first (wq:sending,
>                      f:__xprt_lock_write_func)
>   __xprt_lock_write_func (w/ task from sending)
>                                            rpc_exit
>                                            rpc_exit_task
>   ...                                      ...
>
> At this point the first thread has updated rpc_xprt->snd_task to point to
> the second thread, and left the rpc_xprt locked, but the second thread is
> exiting because of the SIGKILL. The real issue is the second thread was
> added to the "sending" wait queue (by rpc_sleep_on), but never actually made
> it to sleep.
>
> The net effect of this is that the transport remains locked by a phantom
> rpc_task. Any other requests that come in now just go to sleep on the
> rpc_xprt's sending wait queue, and no progress is made. Any mount point
> that winds up using this rpc_xprt is now hung.
>
> To help me understand what was happening, I used the attached systemtap
> script [3] to watch the system and upon each return from __rpc_execute,
> check to see if our task is the task currently listed in the associated
> rpc_xprt's snd_task member, as I believe this should never be the case.
> Running that script produced output like the following when it hung:
>
> [cperl@cesium ~/tmp/2013-01-05]$ sudo stap -DMAXMAPENTRIES=4096 -g ./catch_locked.stp --vp 00101
> Pass 3: translated to C into "/tmp/stapMZvjDF/stap_23115b4269a0722946b88ddf556678e4_18614_src.c" using 354096virt/32668res/6728shr/25096data kb, in 90usr/10sys/101real ms.
> Pass 5: starting run.
> 1357430209194595979: 30246: thread-pool thr -> __rpc_execute: task=0xffff88011f39ec00: GETATTR
> 1357430209194599327: 30246: thread-pool thr tk_callback: 0x0, tk_action: call_start
> 1357430209194602139: 30246: thread-pool thr tk_callback: 0x0, tk_action: call_reserve
> 1357430209195083066: 30246: thread-pool thr status: -512
> 1357430209195085888: 30246: thread-pool thr -> rpc_exit:
> 1357430209195088144: 30246: thread-pool thr tk_callback: 0x0, tk_action: rpc_exit_task
> 1357430209195090865: 30246: thread-pool thr tk_callback: 0x0, tk_action: 0x0
> 1357430209195093517: 30246: thread-pool thr <- __rpc_execute: tk_status: 0x0
> ^CPass 5: run completed in 10usr/30sys/33870real ms.
>
> After this happened, I used crash on the live system and indeed, this task
> was the one that was listed in rpc_xprt->snd_task for the transport. The
> rest of the circumstances of the hang I inferred by looking at the code, so
> I could definitely be wrong.
>
> The systemtap script places probe points at specific lines within the code
> (to be able to look at tk_action and tk_callback), so it would need to be
> tweaked if run on a different kernel than the Fedora 17 kernel I'm running
> it against.

Hi Chris,

Excellent sleuthing! Given the thoroughness of your explanation, I'm
pretty sure that the attached patch should fix the problem.
Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
From ec8cbb4aff21cd0eac2c6f3fc4273ac72cdd91ef Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
Date: Mon, 7 Jan 2013 14:30:46 -0500
Subject: [PATCH] SUNRPC: Ensure we release the socket write lock if the
 rpc_task exits early

If the rpc_task exits while holding the socket write lock before it has
allocated an rpc slot, then the usual mechanism for releasing the write
lock in xprt_release() is defeated.

The problem occurs if the call to xprt_lock_write() initially fails, so
that the rpc_task is put on the xprt->sending wait queue. If the task
exits after being assigned the lock by __xprt_lock_write_func, but
before it has retried the call to xprt_lock_and_alloc_slot(), then it
calls xprt_release() while holding the write lock, but will immediately
exit due to the test for task->tk_rqstp != NULL.

Reported-by: Chris Perl <chris.perl@xxxxxxxxx>
Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx [>= 3.1]
---
 net/sunrpc/xprt.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index bd462a5..6676457 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1136,10 +1136,12 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 void xprt_release(struct rpc_task *task)
 {
 	struct rpc_xprt	*xprt;
-	struct rpc_rqst	*req;
+	struct rpc_rqst	*req = task->tk_rqstp;
 
-	if (!(req = task->tk_rqstp))
+	if (req == NULL) {
+		xprt_release_write(task->tk_xprt, task);
 		return;
+	}
 
 	xprt = req->rq_xprt;
 	if (task->tk_ops->rpc_count_stats != NULL)
-- 
1.7.11.7