On Mon, 2013-01-07 at 13:58 -0500, Chris Perl wrote:
> (Apologies up front for the long email, just trying to be thorough)
>
> I have come across a hang in the Linux NFS client that is reproducible. I've
> been testing with a Fedora 17 system (3.6.10-2.fc17.x86_64), but I've also
> seen the same issues on CentOS 6.3 systems
> (2.6.32-279.5.2.el6.centos.plus.x86_64). I have not had a chance to test the
> latest 3.8 rc kernels yet, but quickly browsing their code, I didn't see
> major changes in the key areas to make me think it would make a difference
> (but I could certainly be wrong).
>
> Unfortunately, the problem is not easily reproducible. So far, the only way
> that I've been able to trigger it is with the attached OCaml program [1].
>
> Even more unfortunate, for you to compile and use that program, you'll need
> to have a functioning OCaml environment with Jane Street's Core and Async
> libraries available. I can give details about the easiest way to set that up
> if you want, but will hold off on that for now. Please let me know if you
> want these details.
>
> The idea of the program is that it spawns many threads that start stat'ing
> and reading files over NFS (the paths to each file are command line args)
> and then abruptly exits.
>
> Originally, I came across a system that hung and forced a crash dump of the
> kernel that I then inspected with crash [2]. With crash I could see that the
> rpc_task referenced by the snd_task member in the relevant rpc_xprt didn't
> exist.
>
> Then after talking with some coworkers (and inspecting several more hung
> systems), I eventually came up with a way to reproduce the problem.
> To reproduce the problem, you actually need to take the compiled OCaml
> program (which purposely exits very quickly after starting) and wrap it in a
> loop that will immediately restart it when it exits:
>
> $ while :; do ./hang.opt /mnt/nfs/file{01..10} >/dev/null 2>&1; done
>
> In this case, file01 through 10 are 10k text files filled entirely with the
> character "A". /mnt/nfs is the mount point with the following options:
>
> $ nfsstat -m
> /mnt/nfs from 127.0.0.1:/var/tmp
>  Flags: rw,relatime,vers=3,rsize=524288,wsize=524288,namelen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=20048,mountproto=tcp,local_lock=none,addr=127.0.0.1
>
> And the hang should happen within a few minutes (although sometimes it did
> take quite a while for it to hang).
>
> I believe the problem is a race condition where one thread that currently
> has the `rpc_xprt' for a given NFS mount locked tries to give it to another
> thread on the `sending' wait queue for that `rpc_xprt' (via
> __xprt_lock_write_next), but the thread that is presumably sleeping on that
> wait queue never made it to sleep because it received a SIGKILL before it
> could do so.
>
> Here is the interaction that I believe is causing the problem:
>
>   thread 1                                 thread 2
>   --------                                 --------
>   __rpc_execute
>   ...
>   call_reserve
>   xprt_reserve
>   xprt_lock_and_alloc_slot
>   xprt_lock_write
>   xprt_reserve_xprt (locks rpc_xprt)
>   xprt_alloc_slot
>                                            __rpc_execute
>                                            ...
>                                            call_reserve
>                                            xprt_reserve
>                                            xprt_lock_and_alloc_slot
>                                            xprt_lock_write
>                                            xprt_reserve_xprt
>                                            rpc_sleep_on ("sending" wait queue)
>                                            out_of_line_wait_on_bit
>   xprt_release_write
>   xprt_release_xprt
>   xprt_clear_locked
>   __xprt_lock_write_next
>                                            (receives SIGKILL from a 3rd thread)
>                                            rpc_wait_bit_killable
>                                            (will return ERESTARTSYS b/c of SIGNAL)
>   rpc_wake_up_first (wq:sending,
>                      f:__xprt_lock_write_func)
>   __xprt_lock_write_func (w/ task from sending)
>                                            rpc_exit
>                                            rpc_exit_task
>   ...                                      ...
>
> At this point the first thread has updated rpc_xprt->snd_task to point to
> the second thread, and left the rpc_xprt locked, but the second thread is
> exiting because of the SIGKILL. The real issue is the second thread was
> added to the "sending" wait queue (by rpc_sleep_on), but never actually made
> it to sleep.
>
> The net effect of this is that the transport remains locked by a phantom
> rpc_task. Any other requests that come in now just go to sleep on the
> rpc_xprt's sending wait queue, and no progress is made. Any mount point
> that winds up using this rpc_xprt is now hung.
>
> To help me understand what was happening, I used the attached systemtap
> script [3] to watch the system and upon each return from __rpc_execute,
> check to see if our task is the task currently listed in the associated
> rpc_xprt's snd_task member, as I believe this should never be the case.
> Running that script produced output like the following when it hung:
>
> [cperl@cesium ~/tmp/2013-01-05]$ sudo stap -DMAXMAPENTRIES=4096 -g ./catch_locked.stp --vp 00101
> Pass 3: translated to C into "/tmp/stapMZvjDF/stap_23115b4269a0722946b88ddf556678e4_18614_src.c" using 354096virt/32668res/6728shr/25096data kb, in 90usr/10sys/101real ms.
> Pass 5: starting run.
> 1357430209194595979: 30246: thread-pool thr -> __rpc_execute: task=0xffff88011f39ec00: GETATTR
> 1357430209194599327: 30246: thread-pool thr tk_callback: 0x0, tk_action: call_start
> 1357430209194602139: 30246: thread-pool thr tk_callback: 0x0, tk_action: call_reserve
> 1357430209195083066: 30246: thread-pool thr status: -512
> 1357430209195085888: 30246: thread-pool thr -> rpc_exit:
> 1357430209195088144: 30246: thread-pool thr tk_callback: 0x0, tk_action: rpc_exit_task
> 1357430209195090865: 30246: thread-pool thr tk_callback: 0x0, tk_action: 0x0
> 1357430209195093517: 30246: thread-pool thr <- __rpc_execute: tk_status: 0x0
> ^CPass 5: run completed in 10usr/30sys/33870real ms.
>
> After this happened, I used crash on the live system and indeed, this task
> was the one that was listed in rpc_xprt->snd_task for the transport. The
> rest of the circumstances of the hang I inferred by looking at the code, so
> I could definitely be wrong.
>
> The systemtap script places probe points at specific lines within the code
> (to be able to look at tk_action and tk_callback), so it would need to be
> tweaked if run on a different kernel than the Fedora 17 kernel I'm running
> it against.

Hi Chris,

Excellent sleuthing! Given the thoroughness of your explanation, I'm
pretty sure that the attached patch should fix the problem.
Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
From ec8cbb4aff21cd0eac2c6f3fc4273ac72cdd91ef Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
Date: Mon, 7 Jan 2013 14:30:46 -0500
Subject: [PATCH] SUNRPC: Ensure we release the socket write lock if the
 rpc_task exits early

If the rpc_task exits while holding the socket write lock before it has
allocated an rpc slot, then the usual mechanism for releasing the write
lock in xprt_release() is defeated.

The problem occurs if the call to xprt_lock_write() initially fails, so
that the rpc_task is put on the xprt->sending wait queue. If the task
exits after being assigned the lock by __xprt_lock_write_func, but
before it has retried the call to xprt_lock_and_alloc_slot(), then it
calls xprt_release() while holding the write lock, but will immediately
exit due to the test for task->tk_rqstp != NULL.

Reported-by: Chris Perl <chris.perl@xxxxxxxxx>
Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx [>= 3.1]
---
 net/sunrpc/xprt.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index bd462a5..6676457 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1136,10 +1136,12 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 void xprt_release(struct rpc_task *task)
 {
 	struct rpc_xprt	*xprt;
-	struct rpc_rqst	*req;
+	struct rpc_rqst	*req = task->tk_rqstp;
 
-	if (!(req = task->tk_rqstp))
+	if (req == NULL) {
+		xprt_release_write(task->tk_xprt, task);
 		return;
+	}
 
 	xprt = req->rq_xprt;
 	if (task->tk_ops->rpc_count_stats != NULL)
-- 
1.7.11.7