On Oct 25, 2012, at 2:17 PM, "Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> wrote:

> On Thu, 2012-10-25 at 14:02 -0400, Weston Andros Adamson wrote:
>> Calling nfs_kill_super from an RPC task callback would result in a deadlock
>> where nfs_free_server (via rpc_shutdown_client) tries to kill all
>> RPC tasks associated with that connection - including itself!
>>
>> Instead of calling nfs_kill_super directly, queue a job on the nfsiod
>> workqueue.
>>
>> Signed-off-by: Weston Andros Adamson <dros@xxxxxxxxxx>
>> ---
>>
>> This fixes the current incarnation of the lockup I've been tracking down
>> for some time now. I still have to go back and see why the reproducer
>> changed behavior a few weeks ago - tasks used to get stuck in
>> rpc_prepare_task, but now (before this patch) they get stuck in rpc_exit.
>>
>> The reproducer works against a server with write delegations:
>>
>>   ./nfsometer.py -m v4 server:/path dd_100m_100k
>>
>> which is basically:
>>  - mount
>>  - dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
>>  - umount
>>  - break if /proc/fs/nfsfs/servers still has an entry after 5 seconds
>>    (in this case it NEVER goes away)
>>
>> There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE -
>> the done handler calls nfs_sb_deactive...
>>
>> I've tested this approach with 10 runs X 3 nfs versions X 5 workloads
>> (dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
>> it's correct.
>>
>> One question for the list: should nfs_free_server *always* be scheduled on
>> the nfsiod workqueue? It's called in error paths in several locations.
>> After looking at them, I don't think my approach would break anything, but
>> some might have style objections.
>>
>
> This doesn't add up. There should be nothing calling nfs_sb_deactive()
> from a rpc_call_done() callback. If so, then that would be the bug.
>
> All calls to things like rpc_put_task(), put_nfs_open_context(), dput(),
> or nfs_sb_deactive() should occur in the rpc_call_release() callback if
> they can't be done in a process context. In both those cases, the
> rpc_task will be invisible to rpc_killall_tasks and rpc_shutdown_client.

Ah, I misunderstood what was going on here. nfs_kill_super *is* being called
from the rpc_release_calldata callback.

The kworker stuck in rpc_killall_tasks forever:

[ 34.552600]  [<ffffffffa00868e6>] rpc_killall_tasks+0x2d/0xcd [sunrpc]
[ 34.552608]  [<ffffffffa00883e4>] rpc_shutdown_client+0x4a/0xec [sunrpc]
[ 34.552615]  [<ffffffffa002b973>] nfs_free_server+0xcf/0x133 [nfs]
[ 34.552625]  [<ffffffffa0033193>] nfs_kill_super+0x37/0x3c [nfs]
[ 34.552629]  [<ffffffff81136c68>] deactivate_locked_super+0x37/0x63
[ 34.552633]  [<ffffffff8113785f>] deactivate_super+0x37/0x3b
[ 34.552642]  [<ffffffffa0034fc1>] nfs_sb_deactive+0x23/0x25 [nfs]
[ 34.552649]  [<ffffffffa00dbba2>] nfs4_free_closedata+0x53/0x63 [nfsv4]
[ 34.552661]  [<ffffffffa008f997>] rpc_release_calldata+0x17/0x19 [sunrpc]
[ 34.552671]  [<ffffffffa008f9f5>] rpc_free_task+0x5c/0x65 [sunrpc]
[ 34.552680]  [<ffffffffa008fe07>] rpc_async_release+0x15/0x17 [sunrpc]
[ 34.552684]  [<ffffffff810632b7>] process_one_work+0x192/0x2a0
[ 34.552693]  [<ffffffffa008fdf2>] ? rpc_async_schedule+0x33/0x33 [sunrpc]
[ 34.552697]  [<ffffffff81064169>] worker_thread+0x140/0x1d7
[ 34.552700]  [<ffffffff81064029>] ? manage_workers+0x23b/0x23b
[ 34.552704]  [<ffffffff81067d21>] kthread+0x8d/0x95
[ 34.552708]  [<ffffffff81067c94>] ? kthread_freezable_should_stop+0x43/0x43
[ 34.552713]  [<ffffffff814ef1ac>] ret_from_fork+0x7c/0xb0
[ 34.552717]  [<ffffffff81067c94>] ? kthread_freezable_should_stop+0x43/0x43

And the client's task list:

[ 174.574006] -pid- flgs status -client- --rqstp- -timeout ---ops--
[ 174.574019]  1664 0181     -5 ffff880226474600   (null)        0 ffffffffa00f7ce0 nfsv4 DELEGRETURN a:rpc_exit_task [sunrpc] q:none

So it looks like a CLOSE's rpc_release_calldata is triggering nfs_kill_super,
which is then stuck trying to kill the DELEGRETURN task - a task that never
gets to run.

I've debugged this from the workqueue side: the DELEGRETURN work is scheduled,
but ends up having insert_wq_barrier called on it. As far as I can tell, this
means the workqueue is enforcing the ordering that the CLOSE work must
complete before the DELEGRETURN work can proceed -- and *that* is the
deadlock: CLOSE waits until DELEGRETURN is dead, but DELEGRETURN can't run
until CLOSE is complete.
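For illustration, here is a toy sketch of the shape of the problem as I
understand it. To be clear, this is made up and not the real sunrpc code:
the names are invented, and it uses an ordered (single-worker) workqueue to
force the situation rather than the real nfsiod/rpciod setup.

/* Toy sketch only: "close_work" stands in for CLOSE's rpc_async_release,
 * "delegreturn_work" for DELEGRETURN's rpc_async_schedule. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

static struct workqueue_struct *toy_wq;
static struct work_struct close_work;
static struct work_struct delegreturn_work;
static DECLARE_COMPLETION(delegreturn_done);

static void delegreturn_fn(struct work_struct *work)
{
	/* Never runs: the only worker is busy in close_fn() below. */
	complete(&delegreturn_done);
}

static void close_fn(struct work_struct *work)
{
	/* Queue the second work item behind us on the same workqueue... */
	queue_work(toy_wq, &delegreturn_work);

	/* ...then wait for it from inside the only worker that could ever
	 * run it.  Same shape as rpc_shutdown_client() spinning in
	 * rpc_killall_tasks() waiting for DELEGRETURN: deadlock. */
	wait_for_completion(&delegreturn_done);
}

static int __init toy_init(void)
{
	toy_wq = alloc_ordered_workqueue("toy_wq", 0);
	if (!toy_wq)
		return -ENOMEM;

	INIT_WORK(&close_work, close_fn);
	INIT_WORK(&delegreturn_work, delegreturn_fn);
	queue_work(toy_wq, &close_work);
	return 0;
}

static void __exit toy_exit(void)
{
	destroy_workqueue(toy_wq);
}

module_init(toy_init);
module_exit(toy_exit);
MODULE_LICENSE("GPL");

(In the real trace the ordering apparently comes from insert_wq_barrier
rather than from an ordered workqueue, but the effect on the stuck worker
looks the same: it can never reach the work it is waiting for.)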
That ordering would also explain our (Trond's and my) failed attempts at
canceling / rescheduling jobs in killall_tasks -- insert_wq_barrier's comment
states:

 * Currently, a queued barrier can't be canceled. This is because
 * try_to_grab_pending() can't determine whether the work to be
 * grabbed is at the head of the queue and thus can't clear LINKED
 * flag of the previous work while there must be a valid next work
 * after a work with LINKED flag set.

Now that I have a better understanding of what's happening, I'll go back to
the drawing board.

Thanks!
-dros