Re: NFS deadlock between 'sync' and commit after unmount....

Jan Kara <jack@xxxxxxx> · Thu, 10 Apr 2014 23:25:03 +0200

On Mon 07-04-14 19:07:35, Trond Myklebust wrote:
> On Apr 7, 2014, at 18:35, Jan Kara <jack@xxxxxxx> wrote:
> 
> > On Mon 07-04-14 18:02:16, Trond Myklebust wrote:
> >> On Mon, 2014-04-07 at 22:27 +0200, Jan Kara wrote:
> >>> On Mon 07-04-14 10:10:27, Trond Myklebust wrote:
> >>>> On Apr 6, 2014, at 23:50, NeilBrown <neilb@xxxxxxx> wrote:
> >>>>> I've just hit a deadlock in NFS that seems very strange.
> >>>>> The kernel is 3.14-rc8 which some local changes which shouldn't affect the
> >>>>> deadlocking code.
> >>>>> 
> >>>>> Shortly after umounting the NFS filesystem with "umount -f" (though I don't
> >>>>> think the -f is important), I ran "sync".
> >>>>> 
> >>>>> The sync is now stuck in
> >>>>> 
> >>>>> [<ffffffff81197fc1>] sync_inodes_sb+0xa1/0x1c0
> >>>>> [<ffffffff8119cd99>] sync_inodes_one_sb+0x19/0x20
> >>>>> [<ffffffff81173372>] iterate_supers+0xb2/0x110
> >>>>> [<ffffffff8119cfd0>] sys_sync+0x30/0x90
> >>>>> [<ffffffff81aa4622>] system_call_fastpath+0x16/0x1b
> >>>>> [<ffffffffffffffff>] 0xffffffffffffffff
> >>>>> 
> >>>>> while kworker/u16:1 is stuck:
> >>>>> 
> >>>>> [<ffffffff815420b3>] call_rwsem_down_write_failed+0x13/0x20
> >>>>> [<ffffffff81172889>] deactivate_super+0x39/0x60
> >>>>> [<ffffffff812d56f1>] nfs_sb_deactive+0x21/0x30
> >>>>> [<ffffffff812d2ef9>] __put_nfs_open_context+0xc9/0x100
> >>>>> [<ffffffff812d2f3b>] put_nfs_open_context+0xb/0x10
> >>>>> [<ffffffff812ddd14>] nfs_commitdata_release+0x14/0x30
> >>>>> [<ffffffff812ddd4a>] nfs_commit_release+0x1a/0x20
> >>>>> [<ffffffff81a45a05>] rpc_free_task+0x25/0x70
> >>>>> [<ffffffff81a45fd8>] rpc_do_put_task+0x78/0x80
> >>>>> [<ffffffff81a45feb>] rpc_put_task+0xb/0x10
> >>>>> [<ffffffff812de3fe>] nfs_initiate_commit+0xce/0x110
> >>>>> [<ffffffff812df112>] nfs_commit_list+0x62/0x90
> >>>>> [<ffffffff812dfd26>] nfs_commit_inode+0xa6/0x170
> >>>>> [<ffffffff812dfe4d>] nfs_write_inode+0x5d/0xa0
> >>>>> [<ffffffff81300d69>] nfs4_write_inode+0x9/0x10
> >>>>> [<ffffffff811978ec>] __writeback_single_inode+0x10c/0x2c0
> >>>>> [<ffffffff811987ea>] writeback_sb_inodes+0x2ca/0x450
> >>>>> [<ffffffff81198b2c>] wb_writeback+0xec/0x320
> >>>>> [<ffffffff81199365>] bdi_writeback_workfn+0x115/0x4c0
> >>>>> [<ffffffff810a595b>] process_one_work+0x16b/0x430
> >>>>> [<ffffffff810a6619>] worker_thread+0x119/0x3a0
> >>>>> [<ffffffff810ac2bd>] kthread+0xcd/0xf0
> >>>>> [<ffffffff81aa457c>] ret_from_fork+0x7c/0xb0
> >>>>> [<ffffffffffffffff>] 0xffffffffffffffff
> >>>>> 
> >>>>> 
> >>>>> So sync is holding sb->s_umount, queued some bdi work on the filesystem
> >>>>> and is waiting for it to complete.  Mean while, that work has (I think)
> >>>>> submitted a 'commit' (via ->write_inode) and that commit wants to
> >>>>> deactivate_super and so needs to get ->s_umount.
> >>>>> 
> >>>>> I suspect this could happen even more easily with a lazy unmount.
> >>>>> 
> >>>>> It seems that this commit request is that last thing that is keeping
> >>>>> ->s_active elevated and it deadlocks trying to drop the last s_active.
> >>>>> 
> >>>>> I have no idea how to fix it....  help?
> >>>>> 
> >>>> 
> >>>> The problem seems to be the use of iterate_supers(), which grabs a
> >>>> passive reference, and conflicts with our use of an active reference in
> >>>> the open context.
> >>>  Yeah, we cannot really do otherwise in iterate_supers() - we have to grab
> >>> some superblock reference and we don't really want to get an active one
> >>> since that would result in spurious EBUSY returns from umount.
> >>> 
> >>> Cannot we just punt the deactivate_super() call to a workqueue to avoid
> >>> this deadlock? It's a bit ugly but it should do the trick. Or is
> >>> nfs_sb_deactive() called too often and we'd see some adverse effects for
> >>> that? We could also offload it to workqueue only in the special case where
> >>> sb->s_active == 1. That should be really rare then but it's a bit ugly
> >>> poking in VFS internals.
> >> 
> >> The activate/deactivate super is basically there to save our bacon when
> >> NFS file state extends beyond the usual VFS path walk, open() and
> >> close(). Examples include sillyrename and NFSv4 delegations. Even
> >> ordinary read and write state can extend beyond close() if the user
> >> decides to 'kill -9' in the wrong places.
> >> In most of these situations, we need to keep a dentry around until we're
> >> finished, which means that we want to keep the super block alive too.
> >  Yeah, that makes sense. But offloading dropping of sb reference to a
> > workqueue would work then, wouldn't it?
> 
> Could we perhaps have a helper in the VFS that can optimise away the case
> where s->s_active > 1?
  I'm not sure how you'd imagine the optimisation in VFS. But what I had in
mind was something like:

void nfs_deactivate_super(struct super_block *sb)
{
	if (!atomic_add_unless(&sb->s_active, -1, 1)) {
		/*
		 * Postpone deactivation to workqueue to avoid deadlocking
		 * on s_umount semaphore - we can get here when trying to
		 * complete sync(2) request for forcefully unmounted
		 * filesystem.
		 */
		schedule_work(&NFS_SB(sb)->deactivate_work);
	}
}

static void nfs_deactivate_sb_work(struct work_struct *work)
{
	struct super_block *sb = container_of(work, struct nfs_server,
					      deactivate_work)->super;

	deactivate_super(sb);
}

in nfs_initialise_sb():
	INIT_WORK(&NFS_SB(sb)->deactivate_work, nfs_deactivate_sb_work);

and then use nfs_deactive_super() instead of deactivate_super(). That
should do the trick and do the offloading only if we are really dropping
the last reference.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html