Re: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.

NeilBrown <neilb@xxxxxxx> · Mon, 22 Sep 2014 11:37:07 +1000

On Thu, 18 Sep 2014 08:01:07 -0400 Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx>
wrote:

> On Thu, 18 Sep 2014 16:03:17 +1000
> NeilBrown <neilb@xxxxxxx> wrote:
> 
> > Support for loop-back mounted NFS filesystems is useful when NFS is
> > used to access shared storage in a high-availability cluster.
> > 
> > If the node running the NFS server fails, some other node can mount the
> > filesystem and start providing NFS service.  If that node already had
> > the filesystem NFS mounted, it will now have it loop-back mounted.
> > 
> > nfsd can suffer a deadlock when allocating memory and entering direct
> > reclaim.
> > While direct reclaim does not write to the NFS filesystem it can send
> > and wait for a COMMIT through nfs_release_page().
> > 
> > This patch modifies nfs_release_page() to wait a limited time for the
> > commit to complete - one second.  If the commit doesn't complete
> > in this time, nfs_release_page() will fail.  This means it might now
> > fail in some cases where it wouldn't before.  These cases are only
> > when 'gfp' includes '__GFP_WAIT'.
> > 
> > nfs_release_page() is only called by try_to_release_page(), and that
> > can only be called on an NFS page with required 'gfp' flags from
> >  - page_cache_pipe_buf_steal() in splice.c
> >  - shrink_page_list() in vmscan.c
> >  - invalidate_inode_pages2_range() in truncate.c
> > 
> > The first two handle failure quite safely.  The last is only called
> > after ->launder_page() has been called, and that will have waited
> > for the commit to finish already.
> > 
> > So aborting if the commit takes longer than 1 second is perfectly safe.
> > 
> > If nfs_release_page() is called on a sequence of pages which are all
> > in the same file which is blocked on COMMIT, each page could
> > contribute a 1 second delay which could be come excessive.  I have
> > seen delays of as much as 208 seconds.
> > 
> > To keep the delay to one second, the bdi is marked as write-congested
> > if the commit didn't finished.  Once it does finish, the
> > write-congested flag will be cleared.
> > 
> > With this, the longest total delay in try_to_free_pages that I have
> > seen in under 3 seconds.  With no waiting in nfs_release_page at all
> > I have seen delays of nearly 1.5 seconds.
> > 
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > ---
> >  fs/nfs/file.c  |   30 ++++++++++++++++++++----------
> >  fs/nfs/write.c |    7 +++++++
> >  2 files changed, 27 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > index 524dd80d1898..febba950d8a6 100644
> > --- a/fs/nfs/file.c
> > +++ b/fs/nfs/file.c
> > @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> >  
> >  	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
> >  
> > -	/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> > -	 * doing this memory reclaim for a fs-related allocation.
> > +	/* Always try to initiate a 'commit' if relevant, but only
> > +	 * wait for it if __GFP_WAIT is set and the calling process is
> > +	 * allowed to block.  Even then, only wait 1 second and only
> > +	 * if the 'bdi' is not congested.
> > +	 * Waiting indefinitely can cause deadlocks when the NFS
> > +	 * server is on this machine, and there is no particular need
> > +	 * to wait extensively here.  A short wait has the benefit
> > +	 * that someone else can worry about the freezer.
> >  	 */
> > -	if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> > -	    !(current->flags & PF_FSTRANS)) {
> > -		int how = FLUSH_SYNC;
> > -
> > -		/* Don't let kswapd deadlock waiting for OOM RPC calls */
> > -		if (current_is_kswapd())
> > -			how = 0;
> > -		nfs_commit_inode(mapping->host, how);
> > +	if (mapping) {
> > +		struct nfs_server *nfss = NFS_SERVER(mapping->host);
> > +		nfs_commit_inode(mapping->host, 0);
> > +		if ((gfp & __GFP_WAIT) &&
> > +		    !current_is_kswapd() &&
> > +		    !(current->flags & PF_FSTRANS) &&
> > +		    !bdi_write_congested(&nfss->backing_dev_info))
> > +			wait_on_page_bit_killable_timeout(page, PG_private,
> > +							  HZ);
> > +		if (PagePrivate(page))
> > +			set_bdi_congested(&nfss->backing_dev_info,
> > +					  BLK_RW_ASYNC);
> 
> I've never had a great feel for the BDI congestion stuff, but won't
> this have some unintended effects?
> 
> For instance, suppose the VM decides to try to free this page and
> passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the
> COMMIT, but don't wait for it. The COMMIT is actually going to go
> reasonably fast, but we now set the BDI congested because we didn't
> wait for it to occur.
> 
> That in turn causes writeout for other inodes on this BDI to get
> throttled even though there really is no congestion. It just looks that
> way due to how releasepage got called.
> 
> Am I making mountains out of molehills here?

Excellent molehill - thanks :-)

I was being lazy.  The 'if (PagePrivate())' should really be inside the
other if statement with the wait_on_page_bit...().  I've moved it there.
Once I get an Ack for the mm bits I'll report it all.

Thanks,
NeilBrown

(Molehills are worse for your suspension than mountains!)

> 
> >  	}
> >  	/* If PagePrivate() is set, then the page is not freeable */
> >  	if (PagePrivate(page))
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 175d5d073ccf..3066c7fcb565 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> >  		if (likely(!PageSwapCache(head->wb_page))) {
> >  			set_page_private(head->wb_page, 0);
> >  			ClearPagePrivate(head->wb_page);
> > +			smp_mb__after_atomic();
> > +			wake_up_page(head->wb_page, PG_private);
> >  			clear_bit(PG_MAPPED, &head->wb_flags);
> >  		}
> >  		nfsi->npages--;
> > @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> >  	struct nfs_page	*req;
> >  	int status = data->task.tk_status;
> >  	struct nfs_commit_info cinfo;
> > +	struct nfs_server *nfss;
> >  
> >  	while (!list_empty(&data->pages)) {
> >  		req = nfs_list_entry(data->pages.next);
> > @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> >  	next:
> >  		nfs_unlock_and_release_request(req);
> >  	}
> > +	nfss = NFS_SERVER(data->inode);
> > +	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> > +		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> > +
> >  	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> >  	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
> >  		nfs_commit_clear_lock(NFS_I(data->inode));
> > 
> > 
> 
> 

Attachment:
signature.asc

Description: PGP signature