On Thu, 18 Sep 2014 08:01:07 -0400 Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx> wrote: > On Thu, 18 Sep 2014 16:03:17 +1000 > NeilBrown <neilb@xxxxxxx> wrote: > > > Support for loop-back mounted NFS filesystems is useful when NFS is > > used to access shared storage in a high-availability cluster. > > > > If the node running the NFS server fails, some other node can mount the > > filesystem and start providing NFS service. If that node already had > > the filesystem NFS mounted, it will now have it loop-back mounted. > > > > nfsd can suffer a deadlock when allocating memory and entering direct > > reclaim. > > While direct reclaim does not write to the NFS filesystem it can send > > and wait for a COMMIT through nfs_release_page(). > > > > This patch modifies nfs_release_page() to wait a limited time for the > > commit to complete - one second. If the commit doesn't complete > > in this time, nfs_release_page() will fail. This means it might now > > fail in some cases where it wouldn't before. These cases are only > > when 'gfp' includes '__GFP_WAIT'. > > > > nfs_release_page() is only called by try_to_release_page(), and that > > can only be called on an NFS page with required 'gfp' flags from > > - page_cache_pipe_buf_steal() in splice.c > > - shrink_page_list() in vmscan.c > > - invalidate_inode_pages2_range() in truncate.c > > > > The first two handle failure quite safely. The last is only called > > after ->launder_page() has been called, and that will have waited > > for the commit to finish already. > > > > So aborting if the commit takes longer than 1 second is perfectly safe. > > > > If nfs_release_page() is called on a sequence of pages which are all > > in the same file which is blocked on COMMIT, each page could > > contribute a 1 second delay which could be come excessive. I have > > seen delays of as much as 208 seconds. > > > > To keep the delay to one second, the bdi is marked as write-congested > > if the commit didn't finished. Once it does finish, the > > write-congested flag will be cleared. > > > > With this, the longest total delay in try_to_free_pages that I have > > seen in under 3 seconds. With no waiting in nfs_release_page at all > > I have seen delays of nearly 1.5 seconds. > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx> > > --- > > fs/nfs/file.c | 30 ++++++++++++++++++++---------- > > fs/nfs/write.c | 7 +++++++ > > 2 files changed, 27 insertions(+), 10 deletions(-) > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c > > index 524dd80d1898..febba950d8a6 100644 > > --- a/fs/nfs/file.c > > +++ b/fs/nfs/file.c > > @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp) > > > > dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page); > > > > - /* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not > > - * doing this memory reclaim for a fs-related allocation. > > + /* Always try to initiate a 'commit' if relevant, but only > > + * wait for it if __GFP_WAIT is set and the calling process is > > + * allowed to block. Even then, only wait 1 second and only > > + * if the 'bdi' is not congested. > > + * Waiting indefinitely can cause deadlocks when the NFS > > + * server is on this machine, and there is no particular need > > + * to wait extensively here. A short wait has the benefit > > + * that someone else can worry about the freezer. > > */ > > - if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL && > > - !(current->flags & PF_FSTRANS)) { > > - int how = FLUSH_SYNC; > > - > > - /* Don't let kswapd deadlock waiting for OOM RPC calls */ > > - if (current_is_kswapd()) > > - how = 0; > > - nfs_commit_inode(mapping->host, how); > > + if (mapping) { > > + struct nfs_server *nfss = NFS_SERVER(mapping->host); > > + nfs_commit_inode(mapping->host, 0); > > + if ((gfp & __GFP_WAIT) && > > + !current_is_kswapd() && > > + !(current->flags & PF_FSTRANS) && > > + !bdi_write_congested(&nfss->backing_dev_info)) > > + wait_on_page_bit_killable_timeout(page, PG_private, > > + HZ); > > + if (PagePrivate(page)) > > + set_bdi_congested(&nfss->backing_dev_info, > > + BLK_RW_ASYNC); > > I've never had a great feel for the BDI congestion stuff, but won't > this have some unintended effects? > > For instance, suppose the VM decides to try to free this page and > passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the > COMMIT, but don't wait for it. The COMMIT is actually going to go > reasonably fast, but we now set the BDI congested because we didn't > wait for it to occur. > > That in turn causes writeout for other inodes on this BDI to get > throttled even though there really is no congestion. It just looks that > way due to how releasepage got called. > > Am I making mountains out of molehills here? Excellent molehill - thanks :-) I was being lazy. The 'if (PagePrivate())' should really be inside the other if statement with the wait_on_page_bit...(). I've moved it there. Once I get an Ack for the mm bits I'll report it all. Thanks, NeilBrown (Molehills are worse for your suspension than mountains!) > > > } > > /* If PagePrivate() is set, then the page is not freeable */ > > if (PagePrivate(page)) > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > index 175d5d073ccf..3066c7fcb565 100644 > > --- a/fs/nfs/write.c > > +++ b/fs/nfs/write.c > > @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req) > > if (likely(!PageSwapCache(head->wb_page))) { > > set_page_private(head->wb_page, 0); > > ClearPagePrivate(head->wb_page); > > + smp_mb__after_atomic(); > > + wake_up_page(head->wb_page, PG_private); > > clear_bit(PG_MAPPED, &head->wb_flags); > > } > > nfsi->npages--; > > @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data) > > struct nfs_page *req; > > int status = data->task.tk_status; > > struct nfs_commit_info cinfo; > > + struct nfs_server *nfss; > > > > while (!list_empty(&data->pages)) { > > req = nfs_list_entry(data->pages.next); > > @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data) > > next: > > nfs_unlock_and_release_request(req); > > } > > + nfss = NFS_SERVER(data->inode); > > + if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) > > + clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC); > > + > > nfs_init_cinfo(&cinfo, data->inode, data->dreq); > > if (atomic_dec_and_test(&cinfo.mds->rpcs_out)) > > nfs_commit_clear_lock(NFS_I(data->inode)); > > > > > >
Attachment:
signature.asc
Description: PGP signature