Re: [PATCH 09/11] NFS: Improve performance of listing directories being modified

On Mon, Nov 2, 2020 at 4:30 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Mon, 2020-11-02 at 14:45 -0500, David Wysochanski wrote:
> > On Mon, Nov 2, 2020 at 12:31 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote:
> > > > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote:
> > > > > > A process can hang forever in 'ls -l' on a directory while the
> > > > > > directory is being modified, such as another NFS client adding
> > > > > > files to the directory.  The problem is seen specifically with
> > > > > > larger directories (I tested with 1 million entries) and/or
> > > > > > slower NFS server responses to READDIR.  If the combination of
> > > > > > the directory size and the server's READDIR response times is
> > > > > > such that the 'ls' process gets only partially through the
> > > > > > listing before the attribute cache expires (time exceeds
> > > > > > acdirmax), we drop the pagecache and have to re-fill it, and as
> > > > > > a result the process may never complete.  One could argue that
> > > > > > for larger directories acdirmin/acdirmax should be increased,
> > > > > > but it's not always possible to tune this effectively.
> > > > > >
> > > > > > The root cause of this problem is how the NFS readdir cache
> > > > > > currently works.  The main search function,
> > > > > > readdir_search_pagecache(), always starts searching at
> > > > > > page_index == 0 and cookie == 0, and for any page not in the
> > > > > > cache, fills in the page with entries obtained in a READDIR
> > > > > > NFS call.  If a page already exists, we proceed to
> > > > > > nfs_readdir_search_for_cookie(), which searches for the cookie
> > > > > > (pos) of the readdir call.  The search is O(n), where n is the
> > > > > > directory size before the cookie in question is found, and
> > > > > > every entry into nfs_readdir() pays this penalty, irrespective
> > > > > > of the current directory position (dir_context.pos).  The
> > > > > > search is expensive due to the opaque nature of readdir
> > > > > > cookies, and the fact that no mapping (hash) exists from
> > > > > > cookies to pages.  In the case of a directory being modified,
> > > > > > the above behavior can become an excessive penalty, since the
> > > > > > same process is forced to fill pages it may no longer be
> > > > > > interested in (the entries were passed in a previous
> > > > > > nfs_readdir call), and this can essentially lead to no forward
> > > > > > progress.
> > > > > >
> > > > > > To fix this problem, at the end of nfs_readdir(), save the
> > > > > > page_index corresponding to the directory position (cookie)
> > > > > > inside the process's nfs_open_dir_context.  Then on the next
> > > > > > entry into nfs_readdir(), use the saved page_index as the
> > > > > > starting search point rather than starting at page_index == 0.
> > > > > > Not only does this fix the problem of listing a directory
> > > > > > being modified, it also significantly improves performance in
> > > > > > the unmodified case, since no extra search penalty is paid on
> > > > > > each entry into nfs_readdir().
> > > > > >
> > > > > > In the case of lseek, since there is no hash or other mapping
> > > > > > from a cookie value to the page->index, just reset
> > > > > > nfs_open_dir_context.page_index to 0, which will reset the
> > > > > > search to the old behavior.
> > > > > >
> > > > > > Signed-off-by: Dave Wysochanski <dwysocha@xxxxxxxxxx>
> > > > > > ---
> > > > > >  fs/nfs/dir.c           | 8 +++++++-
> > > > > >  include/linux/nfs_fs.h | 1 +
> > > > > >  2 files changed, 8 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > index 52e06c8fc7cd..b266f505b521 100644
> > > > > > --- a/fs/nfs/dir.c
> > > > > > +++ b/fs/nfs/dir.c
> > > > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir
> > > > > >                 ctx->attr_gencount = nfsi->attr_gencount;
> > > > > >                 ctx->dir_cookie = 0;
> > > > > >                 ctx->dup_cookie = 0;
> > > > > > +               ctx->page_index = 0;
> > > > > >                 ctx->cred = get_cred(cred);
> > > > > >                 spin_lock(&dir->i_lock);
> > > > > >                 if (list_empty(&nfsi->open_files) &&
> > > > > > @@ -763,7 +764,7 @@ int find_and_lock_cache_page(nfs_readdir_descriptor_t *desc)
> > > > > >         return res;
> > > > > >  }
> > > > > >
> > > > > > -/* Search for desc->dir_cookie from the beginning of the page cache */
> > > > > > +/* Search for desc->dir_cookie starting at desc->page_index */
> > > > > >  static inline
> > > > > >  int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
> > > > > >  {
> > > > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
> > > > > >                 .ctx = ctx,
> > > > > >                 .dir_cookie = &dir_ctx->dir_cookie,
> > > > > >                 .plus = nfs_use_readdirplus(inode, ctx),
> > > > > > +               .page_index = dir_ctx->page_index,
> > > > > > +               .last_cookie = nfs_readdir_use_cookie(file) ? ctx->pos : 0,
> > > > > >         },
> > > > > >                         *desc = &my_desc;
> > > > > >         int res = 0;
> > > > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
> > > > > >  out:
> > > > > >         if (res > 0)
> > > > > >                 res = 0;
> > > > > > +       dir_ctx->page_index = desc->page_index;
> > > > > >         trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx->dir_cookie,
> > > > > >                                NFS_SERVER(inode)->dtsize, my_desc.plus, res);
> > > > > >         return res;
> > > > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
> > > > > >                 else
> > > > > >                         dir_ctx->dir_cookie = 0;
> > > > > >                 dir_ctx->duped = 0;
> > > > > > +               /* Force readdir_search_pagecache to start over */
> > > > > > +               dir_ctx->page_index = 0;
> > > > > >         }
> > > > > >         inode_unlock(inode);
> > > > > >         return offset;
> > > > > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > > > > > index a2c6455ea3fa..0e55c0154ccd 100644
> > > > > > --- a/include/linux/nfs_fs.h
> > > > > > +++ b/include/linux/nfs_fs.h
> > > > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context {
> > > > > >         __u64 dir_cookie;
> > > > > >         __u64 dup_cookie;
> > > > > >         signed char duped;
> > > > > > +       unsigned long   page_index;
> > > > > >  };
> > > > > >
> > > > > >  /*
> > > > >
> > > > > NACK. It makes no sense to store the page index as a cursor.
> > > > >
> > > >
> > > > A similar thing was done recently with:
> > > > 227823d2074d nfs: optimise readdir cache page invalidation
> > > >
> > >
> > > That's a very different thing. It is about discarding page data in
> > > order to force a re-read of the contents into cache.
> > >
> > Right - I only pointed it out because it is in effect a cursor
> > tracking the last access into the cache, but it's kept on a global
> > basis, not per process context.
> >
> > > What you're doing is basically trying to guess where the data is
> > > located, which might work in some cases where the directory is
> > > completely static, but if it shrinks (e.g. due to a few unlink() or
> > > rename() calls) so that you overshoot the cookie, then you can end
> > > up reading all the way to the end of the directory before doing an
> > > uncached readdir.
> > >
> > First, consider the unmodified (idle directory) scenario.  Today the
> > performance gets worse the larger the directory grows - do you see
> > why?  I tried to explain this in the cover letter and the patch
> > header, but maybe it's not clear.
> >
> > Second, in the modified scenario today the performance is very bad
> > because of the same problem - the cookie is reset and the process
> > needs to start over at cookie 0, repeating READDIRs.  But maybe
> > there's a specific scenario I'm not thinking of.
> >
> > The way I thought about this is that if you're in a heavily modified
> > scenario with a large directory and you're past the 'acdirmax' time,
> > you have to choose between either:
> > a) ignoring 'acdirmax' (this is what the NFSv3 patch did): even
> > though you know the cache expired, you keep going as though it did
> > not (at least until a different process starts a listing), or
> > b) honoring 'acdirmax' (dropping the pagecache), but continuing as
> > best you can based on the previous information rather than trying to
> > rebuild the cache before continuing.
> >
> > > IOW: This will have a detrimental effect on some workloads, which
> > > needs to be weighed up against the benefits.  I saw that you've
> > > tested with large directories, but what workloads were you testing
> > > on those directories?
> > >
> > I can definitely do further testing with any scenario you want, to
> > try to break it or find a pathological case.  So far I've tested the
> > reader ("ls -lf") in parallel with one of the following two writers:
> > 1) randomly add a file every 0.1s:
> > while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; touch $MNT2/file$i.bin; builtin sleep 0.1; done > /dev/null 2>&1 &
> > 2) randomly delete a file every 0.1s:
> > while true; do i=$((1 + RANDOM % $NUM_FILES)); echo $i; rm -f $MNT2/file$i; builtin sleep 0.1; done > /dev/null 2>&1 &
> >
> > In no case did I see it take longer or use more ops vs vanilla 5.9;
> > both the idle and the modified performance are better (measured in
> > seconds and ops) with this patch.  Below is a short summary.  Note
> > that in each case the first time/ops pair is with an idle directory,
> > and the second is with the modified directory.
> >
> > 5.9 (vanilla): random delete a file every 0.1s:
> > Ops increased from 4734 to 8834
> > Time increased from 23 to 44
> >
> > 5.9 (this patch): random delete a file every 0.1s:
> > Ops increased from 4697 to 4696
> > Time increased from 20 to 30
> >
> >
> > 5.9 (vanilla): random add a file every 0.1s:
> > Ops increased from 4734 to 9168
> > Time increased from 23 to 43
> >
> > 5.9 (this patch): random add a file every 0.1s:
> > Ops increased from 4697 to 4702
> > Time increased from 21 to 32
> >
>
> If you're not seeing any change in number of ops then those numbers are
> basically telling you that you're not seeing any cache invalidation.
> You should be seeing cache invalidation when you are creating and
> deleting files and are doing simultaneous readdirs.
>

No, I think you're misunderstanding, or we're not on the same page.
The difference is that with vanilla 5.9, when the invalidation occurs
the reader process has to go back to cookie 0 and pay the penalty of
refilling all the cache pages up to cookie 'N'.  With the patch this
is not the case - it continues on from cookie 'N' and does not pay
the penalty; the next reader does.
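
To make that concrete, here is a minimal userspace sketch of the idea
(purely illustrative - the entries-per-page, directory size, and cost
model are made up, and none of this is the actual fs/nfs/dir.c code):
each open directory context keeps a page_index cursor that the next
readdir call resumes from, instead of walking the pagecache from page
0 again.

/* cursor_sim.c -- toy model of the readdir pagecache walk. */
#include <stdio.h>

#define ENTRIES_PER_PAGE 128
#define DIR_ENTRIES      (1000 * 1000)
#define DIR_PAGES        (DIR_ENTRIES / ENTRIES_PER_PAGE)

struct open_dir_context {
	unsigned long page_index;	/* saved cursor, as in the patch */
};

/*
 * How many pages must be (re)filled to reach the target page after the
 * pagecache was dropped, given the page index the search starts from.
 * If the cursor overshot the target, fall back to a walk from page 0.
 */
static unsigned long pages_filled(unsigned long start, unsigned long target)
{
	return target >= start ? target - start + 1 : target + 1;
}

int main(void)
{
	struct open_dir_context ctx = { .page_index = 0 };
	unsigned long vanilla = 0, cursor = 0;
	unsigned long pos;

	/* Reader is at page 'pos' when acdirmax expires and the cache drops. */
	for (pos = 0; pos < DIR_PAGES; pos += 100) {
		/* Vanilla: restart the search at page 0 every time. */
		vanilla += pages_filled(0, pos);

		/* Patched: resume from the cursor saved on the last return. */
		cursor += pages_filled(ctx.page_index, pos);
		ctx.page_index = pos;	/* saved at the end of nfs_readdir() */
	}

	/* An lseek() would reset the cursor, restoring the old behavior. */
	ctx.page_index = 0;

	printf("pages filled, restart-at-0: %lu\n", vanilla);
	printf("pages filled, saved cursor: %lu\n", cursor);
	return 0;
}

Only the shape of the numbers matters: restarting at page 0 makes the
refill cost grow with how far into the directory the reader already
is, while the saved cursor keeps the cost per invalidation roughly
constant, at the price of possibly overshooting if the directory
shrinks.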

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>



