On Wed, 2020-11-11 at 14:53 -0500, Benjamin Coddington wrote:
> On 11 Nov 2020, at 12:34, Trond Myklebust wrote:
> 
> > On Wed, 2020-11-11 at 11:43 -0500, Benjamin Coddington wrote:
> > > On 9 Nov 2020, at 16:46, Trond Myklebust wrote:
> > > 
> > > > On Mon, 2020-11-09 at 16:41 -0500, Benjamin Coddington wrote:
> > > > > On 7 Nov 2020, at 9:03, trondmy@xxxxxxxxxx wrote:
> > > > > 
> > > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > > 
> > > > > > If the directory is changing, causing the page cache to get
> > > > > > invalidated while we are listing the contents, then the NFS
> > > > > > client is currently forced to read in the entire directory
> > > > > > contents from scratch, because it needs to perform a linear
> > > > > > search for the readdir cookie. While this is not an issue
> > > > > > for small directories, it does not scale to directories
> > > > > > with millions of entries.
> > > > > > 
> > > > > > In order to be able to deal with large directories that are
> > > > > > changing, add a heuristic to ensure that if the page cache
> > > > > > is empty, and we are searching for a cookie that is not the
> > > > > > zero cookie, we just default to performing uncached readdir.
> > > > > > 
> > > > > > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > > ---
> > > > > >  fs/nfs/dir.c | 17 +++++++++++++++++
> > > > > >  1 file changed, 17 insertions(+)
> > > > > > 
> > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > index 238872d116f7..d7a9efd31ecd 100644
> > > > > > --- a/fs/nfs/dir.c
> > > > > > +++ b/fs/nfs/dir.c
> > > > > > @@ -917,11 +917,28 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
> > > > > >  	return res;
> > > > > >  }
> > > > > > 
> > > > > > +static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
> > > > > > +{
> > > > > > +	struct address_space *mapping = desc->file->f_mapping;
> > > > > > +	struct inode *dir = file_inode(desc->file);
> > > > > > +	unsigned int dtsize = NFS_SERVER(dir)->dtsize;
> > > > > > +	loff_t size = i_size_read(dir);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Default to uncached readdir if the page cache is empty, and
> > > > > > +	 * we're looking for a non-zero cookie in a large directory.
> > > > > > +	 */
> > > > > > +	return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
> > > > > 
> > > > > inode size > dtsize is a little hand-wavy. We have a lot of
> > > > > customers trying to reverse-engineer nfs_readdir() behavior
> > > > > instead of reading the code, this is sure to drive them
> > > > > crazy.
> > > > > 
> > > > > That said, in the absence of an easy way to make it tunable,
> > > > > I don't have anything better to suggest.
> > > > > 
> > > > > Reviewed-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
> > > > 
> > > > Right. It is a heuristic, but I would expect that the directory
> > > > size is going to be somewhat proportional to the number of RPC
> > > > calls we need to perform to read it. That number again is
> > > > somewhat proportional to the dtsize.
> > > > 
> > > > IOW: The general idea is correct.
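
As an aside, to put rough numbers on the proportionality argument
above (this helper is purely illustrative and hypothetical, not part
of the patch; only NFS_SERVER(), dtsize and i_size_read() come from
the code being discussed):

/*
 * Illustrative only: reading the directory from scratch transfers
 * on the order of i_size bytes of directory data, and each READDIR
 * RPC returns at most dtsize bytes, so the RPC count scales with
 * size / dtsize. The patch's "size > dtsize" test then reads as
 * "re-filling the page cache would cost more than about one RPC".
 */
static unsigned long nfs_estimate_readdir_rpcs(struct inode *dir)
{
	return DIV_ROUND_UP(i_size_read(dir), NFS_SERVER(dir)->dtsize);
}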
> > > 
> > > I can agree with that, but I have another thought:
> > > 
> > > If the point of the heuristic is to allow a full listing to
> > > eventually complete, it should not be dependent on
> > > mapping->nrpages == 0. Otherwise, other processes can start
> > > filling the cache and we're back to the situation where filling
> > > the cache could take longer than acdirmax, and things eventually
> > > grind to a halt.
> > > 
> > > Flipping a bit on the context to remain uncached gives a better
> > > assurance that we can continue to make forward progress.
> > 
> > I disagree. The point of the page cache is to allow sharing of
> > information between processes where possible. If there are
> > multiple processes all trying to make progress, and one of them
> > starts filling the page cache from scratch, then why should we
> > not use that?
> 
> Because the process that starts filling the pagecache from scratch
> then forces the process that may be nearly finished listing the
> directory to start over, waiting for the page cache to be filled
> (or helping to fill it).
> 
> If the time taken to get to a certain offset/cookie exceeds the
> time the directory's attributes stay cached, we'll drop the
> pagecache; or, if we're perhaps using READDIRPLUS with many
> entries, we'll saturate the memory on the machine and start to
> reclaim it before we can ever finish. There are scenarios where
> forward progress becomes very slow.
> 
> Perhaps the onus is on me to whip up an example - I will do that.
> 
> > The alternative is not scaling to multiple processes.
> 
> The next process that comes along filling the pagecache will
> benefit the processes after it, and so on, until a page is evicted
> or the cache is lost, etc. The pagecache is still useful to
> multiple processes.
> 
> > > It's too bad we're stuck caching entries linearly. What
> > > challenges might exist if we tried to use an XArray to map
> > > directory position to cookie? I imagine we could implement this
> > > in a single XArray by using both position and cookie values as
> > > indices, differentiating between them using two of the three XA
> > > marks, and storing a structure to represent both. Also unclear
> > > is how we would handle the lifetime of the XArray, since we'd
> > > no longer be using the VM's pagecache management.
> > 
> > You might be able to speed up the first cookie lookup by having
> > an XArray that maps from a 64-bit cookie to a
> > nfs_cache_array_entry which contains the next cookie to look up.
> > However, that would only work on 64-bit systems, since XArrays
> > take an unsigned long index.
> 
> Yes, but I would like to allow processes to cache entries
> non-linearly.

You still have to play them back in linear fashion. If you're using
the cookie as a lookup key, then it would require you to look up
entries one at a time (i.e. look up the cookie to find an entry that
contains the cookie of the next entry to be looked up). It would be
slow...
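
Schematically, something like the following (the structure and
function names here are hypothetical, just to illustrate the
one-lookup-per-entry replay; xa_load() is the real XArray lookup):

struct nfs_cookie_entry {
	u64 cookie;		/* XArray index; truncated on 32-bit,
				   hence the 64-bit-only caveat above */
	u64 next_cookie;	/* cookie of the entry that follows */
	/* ... name, inode number, etc. ... */
};

/*
 * Replay the directory starting at 'cookie'. Each xa_load() only
 * tells us where the next entry lives, so this costs one lookup per
 * directory entry; there is no way to fetch a whole batch at once.
 */
static void nfs_replay_cookie_chain(struct xarray *xa, u64 cookie)
{
	struct nfs_cookie_entry *entry;

	while ((entry = xa_load(xa, cookie)) != NULL) {
		/* ... emit the entry to the reader here ... */
		cookie = entry->next_cookie;
	}
}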

> > Furthermore, you still need a way to map offsets to entries for
> > the case where we're not able to use cookies for lseek()
> > purposes. That's a linear search through the directory, which
> > would be horrible with an XArray of linked cookie values (so
> > you'd probably need a second XArray for that?).
> 
> There's xa_for_each_marked(), but it may not perform - I haven't
> looked at the implementation or tested it.

That looks up the directory in cookie order, not in directory order.
IOW: it might work OK for XFS, which appears to use ordered cookies,
but it will break badly with ext4, which uses hashed cookies.

> > Construction and teardown of that structure would be nasty for
> > large directories, since you have as many cookies as you have
> > entries in your directory. IOW: you'd have to tear down 127 times
> > as many XArray entries as we have now.
> > 
> > It is not obvious that we would be able to benefit from starting
> > at an arbitrary location and caching that data, since if the
> > directory changed, we'd have to read in the new data anyway.
> 
> The only case where it seems obvious is when a very long listing
> is about to complete, the pagecache is invalidated, and that plays
> out over and over again. This is the pain point for our customers
> that are migrating NFS workloads onto slower (more latent) cloud
> infrastructure.

In testing, I found that the current patchset performed just fine
w.r.t. the readdir count. The reason why I wasn't seeing huge
performance increases when doing an 'rm -rf', for instance, was that
there are 2 GETATTRs and 1 LOOKUP per unlink() call.

> > Memory management would need to be implemented somehow. You'd
> > need a shrinker for this tree that could intelligently prune it.
> 
> nod.. thanks for your thoughts on this.
> 
> Ben

-- 
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022

www.hammer.space