Re: [PATCH v5 00/22] Readdir enhancements

David Wysochanski <dwysocha@xxxxxxxxxx> · Thu, 12 Nov 2020 06:41:41 -0500

On Wed, Nov 11, 2020 at 5:15 PM David Wysochanski <dwysocha@xxxxxxxxxx> wrote:
>
> On Tue, Nov 10, 2020 at 4:48 PM <trondmy@xxxxxxxxxx> wrote:
> >
> > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> >
> > The following patch series performs a number of cleanups on the readdir
> > code.
> > It also adds support for 1MB readdir RPC calls on-the-wire, and modifies
> > the caching code to ensure that we cache the entire contents of that
> > 1MB call (instead of discarding the data that doesn't fit into a single
> > page).
> > For filesystems that use ordered readdir cookie schemes (e.g. XFS), it
> > optimises searching for cookies in the client's page cache by skipping
> > over pages that contain cookie values that are not in the range we are
> > searching for.
> > Finally, it improves scalability when dealing with very large
> > directories by turning off caching when those directories are changing,
> > so as to avoid the need for a linear search on the client of the entire
> > directory when looking for the first entry pointed to by the current
> > file descriptor offset.
> >
> > v2: Fix the handling of the NFSv3/v4 directory verifier.
> > v3: Optimise searching when the readdir cookies are seen to be ordered.
> > v4: Optimise performance for large directories that are changing.
> >     Add in llseek dependency patches.
> > v5: Integrate Olga's patch for the READDIR security label handling.
> >     Record more entries in the uncached readdir case. Bump the max
> >     number of pages to 512, but allocate them on demand in case the
> >     readdir RPC call returns fewer entries.
> >
> > Olga Kornievskaia (1):
> >   NFSv4.2: condition READDIR's mask for security label based on LSM
> >     state
> >
> > Trond Myklebust (21):
> >   NFS: Remove unnecessary inode locking in nfs_llseek_dir()
> >   NFS: Remove unnecessary inode lock in nfs_fsync_dir()
> >   NFS: Ensure contents of struct nfs_open_dir_context are consistent
> >   NFS: Clean up readdir struct nfs_cache_array
> >   NFS: Clean up nfs_readdir_page_filler()
> >   NFS: Clean up directory array handling
> >   NFS: Don't discard readdir results
> >   NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
> >   NFS: Replace kmap() with kmap_atomic() in nfs_readdir_search_array()
> >   NFS: Simplify struct nfs_cache_array_entry
> >   NFS: Support larger readdir buffers
> >   NFS: More readdir cleanups
> >   NFS: nfs_do_filldir() does not return a value
> >   NFS: Reduce readdir stack usage
> >   NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
> >   NFS: Allow the NFS generic code to pass in a verifier to readdir
> >   NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir calls
> >   NFS: Improve handling of directory verifiers
> >   NFS: Optimisations for monotonically increasing readdir cookies
> >   NFS: Reduce number of RPC calls when doing uncached readdir
> >   NFS: Do uncached readdir when we're seeking a cookie in an empty page
> >     cache
> >
> >  fs/nfs/client.c         |   4 +-
> >  fs/nfs/dir.c            | 734 +++++++++++++++++++++++++---------------
> >  fs/nfs/inode.c          |   7 -
> >  fs/nfs/internal.h       |   6 -
> >  fs/nfs/nfs3proc.c       |  35 +-
> >  fs/nfs/nfs4proc.c       |  48 +--
> >  fs/nfs/proc.c           |  18 +-
> >  include/linux/nfs_fs.h  |   9 +-
> >  include/linux/nfs_xdr.h |  17 +-
> >  9 files changed, 541 insertions(+), 337 deletions(-)
> >
> > --
> > 2.28.0
> >
>
> I wrote a test script and ran this patchset against 5.10-rc2 in a
> variety of scenarios listing 1 million files and various directory
> modification scenarios. The test checked ops, runtime and cookie
> resets (via tcpdump).
>
> All runs performed very well on all scenarios I threw at it on all NFS
> versions (I tested 3, 4.0, 4.1, and 4.2) with fairly dramatic
> improvements vs the same runs on 5.10-rc2.   One scenario I found
> where a single 'ls -l' could take a long time (about 5 minutes, but
> not unbounded time) was when the directory was being modified with
> both file adds and removed, but this seemed due to other ops unrelated
> to readdir performance.   A second scenario that this patchset does
> not fix is the scenario where the directory is large enough to exceed
> the acdirmax, is being modified (adds and removes), and _multiple_
> processes are listing it.  In this scenario a lister can still have
> unbounded time, but the existing readdir code also has this problem
> and fixing that probably requires another patch / approach (such as
> reworking the structure of the cache or an approach such as the flag
> on per-process dir context).
>

I think I am mistaken in my previous email that said multiple listers
will not make forward progress.  After more testing I could not
reproduce any indefinite wait but it was many minutes, and I think
there is proof in the current code (patch 22) that an indefinite wait
should not occur with any process (multiple processes does not change
this).  Note that I did not trace this, but below is an attempt at a
more detailed explanation from the code.  The summary of the argument
is that periodically nfs_readdir_dont_search_cache() will be true for
one iteration of nfs_readdir(), and this ensures forward progress,
even if it may be slow.

Some notation for the below explanation.
Tu: Time to list unmodified (idle) directory
Ax: acdirmax
Assume Tu > acdirmax (time to list idle directory is larger than
acdirmax) and i_size_read(dir) > NFS_SERVER(dir)->dtsize
PL1: pid1 that is listing the directory
DC1, DC2 and DC3: values of nfs_open_dir_context.dir_cookie for PL1
PL1_dir_cookie: the current value of PL1's nfs_open_dir_context.dir_cookie

Then consider the following timeline, with "Tn" various points on a
timeline of PL1 listing a very large directory.

T0: PL1 starts listing directory with repeated calls to nfs_readdir()
T1: acdirmax is exceeded, and we drop the pagecache (mapping->nrpages
== 0); at this point, PL1_dir_cookie = DC1 != 0
T2: PL1 calls nfs_readdir and nfs_readdir_dont_search_cache() returns
true (due to mapping->nrpages == 0), so enters uncached_readdir();
thus, PL1_dir_cookie makes forward progress so PL1_dir_cookie is ahead
of DC1 and the exit of nfs_readdir()
T3: PL1 calls nfs_readdir() and nfs_readdir_dont_search_cache() is
false, so it must restart filling the pagecache from cookie == 0 (we
send READDIRs over the wire); at this point PL1_dir_cookie forward
progress is stalled, and PL1's calls to nfs_readdir() will have to
fetch entries and fill pages in the pagecache that does not pertain to
its listing (they've already been returned to userspace in a previous
nfs_readdir call)
T4: acdirmax is exceeded, and we drop the pagecache again, setting
mapping->nrpages == 0; at this point though, PL1_dir_cookie = DC2,
where DC2 is ahead of DC1 (we made forward progress last time)
T5: PL1 calls nfs_readdir and nfs_readdir_dont_search_cache() returns
true, so enters uncached_readdir(); thus, PL1_dir_cookie makes forward
progress again for this one call to nfs_readdir(), and PL1_dir_cookie
= DC3 is ahead of DC2

Thus, it seems PL1 will eventually complete, even if it is delayed a
bit, and even if it has to re-fill pages in the page cache that is
"extra work" that doesn't help PL1.  Forward progress is guaranteed
due to the periodic calls to nfs_readdir() when
nfs_readdir_dont_search_cache() returns true due to mapping->nrpages
== 0 portion of the condition inside nfs_readdir_dont_search_cache()
+static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
+{
+       struct address_space *mapping = desc->file->f_mapping;
+       struct inode *dir = file_inode(desc->file);
+       unsigned int dtsize = NFS_SERVER(dir)->dtsize;
+       loff_t size = i_size_read(dir);
+
+       /*
+        * Default to uncached readdir if the page cache is empty, and
+        * we're looking for a non-zero cookie in a large directory.
+        */
+       return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
+}

> I think these patches make NFS readdir dramatically better, and they
> fix a lot of underlying issues with the existing code, like the cookie
> resets when pagecache is dropped, single page rx data problem, etc.
> Thank you for doing these patches.
>
> Tested-by: Dave Wysochanski