On Mon, 2020-11-02 at 11:26 -0500, David Wysochanski wrote: > On Mon, Nov 2, 2020 at 11:22 AM Trond Myklebust < > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > On Mon, 2020-11-02 at 08:50 -0500, Dave Wysochanski wrote: > > > A process can hang forever to 'ls -l' a directory while the > > > directory > > > is being modified such as another NFS client adding files to the > > > directory. The problem is seen specifically with larger > > > directories > > > (I tested with 1 million) and/or slower NFS server responses to > > > READDIR. If a combination of the NFS directory size, the NFS > > > server > > > responses to READDIR is such that the 'ls' process gets partially > > > through the listing before the attribute cache expires (time > > > exceeds acdirmax), we drop the pagecache and have to re-fill it, > > > and as a result, the process may never complete. One could argue > > > for larger directories the acdirmin/acdirmax should be increased, > > > but it's not always possible to tune this effectively. > > > > > > The root cause of this problem is due to how the NFS readdir > > > cache > > > currently works. The main search function, > > > readdir_search_pagecache(), > > > always starts searching at page_index and cookie == 0, and for > > > any > > > page not in the cache, fills in the page with entries obtained in > > > a READDIR NFS call. If a page already exists, we proceed to > > > nfs_readdir_search_for_cookie(), which searches for the cookie > > > (pos) of the readdir call. The search is O(n), where n is the > > > directory size before the cookie in question is found, and every > > > entry to nfs_readdir() pays this penalty, irrespective of the > > > current directory position (dir_context.pos). The search is > > > expensive due to the opaque nature of readdir cookies, and the > > > fact > > > that no mapping (hash) exists from cookies to pages. In the case > > > of a directory being modified, the above behavior can become an > > > excessive penalty, since the same process is forced to fill pages > > > it > > > may be no longer interested in (the entries were passed in a > > > previous > > > nfs_readdir call), and this can essentially lead no forward > > > progress. > > > > > > To fix this problem, at the end of nfs_readdir(), save the > > > page_index > > > corresponding to the directory position (cookie) inside the > > > process's > > > nfs_open_dir_context. Then at the next entry of nfs_readdir(), > > > use > > > the saved page_index as the starting search point rather than > > > starting > > > at page_index == 0. Not only does this fix the problem of > > > listing > > > a directory being modified, it also significantly improves > > > performance > > > in the unmodified case since no extra search penalty is paid at > > > each > > > entry to nfs_readdir(). > > > > > > In the case of lseek, since there is no hash or other mapping > > > from a > > > cookie value to the page->index, just reset > > > nfs_open_dir_context.page_index > > > to 0, which will reset the search to the old behavior. > > > > > > Signed-off-by: Dave Wysochanski <dwysocha@xxxxxxxxxx> > > > --- > > > fs/nfs/dir.c | 8 +++++++- > > > include/linux/nfs_fs.h | 1 + > > > 2 files changed, 8 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > > > index 52e06c8fc7cd..b266f505b521 100644 > > > --- a/fs/nfs/dir.c > > > +++ b/fs/nfs/dir.c > > > @@ -78,6 +78,7 @@ static struct nfs_open_dir_context > > > *alloc_nfs_open_dir_context(struct inode *dir > > > ctx->attr_gencount = nfsi->attr_gencount; > > > ctx->dir_cookie = 0; > > > ctx->dup_cookie = 0; > > > + ctx->page_index = 0; > > > ctx->cred = get_cred(cred); > > > spin_lock(&dir->i_lock); > > > if (list_empty(&nfsi->open_files) && > > > @@ -763,7 +764,7 @@ int > > > find_and_lock_cache_page(nfs_readdir_descriptor_t *desc) > > > return res; > > > } > > > > > > -/* Search for desc->dir_cookie from the beginning of the page > > > cache > > > */ > > > +/* Search for desc->dir_cookie starting at desc->page_index */ > > > static inline > > > int readdir_search_pagecache(nfs_readdir_descriptor_t *desc) > > > { > > > @@ -885,6 +886,8 @@ static int nfs_readdir(struct file *file, > > > struct > > > dir_context *ctx) > > > .ctx = ctx, > > > .dir_cookie = &dir_ctx->dir_cookie, > > > .plus = nfs_use_readdirplus(inode, ctx), > > > + .page_index = dir_ctx->page_index, > > > + .last_cookie = nfs_readdir_use_cookie(file) ? > > > ctx- > > > > pos : 0, > > > }, > > > *desc = &my_desc; > > > int res = 0; > > > @@ -938,6 +941,7 @@ static int nfs_readdir(struct file *file, > > > struct > > > dir_context *ctx) > > > out: > > > if (res > 0) > > > res = 0; > > > + dir_ctx->page_index = desc->page_index; > > > trace_nfs_readdir_exit(inode, ctx->pos, dir_ctx- > > > >dir_cookie, > > > NFS_SERVER(inode)->dtsize, > > > my_desc.plus, res); > > > return res; > > > @@ -975,6 +979,8 @@ static loff_t nfs_llseek_dir(struct file > > > *filp, > > > loff_t offset, int whence) > > > else > > > dir_ctx->dir_cookie = 0; > > > dir_ctx->duped = 0; > > > + /* Force readdir_search_pagecache to start over > > > */ > > > + dir_ctx->page_index = 0; > > > } > > > inode_unlock(inode); > > > return offset; > > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h > > > index a2c6455ea3fa..0e55c0154ccd 100644 > > > --- a/include/linux/nfs_fs.h > > > +++ b/include/linux/nfs_fs.h > > > @@ -93,6 +93,7 @@ struct nfs_open_dir_context { > > > __u64 dir_cookie; > > > __u64 dup_cookie; > > > signed char duped; > > > + unsigned long page_index; > > > }; > > > > > > /* > > > > NACK. It makes no sense to store the page index as a cursor. > > > > A similar thing was done recently with: > 227823d2074d nfs: optimise readdir cache page invalidation > That's a very different thing. It is about discarding page data in order to force a re-read of the contents into cache. What you're doing is basically trying to guess where the data is located. which might work in some cases where the directory is completely static, but if it shrinks (e.g. due to a few unlink() or rename() calls) so that you overshoot the cookie, then you can end up reading all the way to the end of the directory before doing an uncached readdir. IOW: This will have a detrimental effect for some workloads, which needs to be weighed up against the benefits. I saw that you've tested with large directories, but what workloads were you testing on those directories? -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx