On Wed, 2020-11-11 at 14:53 -0500, Benjamin Coddington wrote:
> On 11 Nov 2020, at 12:34, Trond Myklebust wrote:
> 
> > On Wed, 2020-11-11 at 11:43 -0500, Benjamin Coddington wrote:
> > > On 9 Nov 2020, at 16:46, Trond Myklebust wrote:
> > > 
> > > > On Mon, 2020-11-09 at 16:41 -0500, Benjamin Coddington wrote:
> > > > > On 7 Nov 2020, at 9:03, trondmy@xxxxxxxxxx wrote:
> > > > > 
> > > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > > 
> > > > > > If the directory is changing, causing the page cache to get
> > > > > > invalidated while we are listing the contents, then the NFS
> > > > > > client is currently forced to read in the entire directory
> > > > > > contents from scratch, because it needs to perform a linear
> > > > > > search for the readdir cookie. While this is not an issue
> > > > > > for small directories, it does not scale to directories
> > > > > > with millions of entries.
> > > > > > 
> > > > > > In order to be able to deal with large directories that are
> > > > > > changing, add a heuristic to ensure that if the page cache
> > > > > > is empty, and we are searching for a cookie that is not the
> > > > > > zero cookie, we just default to performing uncached readdir.
> > > > > > 
> > > > > > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > > ---
> > > > > >  fs/nfs/dir.c | 17 +++++++++++++++++
> > > > > >  1 file changed, 17 insertions(+)
> > > > > > 
> > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > index 238872d116f7..d7a9efd31ecd 100644
> > > > > > --- a/fs/nfs/dir.c
> > > > > > +++ b/fs/nfs/dir.c
> > > > > > @@ -917,11 +917,28 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
> > > > > >  	return res;
> > > > > >  }
> > > > > > 
> > > > > > +static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
> > > > > > +{
> > > > > > +	struct address_space *mapping = desc->file->f_mapping;
> > > > > > +	struct inode *dir = file_inode(desc->file);
> > > > > > +	unsigned int dtsize = NFS_SERVER(dir)->dtsize;
> > > > > > +	loff_t size = i_size_read(dir);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Default to uncached readdir if the page cache is empty, and
> > > > > > +	 * we're looking for a non-zero cookie in a large directory.
> > > > > > +	 */
> > > > > > +	return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
> > > > > 
> > > > > inode size > dtsize is a little hand-wavy. We have a lot of
> > > > > customers trying to reverse-engineer nfs_readdir() behavior
> > > > > instead of reading the code, this is sure to drive them
> > > > > crazy.
> > > > > 
> > > > > That said, in the absence of an easy way to make it tunable,
> > > > > I don't have anything better to suggest.
> > > > > 
> > > > > Reviewed-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
> > > > 
> > > > Right. It is a heuristic, but I would expect that the directory
> > > > size is going to be somewhat proportional to the number of RPC
> > > > calls we need to perform to read it. That number again is
> > > > somewhat proportional to the dtsize.
> > > > 
> > > > IOW: The general idea is correct.
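
As an aside, to put rough numbers on the proportionality argument
above (this helper is purely illustrative and hypothetical, not part
of the patch; only NFS_SERVER(), dtsize and i_size_read() come from
the code being discussed):

/*
 * Illustrative only: reading the directory from scratch transfers
 * on the order of i_size bytes of directory data, and each READDIR
 * RPC returns at most dtsize bytes, so the RPC count scales with
 * size / dtsize. The patch's "size > dtsize" test then reads as
 * "re-filling the page cache would cost more than about one RPC".
 */
static unsigned long nfs_estimate_readdir_rpcs(struct inode *dir)
{
	return DIV_ROUND_UP(i_size_read(dir), NFS_SERVER(dir)->dtsize);
}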
> > > 
> > > I can agree with that, but I have another thought:
> > > 
> > > If the point of the heuristic is to allow a full listing to
> > > eventually complete, it should not be dependent on
> > > mapping->nrpages == 0. Otherwise, other processes can start
> > > filling the cache and we're back to the situation where filling
> > > the cache could take longer than acdirmax, and things eventually
> > > grind to a halt.
> > > 
> > > Flipping a bit on the context to remain uncached gives a better
> > > assurance that we can continue to make forward progress.
> > 
> > I disagree. The point of the page cache is to allow sharing of
> > information between processes where possible. If there are
> > multiple processes all trying to make progress, and one of them
> > starts filling the page cache from scratch, then why should we
> > not use that?
> 
> Because the process that starts filling the pagecache from scratch
> then forces the process that may be nearly finished listing the
> directory to start over, waiting for the page cache to be filled
> (or helping to fill it).
> 
> If the time taken to get to a certain offset/cookie exceeds the
> time the directory's attributes stay cached, we'll drop the
> pagecache; or, if we're perhaps using READDIRPLUS with many
> entries, we'll saturate the memory on the machine and start to
> reclaim it before we can ever finish. There are scenarios where
> forward progress becomes very slow.
> 
> Perhaps the onus is on me to whip up an example - I will do that.
> 
> > The alternative is not scaling to multiple processes.
> 
> The next process that comes along filling the pagecache will
> benefit the processes after it, and so on, until a page is evicted
> or the cache is lost, etc. The pagecache is still useful to
> multiple processes.
> 
> > > It's too bad we're stuck caching entries linearly. What
> > > challenges might exist if we tried to use an XArray to map
> > > directory position to cookie? I imagine we could implement this
> > > in a single XArray by using both position and cookie values as
> > > indices, differentiating between them using two of the three XA
> > > marks, and storing a structure to represent both. Also unclear
> > > is how we would handle the lifetime of the XArray, since we'd
> > > no longer be using the VM's pagecache management.
> > 
> > You might be able to speed up the first cookie lookup by having
> > an XArray that maps from a 64-bit cookie to a
> > nfs_cache_array_entry which contains the next cookie to look up.
> > However, that would only work on 64-bit systems, since XArrays
> > take an unsigned long index.
> 
> Yes, but I would like to allow processes to cache entries
> non-linearly.

You still have to play them back in linear fashion. If you're using
the cookie as a lookup key, then it would require you to look up
entries one at a time (i.e. look up the cookie to find an entry that
contains the cookie of the next entry to be looked up). It would be
slow...
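
Schematically, something like the following (the structure and
function names here are hypothetical, just to illustrate the
one-lookup-per-entry replay; xa_load() is the real XArray lookup):

struct nfs_cookie_entry {
	u64 cookie;		/* XArray index; truncated on 32-bit,
				   hence the 64-bit-only caveat above */
	u64 next_cookie;	/* cookie of the entry that follows */
	/* ... name, inode number, etc. ... */
};

/*
 * Replay the directory starting at 'cookie'. Each xa_load() only
 * tells us where the next entry lives, so this costs one lookup per
 * directory entry; there is no way to fetch a whole batch at once.
 */
static void nfs_replay_cookie_chain(struct xarray *xa, u64 cookie)
{
	struct nfs_cookie_entry *entry;

	while ((entry = xa_load(xa, cookie)) != NULL) {
		/* ... emit the entry to the reader here ... */
		cookie = entry->next_cookie;
	}
}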

> > Furthermore, you still need a way to map offsets to entries for
> > the case where we're not able to use cookies for lseek()
> > purposes. That's a linear search through the directory, which
> > would be horrible with an XArray of linked cookie values (so
> > you'd probably need a second XArray for that?).
> 
> There's xa_for_each_marked(), but it may not perform - I haven't
> looked at the implementation or tested it.

That looks up the directory in cookie order, not in directory order.
IOW: it might work OK for XFS, which appears to use ordered cookies,
but it will break badly with ext4, which uses hashed cookies.

> > Construction and teardown of that structure would be nasty for
> > large directories, since you have as many cookies as you have
> > entries in your directory. IOW: you'd have to tear down 127 times
> > as many XArray entries as we have now.
> > 
> > It is not obvious that we would be able to benefit from starting
> > at an arbitrary location and caching that data, since if the
> > directory changed, we'd have to read in the new data anyway.
> 
> The only case where it seems obvious is when a very long listing
> is about to complete, the pagecache is invalidated, and that plays
> out over and over again. This is the pain point for our customers
> that are migrating NFS workloads onto slower (more latent) cloud
> infrastructure.

In testing, I found that the current patchset performed just fine
w.r.t. the readdir count. The reason why I wasn't seeing huge
performance increases when doing an 'rm -rf', for instance, was that
there are 2 GETATTRs and 1 LOOKUP per unlink() call.

> > Memory management would need to be implemented somehow. You'd
> > need a shrinker for this tree that could intelligently prune it.
> 
> nod.. thanks for your thoughts on this.
> 
> Ben

-- 
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022

www.hammer.space