On Fri, 2022-02-25 at 09:44 -0500, Benjamin Coddington wrote: > On 25 Feb 2022, at 8:10, Trond Myklebust wrote: > > > On Fri, 2022-02-25 at 06:38 -0500, Benjamin Coddington wrote: > > > On 24 Feb 2022, at 22:51, Trond Myklebust wrote: > > > > > > > On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote: > > > > > On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote: > > > > > > On 23 Feb 2022, at 16:12, trondmy@xxxxxxxxxx wrote: > > > > > > > > > > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> > > > > > > > > > > > > > > Use the change attribute and the first cookie in a > > > > > > > directory > > > > > > > page > > > > > > > cache > > > > > > > entry to validate that the page is up to date. > > > > > > > > > > > > > > Suggested-by: Benjamin Coddington <bcodding@xxxxxxxxxx> > > > > > > > Signed-off-by: Trond Myklebust > > > > > > > <trond.myklebust@xxxxxxxxxxxxxxx> > > > > > > > --- > > > > > > > fs/nfs/dir.c | 68 > > > > > > > ++++++++++++++++++++++++++++------------------------ > > > > > > > 1 file changed, 37 insertions(+), 31 deletions(-) > > > > > > > > > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > > > > > > > index f2258e926df2..5d9367d9b651 100644 > > > > > > > --- a/fs/nfs/dir.c > > > > > > > +++ b/fs/nfs/dir.c > > > > > > > @@ -139,6 +139,7 @@ struct nfs_cache_array_entry { > > > > > > > }; > > > > > > > > > > > > > > struct nfs_cache_array { > > > > > > > + u64 change_attr; > > > > > > > u64 last_cookie; > > > > > > > unsigned int size; > > > > > > > unsigned char page_full : 1, > > > > > > > @@ -175,7 +176,8 @@ static void > > > > > > > nfs_readdir_array_init(struct > > > > > > > nfs_cache_array *array) > > > > > > > memset(array, 0, sizeof(struct nfs_cache_array)); > > > > > > > } > > > > > > > > > > > > > > -static void nfs_readdir_page_init_array(struct page > > > > > > > *page, > > > > > > > u64 > > > > > > > last_cookie) > > > > > > > +static void nfs_readdir_page_init_array(struct page > > > > > > > *page, > > > > > > > u64 > > > > > > > last_cookie, > > > > > > > + u64 > > > > > > > change_attr) > > > > > > > { > > > > > > > struct nfs_cache_array *array; > > > > > > > > > > > > > > > > > > There's a hunk missing here, something like: > > > > > > > > > > > > @@ -185,6 +185,7 @@ static void > > > > > > nfs_readdir_page_init_array(struct > > > > > > page > > > > > > *page, u64 last_cookie, > > > > > > nfs_readdir_array_init(array); > > > > > > array->last_cookie = last_cookie; > > > > > > array->cookies_are_ordered = 1; > > > > > > + array->change_attr = change_attr; > > > > > > kunmap_atomic(array); > > > > > > } > > > > > > > > > > > > > > > > > > > > @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64 > > > > > > > last_cookie, > > > > > > > gfp_t gfp_flags) > > > > > > > { > > > > > > > struct page *page = alloc_page(gfp_flags); > > > > > > > if (page) > > > > > > > - nfs_readdir_page_init_array(page, > > > > > > > last_cookie); > > > > > > > + nfs_readdir_page_init_array(page, > > > > > > > last_cookie, > > > > > > > 0); > > > > > > > return page; > > > > > > > } > > > > > > > > > > > > > > @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct > > > > > > > nfs_entry > > > > > > > *entry, struct page *page) > > > > > > > return ret; > > > > > > > } > > > > > > > > > > > > > > +static bool nfs_readdir_page_cookie_match(struct page > > > > > > > *page, > > > > > > > u64 > > > > > > > last_cookie, > > > > > > > + > > > > > > > u64 change_attr) > > > > > > > > > > > > How about "nfs_readdir_page_valid()"? There's more going > > > > > > on > > > > > > than a > > > > > > cookie match. > > > > > > > > > > > > > > > > > > > +{ > > > > > > > + struct nfs_cache_array *array = > > > > > > > kmap_atomic(page); > > > > > > > + int ret = true; > > > > > > > + > > > > > > > + if (array->change_attr != change_attr) > > > > > > > + ret = false; > > > > > > > > > > > > Can we skip the next test if ret = false? > > > > > > > > > > I'd expect the compiler to do that. > > > > > > > > > > > > > > > > > > + if (array->size > 0 && array->array[0].cookie != > > > > > > > last_cookie) > > > > > > > + ret = false; > > > > > > > + kunmap_atomic(array); > > > > > > > + return ret; > > > > > > > +} > > > > > > > + > > > > > > > +static void nfs_readdir_page_unlock_and_put(struct page > > > > > > > *page) > > > > > > > +{ > > > > > > > + unlock_page(page); > > > > > > > + put_page(page); > > > > > > > +} > > > > > > > + > > > > > > > static struct page *nfs_readdir_page_get_locked(struct > > > > > > > address_space > > > > > > > *mapping, > > > > > > > pgoff_t > > > > > > > index, > > > > > > > u64 > > > > > > > last_cookie) > > > > > > > { > > > > > > > struct page *page; > > > > > > > + u64 change_attr; > > > > > > > > > > > > > > page = grab_cache_page(mapping, index); > > > > > > > - if (page && !PageUptodate(page)) { > > > > > > > - nfs_readdir_page_init_array(page, > > > > > > > last_cookie); > > > > > > > - if > > > > > > > (invalidate_inode_pages2_range(mapping, index > > > > > > > + > > > > > > > 1, -1) < 0) > > > > > > > - nfs_zap_mapping(mapping->host, > > > > > > > mapping); > > > > > > > - SetPageUptodate(page); > > > > > > > + if (!page) > > > > > > > + return NULL; > > > > > > > + change_attr = > > > > > > > inode_peek_iversion_raw(mapping->host); > > > > > > > + if (PageUptodate(page)) { > > > > > > > + if > > > > > > > (nfs_readdir_page_cookie_match(page, > > > > > > > last_cookie, > > > > > > > + > > > > > > > change_attr)) > > > > > > > + return page; > > > > > > > + nfs_readdir_clear_array(page); > > > > > > > > > > > > > > > > > > Why use i_version rather than nfs_save_change_attribute? > > > > > > Seems > > > > > > having a > > > > > > consistent value across the pachecache and dir_verifiers > > > > > > would > > > > > > help > > > > > > debugging, and we've already have a bunch of machinery > > > > > > around > > > > > > the > > > > > > change_attribute. > > > > > > > > > > The directory cache_change_attribute is not reported in > > > > > tracepoints > > > > > because it is a directory-specific field, so it's not as > > > > > useful > > > > > for > > > > > debugging. > > > > > > > > > > The inode change attribute is what we have traditionally used > > > > > for > > > > > determining cache consistency, and when to invalidate the > > > > > cache. > > > > > > > > I should probably elaborate a little further on the differences > > > > between > > > > the inode change attribute and the cache_change_attribute. > > > > > > > > One of the main reasons for introducing the latter was to have > > > > something that allows us to track changes to the directory, but > > > > to > > > > avoid forcing unnecessary revalidations of the dcache. > > > > > > > > What this means is that when we create or remove a file, and > > > > the > > > > pre/post-op attributes tell us that there were no third party > > > > changes > > > > to the directory, we update the dcache, but we do _not_ update > > > > the > > > > cache_change_attribute, because we know that the rest of the > > > > directory > > > > contents are valid, and so we don't have to revalidate the > > > > dentries. > > > > However in that case, we _do_ want to update the readdir cache > > > > to > > > > reflect the fact that an entry was added or deleted. While we > > > > could > > > > figure out how to remove an entry (at least for the case where > > > > the > > > > filesystem is case-sensitive), we do not know where the > > > > filesystem > > > > added the new file, or what cookies was assigned. > > > > > > > > This is why the inode change attribute is more appropriate for > > > > indexing > > > > the page cache pages. It reflects the cases where we want to > > > > revalidate > > > > the readdir cache, as opposed to the dcache. > > > > > > Ok, thanks for explaining this. > > > > > > I've noticed that you haven't responded about my concerns about > > > not > > > checking > > > the directory for changes with every v4 READDIR. For v3, we have > > > post-op > > > updates to the directory, but with v4 the directory can change > > > and > > > we'll > > > end up with entries in the cache that are marked with an old > > > change_attr. > > > > > > > Then they will be rejected by nfs_readdir_page_cookie_match() if a > > user > > looks up that page again after we've revalidated the change > > attribute > > on the directory. > > > > ...and note that NFSv4 does returns a struct change_info4 for all > > operations that change the directory, so we will update the change > > attribute in all those cases. > > I'm not worried about changes from the same client. > > > If the change is made on the server, well then we will detect it > > through the standard revalidation process that usually decides when > > to > > invalidate the directory page cache. > > The environments I'm concerned about are setup very frequently: they > look > like multiple NFS clients co-ordinating on a directory with millions > of > files. Some clients are adding files as they do work, other clients > are > then looking for those files by walking the directory entries to > validate > their existence. The systems that do this have a "very bad time" if > some > of them produce listings that are _dramatically_ and transiently > different > from a listing they produced before. > > That can happen really easily with what we've got here, and it can > create a > huge problem for these setups. And it won't be easily reproduceable, > and it > will be hard to find. It will cost everyone involved a lot of time > and > effort to track down, and we can fix it easily. > > > > I'm pretty positive that not checking for changes to the > > > directory > > > (not > > > sending GETATTR with READDIR) is going to create cases of double- > > > listed > > > and > > > truncated-listings for dirctory listers. Not handling those > > > cases > > > means > > > I'm > > > going to have some very unhappy customers that complain about > > > their > > > files > > > disappearing/reappearing on NFS. > > > > > > If you need me to prove that its an issue, I can take the time to > > > write > > > up > > > program that shows this problem. > > > > > > > If you label the page contents with an attribute that was retrieved > > _after_ the READDIR op, then you will introduce this as a problem > > for > > your customers. > > No the problem is already here, we're not introducing it. By > labeling > the > page contents with every call we're shifting the race window from the > client > where it's a very large window to the server where the window is > small. > > Its still possible, but *much* less likely. > > > The reason is that there is no atomicity between operations in a > > COMPOUND. Worse, the implementation of readdir in scalable modern > > systems, including Linux, does not even guarantee atomicity of the > > readdir operation itself. Instead each readdir entry is filled > > without > > holding any locks or preventing any changes to the directory or to > > the > > object itself. > > I understand all this, but its not a reason to make the problem > worse. > > > POSIX states very explicitly that if you're making changes to the > > directory after the call to opendir() or rewinddir(), then the > > behaviour w.r.t. whether that file appears in the readdir() call is > > unspecified. See > > https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html > > Yes, but again - just because the problem exists doesn't give us > reason > to > amplify it when we can easily make a better choice for almost no > cost. > > Here are my reasons for wanting the GETATTR added: > - it makes it *much* less likely for this problem to occur, with > the > minor > downside of decreased caching for unstable directories. > - it makes v3 and v4 readdir pagecache behavior consistent WRT > changing > directories. > > I spent a non-trivial amount of time working on this problem, and saw > this > exact issue appear. Its definitely something that's going to come > back > and > bite us if we don't fix it. > > How can I convince you? I've offered to produce a working example of > this > problem. Will you review those results? If I cannot convince you, I > feel > I'll have to pursue distro-specific changes for this work. Ben, the main cause of this kind of issue in the current code is the following line: /* * ctx->pos points to the dirent entry number. * *desc->dir_cookie has the cookie for the next entry. We have * to either find the entry with the appropriate number or * revalidate the cookie. */ if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) { res = nfs_revalidate_mapping(inode, file->f_mapping); if (res < 0) goto out; } That line protects the page cache against changes aften opendir(). It was introduced by Red Hat in commmit 07b5ce8ef2d8 in order to fix a claim of a severe performance problem. These patches _remove_ that protection, because we're now able to cope with more frequent revalidation without needing to restart directory reads from scratch. So no. Without further proof, I don't accept your claim that this patchset introduces a regression. I don't accept your claim that we are required to revalidate the change attribute on every readdir call. We can't do that for NFSv2 or NFSv3 (the latter offers a post_op attribute, not pre-op attribute) and as I already pointed out, there is nothing in POSIX that requires this. If you want to fork the Red Hat kernel over it, then that's your decision. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx