On 25 Feb 2022, at 8:10, Trond Myklebust wrote:
On Fri, 2022-02-25 at 06:38 -0500, Benjamin Coddington wrote:
On 24 Feb 2022, at 22:51, Trond Myklebust wrote:
On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
On 23 Feb 2022, at 16:12, trondmy@xxxxxxxxxx wrote:
From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
Use the change attribute and the first cookie in a directory
page
cache
entry to validate that the page is up to date.
Suggested-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
Signed-off-by: Trond Myklebust
<trond.myklebust@xxxxxxxxxxxxxxx>
---
fs/nfs/dir.c | 68
++++++++++++++++++++++++++++------------------------
1 file changed, 37 insertions(+), 31 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f2258e926df2..5d9367d9b651 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
};
struct nfs_cache_array {
+ u64 change_attr;
u64 last_cookie;
unsigned int size;
unsigned char page_full : 1,
@@ -175,7 +176,8 @@ static void nfs_readdir_array_init(struct
nfs_cache_array *array)
memset(array, 0, sizeof(struct nfs_cache_array));
}
-static void nfs_readdir_page_init_array(struct page *page,
u64
last_cookie)
+static void nfs_readdir_page_init_array(struct page *page,
u64
last_cookie,
+ u64
change_attr)
{
struct nfs_cache_array *array;
There's a hunk missing here, something like:
@@ -185,6 +185,7 @@ static void
nfs_readdir_page_init_array(struct
page
*page, u64 last_cookie,
nfs_readdir_array_init(array);
array->last_cookie = last_cookie;
array->cookies_are_ordered = 1;
+ array->change_attr = change_attr;
kunmap_atomic(array);
}
@@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64
last_cookie,
gfp_t gfp_flags)
{
struct page *page = alloc_page(gfp_flags);
if (page)
- nfs_readdir_page_init_array(page,
last_cookie);
+ nfs_readdir_page_init_array(page,
last_cookie,
0);
return page;
}
@@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
nfs_entry
*entry, struct page *page)
return ret;
}
+static bool nfs_readdir_page_cookie_match(struct page *page,
u64
last_cookie,
+
u64 change_attr)
How about "nfs_readdir_page_valid()"? There's more going on
than a
cookie match.
+{
+ struct nfs_cache_array *array = kmap_atomic(page);
+ int ret = true;
+
+ if (array->change_attr != change_attr)
+ ret = false;
Can we skip the next test if ret = false?
I'd expect the compiler to do that.
+ if (array->size > 0 && array->array[0].cookie !=
last_cookie)
+ ret = false;
+ kunmap_atomic(array);
+ return ret;
+}
+
+static void nfs_readdir_page_unlock_and_put(struct page
*page)
+{
+ unlock_page(page);
+ put_page(page);
+}
+
static struct page *nfs_readdir_page_get_locked(struct
address_space
*mapping,
pgoff_t
index,
u64
last_cookie)
{
struct page *page;
+ u64 change_attr;
page = grab_cache_page(mapping, index);
- if (page && !PageUptodate(page)) {
- nfs_readdir_page_init_array(page,
last_cookie);
- if
(invalidate_inode_pages2_range(mapping, index
+
1, -1) < 0)
- nfs_zap_mapping(mapping->host,
mapping);
- SetPageUptodate(page);
+ if (!page)
+ return NULL;
+ change_attr =
inode_peek_iversion_raw(mapping->host);
+ if (PageUptodate(page)) {
+ if
(nfs_readdir_page_cookie_match(page,
last_cookie,
+
change_attr))
+ return page;
+ nfs_readdir_clear_array(page);
Why use i_version rather than nfs_save_change_attribute? Seems
having a
consistent value across the pachecache and dir_verifiers would
help
debugging, and we've already have a bunch of machinery around
the
change_attribute.
The directory cache_change_attribute is not reported in
tracepoints
because it is a directory-specific field, so it's not as useful
for
debugging.
The inode change attribute is what we have traditionally used for
determining cache consistency, and when to invalidate the cache.
I should probably elaborate a little further on the differences
between
the inode change attribute and the cache_change_attribute.
One of the main reasons for introducing the latter was to have
something that allows us to track changes to the directory, but to
avoid forcing unnecessary revalidations of the dcache.
What this means is that when we create or remove a file, and the
pre/post-op attributes tell us that there were no third party
changes
to the directory, we update the dcache, but we do _not_ update the
cache_change_attribute, because we know that the rest of the
directory
contents are valid, and so we don't have to revalidate the
dentries.
However in that case, we _do_ want to update the readdir cache to
reflect the fact that an entry was added or deleted. While we could
figure out how to remove an entry (at least for the case where the
filesystem is case-sensitive), we do not know where the filesystem
added the new file, or what cookies was assigned.
This is why the inode change attribute is more appropriate for
indexing
the page cache pages. It reflects the cases where we want to
revalidate
the readdir cache, as opposed to the dcache.
Ok, thanks for explaining this.
I've noticed that you haven't responded about my concerns about not
checking
the directory for changes with every v4 READDIR. For v3, we have
post-op
updates to the directory, but with v4 the directory can change and
we'll
end up with entries in the cache that are marked with an old
change_attr.
Then they will be rejected by nfs_readdir_page_cookie_match() if a
user
looks up that page again after we've revalidated the change attribute
on the directory.
...and note that NFSv4 does returns a struct change_info4 for all
operations that change the directory, so we will update the change
attribute in all those cases.
I'm not worried about changes from the same client.
If the change is made on the server, well then we will detect it
through the standard revalidation process that usually decides when to
invalidate the directory page cache.
The environments I'm concerned about are setup very frequently: they
look
like multiple NFS clients co-ordinating on a directory with millions of
files. Some clients are adding files as they do work, other clients are
then looking for those files by walking the directory entries to
validate
their existence. The systems that do this have a "very bad time" if
some
of them produce listings that are _dramatically_ and transiently
different
from a listing they produced before.
That can happen really easily with what we've got here, and it can
create a
huge problem for these setups. And it won't be easily reproduceable,
and it
will be hard to find. It will cost everyone involved a lot of time and
effort to track down, and we can fix it easily.
I'm pretty positive that not checking for changes to the directory
(not
sending GETATTR with READDIR) is going to create cases of double-
listed
and
truncated-listings for dirctory listers. Not handling those cases
means
I'm
going to have some very unhappy customers that complain about their
files
disappearing/reappearing on NFS.
If you need me to prove that its an issue, I can take the time to
write
up
program that shows this problem.
If you label the page contents with an attribute that was retrieved
_after_ the READDIR op, then you will introduce this as a problem for
your customers.
No the problem is already here, we're not introducing it. By labeling
the
page contents with every call we're shifting the race window from the
client
where it's a very large window to the server where the window is small.
Its still possible, but *much* less likely.
The reason is that there is no atomicity between operations in a
COMPOUND. Worse, the implementation of readdir in scalable modern
systems, including Linux, does not even guarantee atomicity of the
readdir operation itself. Instead each readdir entry is filled without
holding any locks or preventing any changes to the directory or to the
object itself.
I understand all this, but its not a reason to make the problem worse.
POSIX states very explicitly that if you're making changes to the
directory after the call to opendir() or rewinddir(), then the
behaviour w.r.t. whether that file appears in the readdir() call is
unspecified. See
https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
Yes, but again - just because the problem exists doesn't give us reason
to
amplify it when we can easily make a better choice for almost no cost.
Here are my reasons for wanting the GETATTR added:
- it makes it *much* less likely for this problem to occur, with the
minor
downside of decreased caching for unstable directories.
- it makes v3 and v4 readdir pagecache behavior consistent WRT
changing
directories.
I spent a non-trivial amount of time working on this problem, and saw
this
exact issue appear. Its definitely something that's going to come back
and
bite us if we don't fix it.
How can I convince you? I've offered to produce a working example of
this
problem. Will you review those results? If I cannot convince you, I
feel
I'll have to pursue distro-specific changes for this work.
Ben