Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 2022-02-25 at 09:44 -0500, Benjamin Coddington wrote:
> On 25 Feb 2022, at 8:10, Trond Myklebust wrote:
> 
> > On Fri, 2022-02-25 at 06:38 -0500, Benjamin Coddington wrote:
> > > On 24 Feb 2022, at 22:51, Trond Myklebust wrote:
> > > 
> > > > On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
> > > > > On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
> > > > > > On 23 Feb 2022, at 16:12, trondmy@xxxxxxxxxx wrote:
> > > > > > 
> > > > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > > > 
> > > > > > > Use the change attribute and the first cookie in a
> > > > > > > directory
> > > > > > > page
> > > > > > > cache
> > > > > > > entry to validate that the page is up to date.
> > > > > > > 
> > > > > > > Suggested-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
> > > > > > > Signed-off-by: Trond Myklebust
> > > > > > > <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > > > ---
> > > > > > >  fs/nfs/dir.c | 68
> > > > > > > ++++++++++++++++++++++++++++------------------------
> > > > > > >  1 file changed, 37 insertions(+), 31 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > > index f2258e926df2..5d9367d9b651 100644
> > > > > > > --- a/fs/nfs/dir.c
> > > > > > > +++ b/fs/nfs/dir.c
> > > > > > > @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
> > > > > > >  };
> > > > > > > 
> > > > > > >  struct nfs_cache_array {
> > > > > > > +       u64 change_attr;
> > > > > > >         u64 last_cookie;
> > > > > > >         unsigned int size;
> > > > > > >         unsigned char page_full : 1,
> > > > > > > @@ -175,7 +176,8 @@ static void
> > > > > > > nfs_readdir_array_init(struct
> > > > > > > nfs_cache_array *array)
> > > > > > >         memset(array, 0, sizeof(struct nfs_cache_array));
> > > > > > >  }
> > > > > > > 
> > > > > > > -static void nfs_readdir_page_init_array(struct page
> > > > > > > *page,
> > > > > > > u64
> > > > > > > last_cookie)
> > > > > > > +static void nfs_readdir_page_init_array(struct page
> > > > > > > *page,
> > > > > > > u64
> > > > > > > last_cookie,
> > > > > > > +                                       u64
> > > > > > > change_attr)
> > > > > > >  {
> > > > > > >         struct nfs_cache_array *array;
> > > > > > 
> > > > > > 
> > > > > > There's a hunk missing here, something like:
> > > > > > 
> > > > > > @@ -185,6 +185,7 @@ static void
> > > > > > nfs_readdir_page_init_array(struct
> > > > > > page
> > > > > > *page, u64 last_cookie,
> > > > > >          nfs_readdir_array_init(array);
> > > > > >          array->last_cookie = last_cookie;
> > > > > >          array->cookies_are_ordered = 1;
> > > > > > +       array->change_attr = change_attr;
> > > > > >          kunmap_atomic(array);
> > > > > >   }
> > > > > > 
> > > > > > > 
> > > > > > > @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64
> > > > > > > last_cookie,
> > > > > > > gfp_t gfp_flags)
> > > > > > >  {
> > > > > > >         struct page *page = alloc_page(gfp_flags);
> > > > > > >         if (page)
> > > > > > > -               nfs_readdir_page_init_array(page,
> > > > > > > last_cookie);
> > > > > > > +               nfs_readdir_page_init_array(page,
> > > > > > > last_cookie,
> > > > > > > 0);
> > > > > > >         return page;
> > > > > > >  }
> > > > > > > 
> > > > > > > @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
> > > > > > > nfs_entry
> > > > > > > *entry, struct page *page)
> > > > > > >         return ret;
> > > > > > >  }
> > > > > > > 
> > > > > > > +static bool nfs_readdir_page_cookie_match(struct page
> > > > > > > *page,
> > > > > > > u64
> > > > > > > last_cookie,
> > > > > > > +                                        
> > > > > > > u64 change_attr)
> > > > > > 
> > > > > > How about "nfs_readdir_page_valid()"?  There's more going
> > > > > > on
> > > > > > than a
> > > > > > cookie match.
> > > > > > 
> > > > > > 
> > > > > > > +{
> > > > > > > +       struct nfs_cache_array *array =
> > > > > > > kmap_atomic(page);
> > > > > > > +       int ret = true;
> > > > > > > +
> > > > > > > +       if (array->change_attr != change_attr)
> > > > > > > +               ret = false;
> > > > > > 
> > > > > > Can we skip the next test if ret = false?
> > > > > 
> > > > > I'd expect the compiler to do that.
> > > > > 
> > > > > > 
> > > > > > > +       if (array->size > 0 && array->array[0].cookie !=
> > > > > > > last_cookie)
> > > > > > > +               ret = false;
> > > > > > > +       kunmap_atomic(array);
> > > > > > > +       return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void nfs_readdir_page_unlock_and_put(struct page
> > > > > > > *page)
> > > > > > > +{
> > > > > > > +       unlock_page(page);
> > > > > > > +       put_page(page);
> > > > > > > +}
> > > > > > > +
> > > > > > >  static struct page *nfs_readdir_page_get_locked(struct
> > > > > > > address_space
> > > > > > > *mapping,
> > > > > > >                                                 pgoff_t
> > > > > > > index,
> > > > > > > u64
> > > > > > > last_cookie)
> > > > > > >  {
> > > > > > >         struct page *page;
> > > > > > > +       u64 change_attr;
> > > > > > > 
> > > > > > >         page = grab_cache_page(mapping, index);
> > > > > > > -       if (page && !PageUptodate(page)) {
> > > > > > > -               nfs_readdir_page_init_array(page,
> > > > > > > last_cookie);
> > > > > > > -               if
> > > > > > > (invalidate_inode_pages2_range(mapping, index
> > > > > > > +
> > > > > > > 1, -1) < 0)
> > > > > > > -                       nfs_zap_mapping(mapping->host,
> > > > > > > mapping);
> > > > > > > -               SetPageUptodate(page);
> > > > > > > +       if (!page)
> > > > > > > +               return NULL;
> > > > > > > +       change_attr =
> > > > > > > inode_peek_iversion_raw(mapping->host);
> > > > > > > +       if (PageUptodate(page)) {
> > > > > > > +               if
> > > > > > > (nfs_readdir_page_cookie_match(page,
> > > > > > > last_cookie,
> > > > > > > +                                                
> > > > > > > change_attr))
> > > > > > > +                       return page;
> > > > > > > +               nfs_readdir_clear_array(page);
> > > > > > 
> > > > > > 
> > > > > > Why use i_version rather than nfs_save_change_attribute? 
> > > > > > Seems
> > > > > > having a
> > > > > > consistent value across the pachecache and dir_verifiers
> > > > > > would
> > > > > > help
> > > > > > debugging, and we've already have a bunch of machinery
> > > > > > around
> > > > > > the
> > > > > > change_attribute.
> > > > > 
> > > > > The directory cache_change_attribute is not reported in
> > > > > tracepoints
> > > > > because it is a directory-specific field, so it's not as
> > > > > useful
> > > > > for
> > > > > debugging.
> > > > > 
> > > > > The inode change attribute is what we have traditionally used
> > > > > for
> > > > > determining cache consistency, and when to invalidate the
> > > > > cache.
> > > > 
> > > > I should probably elaborate a little further on the differences
> > > > between
> > > > the inode change attribute and the cache_change_attribute.
> > > > 
> > > > One of the main reasons for introducing the latter was to have
> > > > something that allows us to track changes to the directory, but
> > > > to
> > > > avoid forcing unnecessary revalidations of the dcache.
> > > > 
> > > > What this means is that when we create or remove a file, and
> > > > the
> > > > pre/post-op attributes tell us that there were no third party
> > > > changes
> > > > to the directory, we update the dcache, but we do _not_ update
> > > > the
> > > > cache_change_attribute, because we know that the rest of the
> > > > directory
> > > > contents are valid, and so we don't have to revalidate the
> > > > dentries.
> > > > However in that case, we _do_ want to update the readdir cache
> > > > to
> > > > reflect the fact that an entry was added or deleted. While we
> > > > could
> > > > figure out how to remove an entry (at least for the case where
> > > > the
> > > > filesystem is case-sensitive), we do not know where the
> > > > filesystem
> > > > added the new file, or what cookies was assigned.
> > > > 
> > > > This is why the inode change attribute is more appropriate for
> > > > indexing
> > > > the page cache pages. It reflects the cases where we want to
> > > > revalidate
> > > > the readdir cache, as opposed to the dcache.
> > > 
> > > Ok, thanks for explaining this.
> > > 
> > > I've noticed that you haven't responded about my concerns about
> > > not
> > > checking
> > > the directory for changes with every v4 READDIR.  For v3, we have
> > > post-op
> > > updates to the directory, but with v4 the directory can change
> > > and
> > > we'll
> > > end up with entries in the cache that are marked with an old
> > > change_attr.
> > > 
> > 
> > Then they will be rejected by nfs_readdir_page_cookie_match() if a 
> > user
> > looks up that page again after we've revalidated the change
> > attribute
> > on the directory.
> > 
> > ...and note that NFSv4 does returns a struct change_info4 for all
> > operations that change the directory, so we will update the change
> > attribute in all those cases.
> 
> I'm not worried about changes from the same client.
> 
> > If the change is made on the server, well then we will detect it
> > through the standard revalidation process that usually decides when
> > to
> > invalidate the directory page cache.
> 
> The environments I'm concerned about are setup very frequently: they 
> look
> like multiple NFS clients co-ordinating on a directory with millions
> of
> files.  Some clients are adding files as they do work, other clients
> are
> then looking for those files by walking the directory entries to 
> validate
> their existence.  The systems that do this have a "very bad time" if 
> some
> of them produce listings that are _dramatically_ and transiently 
> different
> from a listing they produced before.
> 
> That can happen really easily with what we've got here, and it can 
> create a
> huge problem for these setups.  And it won't be easily reproduceable,
> and it
> will be hard to find.  It will cost everyone involved a lot of time
> and
> effort to track down, and we can fix it easily.
> 
> > > I'm pretty positive that not checking for changes to the
> > > directory
> > > (not
> > > sending GETATTR with READDIR) is going to create cases of double-
> > > listed
> > > and
> > > truncated-listings for dirctory listers.  Not handling those
> > > cases
> > > means
> > > I'm
> > > going to have some very unhappy customers that complain about
> > > their
> > > files
> > > disappearing/reappearing on NFS.
> > > 
> > > If you need me to prove that its an issue, I can take the time to
> > > write
> > > up
> > > program that shows this problem.
> > > 
> > 
> > If you label the page contents with an attribute that was retrieved
> > _after_ the READDIR op, then you will introduce this as a problem
> > for
> > your customers.
> 
> No the problem is already here, we're not introducing it.  By
> labeling 
> the
> page contents with every call we're shifting the race window from the
> client
> where it's a very large window to the server where the window is
> small.
> 
> Its still possible, but *much* less likely.
> 
> > The reason is that there is no atomicity between operations in a
> > COMPOUND. Worse, the implementation of readdir in scalable modern
> > systems, including Linux, does not even guarantee atomicity of the
> > readdir operation itself. Instead each readdir entry is filled
> > without
> > holding any locks or preventing any changes to the directory or to
> > the
> > object itself.
> 
> I understand all this, but its not a reason to make the problem
> worse.
> 
> > POSIX states very explicitly that if you're making changes to the
> > directory after the call to opendir() or rewinddir(), then the
> > behaviour w.r.t. whether that file appears in the readdir() call is
> > unspecified. See
> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
> 
> Yes, but again - just because the problem exists doesn't give us
> reason 
> to
> amplify it when we can easily make a better choice for almost no
> cost.
> 
> Here are my reasons for wanting the GETATTR added:
>   - it makes it *much* less likely for this problem to occur, with
> the 
> minor
>     downside of decreased caching for unstable directories.
>   - it makes v3 and v4 readdir pagecache behavior consistent WRT 
> changing
>     directories.
> 
> I spent a non-trivial amount of time working on this problem, and saw
> this
> exact issue appear.  Its definitely something that's going to come
> back 
> and
> bite us if we don't fix it.
> 
> How can I convince you?  I've offered to produce a working example of
> this
> problem.  Will you review those results?  If I cannot convince you, I
> feel
> I'll have to pursue distro-specific changes for this work.

Ben, the main cause of this kind of issue in the current code is the
following line:

        /*
         * ctx->pos points to the dirent entry number.
         * *desc->dir_cookie has the cookie for the next entry. We have
         * to either find the entry with the appropriate number or
         * revalidate the cookie.
         */
        if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
                res = nfs_revalidate_mapping(inode, file->f_mapping);
                if (res < 0)
                        goto out;
        }


That line protects the page cache against changes aften opendir(). It
was introduced by Red Hat in commmit 07b5ce8ef2d8 in order to fix a
claim of a severe performance problem.

These patches _remove_ that protection, because we're now able to cope
with more frequent revalidation without needing to restart directory
reads from scratch.

So no. Without further proof, I don't accept your claim that this
patchset introduces a regression. I don't accept your claim that we are
required to revalidate the change attribute on every readdir call. We
can't do that for NFSv2 or NFSv3 (the latter offers a post_op
attribute, not pre-op attribute) and as I already pointed out, there is
nothing in POSIX that requires this.

If you want to fork the Red Hat kernel over it, then that's your
decision.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx






[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux