Re: [PATCH v3] NFSv3: handle out-of-order write replies.

Anna Schumaker <anna@xxxxxxxxxx> · Wed, 15 Feb 2023 14:58:49 -0500

Hi Neil,

On Wed, Feb 15, 2023 at 12:13 AM NeilBrown <neilb@xxxxxxx> wrote:
>
>
> NFSv3 includes pre/post wcc attributes which allow the client to
> determine if all changes to the file have been made by the client
> itself, or if any might have been made by some other client.
>
> If there are gaps in the pre/post ctime sequence it must be assumed that
> some other client changed the file in that gap and the local cache must
> be suspect.  The next time the file is opened the cache should be
> invalidated.
>
> Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for
> close-to-open consistency violations") in linux 5.3 the Linux client has
> been triggering this invalidation.  The chunk in nfs_update_inode() in
> particularly triggers.
>
> Unfortunately Linux NFS assumes that all replies will be processed in
> the order sent, and will arrive in the order processed.  This is not
> true in general.  Consequently Linux NFS might ignore the wcc info in a
> WRITE reply because the reply is in response to a WRITE that was sent
> before some other request for which a reply has already been seen.  This
> is detected by Linux using the gencount tests in nfs_inode_attr_cmp().
>
> Also, when the gencount tests pass it is still possible that the request
> were processed on the server in a different order, and a gap seen in
> the ctime sequence might be filled in by a subsequent reply, so gaps
> should not immediately trigger delayed invalidation.
>
> The net result is that writing to a server and then reading the file
> back can result in going to the server for the read rather than serving
> it from cache - all because a couple of replies arrived out-of-order.
> This is a performance regression over kernels before 5.3, though the
> change in 5.3 is a correctness improvement.
>
> This has been seen with Linux writing to a Netapp server which
> occasionally re-orders requests.  In testing the majority of requests
> were in-order, but a few (maybe 2 or three at a time) could be
> re-ordered.
>
> This patch addresses the problem by recording any gaps seen in the
> pre/post ctime sequence and not triggering invalidation until either
> there are too many gaps to fit in the table, or until there are no more
> active writes and the remaining gaps cannot be resolved.
>
> We allocate a table of 16 gaps on demand.  If the allocation fails we
> revert to current behaviour which is of little cost as we are unlikely
> to be able to cache the writes anyway.
>
> In the table we store "start->end" pair when iversion is updated and
> "end<-start" pairs pre/post pairs reported by the server.  Usually these
> exactly cancel out and so nothing is stored.  When there are
> out-of-order replies we do store gaps and these will eventually be
> cancelled against later replies when this client is the only writer.
>
> If the final write is out-of-order there may be one gap remaining when
> the file is closed.  This will be noticed and if there is precisely on
> gap and if the iversion can be advanced to match it, then we do so.
>
> This patch makes no attempt to handle directories correctly.  The same
> problem potentially exists in the out-of-order replies to create/unlink
> requests can cause future lookup requires to be sent to the server
> unnecessarily.  A similar scheme using the same primitives could be used
> to notice and handle out-of-order replies.
>
> Signed-off-by: NeilBrown <neilb@xxxxxxx>
> ---
>  fs/nfs/inode.c         | 120 +++++++++++++++++++++++++++++++++++------
>  include/linux/nfs_fs.h |  47 ++++++++++++++++
>  2 files changed, 152 insertions(+), 15 deletions(-)
>
> Sorry for the new version so soon :-)
>
> This version removes the PRE_CHANGE_RAW flag - I realised that before
> clearing PRE_CHANGE I can simply record the pre/post info if relevant.
>
> I also now understand the ways in which directories are different and so
> make no attempt to address directories.  Change comment explains that
> there is room for improvement there, but it is a separate issue needing
> separate testing.
>
> Thanks,
> NeilBrown
>
>
>
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index e98ee7599eeb..8f1a78837ec9 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -208,11 +208,12 @@ void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)
>
>         nfsi->cache_validity |= flags;
>
> -       if (inode->i_mapping->nrpages == 0)
> -               nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA |
> -                                         NFS_INO_DATA_INVAL_DEFER);
> -       else if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
> -               nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
> +       if (inode->i_mapping->nrpages == 0) {
> +               nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
> +               nfs_ooo_clear(nfsi);
> +       } else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> +               nfs_ooo_clear(nfsi);
> +       }
>         trace_nfs_set_cache_invalid(inode, 0);
>  }
>  EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);
> @@ -677,9 +678,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
>         trace_nfs_size_truncate(inode, offset);
>         i_size_write(inode, offset);
>         /* Optimisation */
> -       if (offset == 0)
> -               NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA |
> -                               NFS_INO_DATA_INVAL_DEFER);
> +       if (offset == 0) {
> +               NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
> +               nfs_ooo_clear(NFS_I(inode));
> +       }
>         NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;
>
>         spin_unlock(&inode->i_lock);
> @@ -1101,7 +1103,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)
>
>         spin_lock(&inode->i_lock);
>         if (list_empty(&nfsi->open_files) &&
> -           (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
> +           nfs_ooo_test(nfsi))
>                 nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
>                                                      NFS_INO_REVAL_FORCED);
>         list_add_tail_rcu(&ctx->list, &nfsi->open_files);
> @@ -1345,8 +1347,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)
>
>         set_bit(NFS_INO_INVALIDATING, bitlock);
>         smp_wmb();
> -       nfsi->cache_validity &=
> -               ~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
> +       nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
> +       nfs_ooo_clear(nfsi);
>         spin_unlock(&inode->i_lock);
>         trace_nfs_invalidate_mapping_enter(inode);
>         ret = nfs_invalidate_mapping(inode, mapping);
> @@ -1808,6 +1810,74 @@ static int nfs_inode_finish_partial_attr_update(const struct nfs_fattr *fattr,
>         return 0;
>  }
>
> +static void nfs_ooo_merge(struct nfs_inode *nfsi,
> +                         u64 start, u64 end)
> +{
> +       int i, cnt;
> +
> +       if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
> +               /* No point merging anything */
> +               return;
> +       if (!S_ISREG(nfsi->vfs_inode.mode))

Um, should this be "vfs_inode.i_mode"? I'm also curious how you've
tested this, since the code won't compile without this change?

Thanks,
Anna

> +               /* We flush cached data for a directory on any change
> +                * because we cannot insert new entries into the cache,
> +                * or delete old entries.
> +                * So there is no point being concerned about out-of-order
> +                * replies.
> +                */
> +               return;
> +
> +       if (!nfsi->ooo) {
> +               nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC);
> +               if (!nfsi->ooo) {
> +                       nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
> +                       return;
> +               }
> +               nfsi->ooo->cnt = 0;
> +       }
> +
> +       /* add this range, merging if possible */
> +       cnt = nfsi->ooo->cnt;
> +       for (i = 0; i < cnt; i++) {
> +               if (end == nfsi->ooo->gap[i].start)
> +                       end = nfsi->ooo->gap[i].end;
> +               else if (start == nfsi->ooo->gap[i].end)
> +                       start = nfsi->ooo->gap[i].start;
> +               else
> +                       continue;
> +               /* Remove 'i' from table and loop to insert the new range */
> +               cnt -= 1;
> +               nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt];
> +               i = -1;
> +       }
> +       if (start != end) {
> +               if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) {
> +                       nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
> +                       kfree(nfsi->ooo);
> +                       nfsi->ooo = NULL;
> +                       return;
> +               }
> +               nfsi->ooo->gap[cnt].start = start;
> +               nfsi->ooo->gap[cnt].end = end;
> +               cnt += 1;
> +       }
> +       nfsi->ooo->cnt = cnt;
> +}
> +
> +static void nfs_ooo_record(struct nfs_inode *nfsi,
> +                          struct nfs_fattr *fattr)
> +{
> +       /* This reply was out-of-order, so record in the
> +        * pre/post change id, possibly cancelling
> +        * gaps created when iversion was jumpped forward.
> +        */
> +       if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) &&
> +           (fattr->valid & NFS_ATTR_FATTR_PRECHANGE))
> +               nfs_ooo_merge(nfsi,
> +                             fattr->change_attr,
> +                             fattr->pre_change_attr);
> +}
> +
>  static int nfs_refresh_inode_locked(struct inode *inode,
>                                     struct nfs_fattr *fattr)
>  {
> @@ -1818,8 +1888,12 @@ static int nfs_refresh_inode_locked(struct inode *inode,
>
>         if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode))
>                 ret = nfs_update_inode(inode, fattr);
> -       else if (attr_cmp == 0)
> -               ret = nfs_check_inode_attributes(inode, fattr);
> +       else {
> +               nfs_ooo_record(NFS_I(inode), fattr);
> +
> +               if (attr_cmp == 0)
> +                       ret = nfs_check_inode_attributes(inode, fattr);
> +       }
>
>         trace_nfs_refresh_inode_exit(inode, ret);
>         return ret;
> @@ -1910,6 +1984,8 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
>         if (attr_cmp < 0)
>                 return 0;
>         if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) {
> +               /* Record the pre/post change info before clearing PRECHANGE */
> +               nfs_ooo_record(NFS_I(inode), fattr);
>                 fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE
>                                 | NFS_ATTR_FATTR_PRESIZE
>                                 | NFS_ATTR_FATTR_PREMTIME
> @@ -2064,6 +2140,15 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
>
>         /* More cache consistency checks */
>         if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> +               if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 &&
> +                   nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) {
> +                       /* There is one remaining gap that hasn't been
> +                        * merged into iversion - do that now.
> +                        */
> +                       inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start);
> +                       kfree(nfsi->ooo);
> +                       nfsi->ooo = NULL;
> +               }
>                 if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>                         /* Could it be a race with writeback? */
>                         if (!(have_writers || have_delegation)) {
> @@ -2085,8 +2170,11 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
>                                 dprintk("NFS: change_attr change on server for file %s/%ld\n",
>                                                 inode->i_sb->s_id,
>                                                 inode->i_ino);
> -                       } else if (!have_delegation)
> -                               nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
> +                       } else if (!have_delegation) {
> +                               nfs_ooo_record(nfsi, fattr);
> +                               nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode),
> +                                             fattr->change_attr);
> +                       }
>                         inode_set_iversion_raw(inode, fattr->change_attr);
>                 }
>         } else {
> @@ -2240,6 +2328,7 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
>                 return NULL;
>         nfsi->flags = 0UL;
>         nfsi->cache_validity = 0UL;
> +       nfsi->ooo = NULL;
>  #if IS_ENABLED(CONFIG_NFS_V4)
>         nfsi->nfs4_acl = NULL;
>  #endif /* CONFIG_NFS_V4 */
> @@ -2252,6 +2341,7 @@ EXPORT_SYMBOL_GPL(nfs_alloc_inode);
>
>  void nfs_free_inode(struct inode *inode)
>  {
> +       kfree(NFS_I(inode)->ooo);
>         kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
>  }
>  EXPORT_SYMBOL_GPL(nfs_free_inode);
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index d92fdfd2444c..776afe09b63b 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -191,6 +191,39 @@ struct nfs_inode {
>         /* Open contexts for shared mmap writes */
>         struct list_head        open_files;
>
> +       /* Keep track of out-of-order replies.
> +        * The ooo array contains start/end pairs of
> +        * numbers from the changeid sequence when
> +        * the inode's iversion has been updated.
> +        * It also contains end/start pair (i.e. reverse order)
> +        * of sections of the changeid sequence that have
> +        * been seen in replies from the server.
> +        * Normally these should match and when both
> +        * A:B and B:A are found in ooo, they are both removed.
> +        * And if a reply with A:B causes an iversion update
> +        * of A:B, then neither are added.
> +        * When a reply has pre_change that doesn't match
> +        * iversion, then the changeid pair and any consequent
> +        * change in iversion ARE added.  Later replies
> +        * might fill in the gaps, or possibly a gap is caused
> +        * by a change from another client.
> +        * When a file or directory is opened, if the ooo table
> +        * is not empty, then we assume the gaps were due to
> +        * another client and we invalidate the cached data.
> +        *
> +        * We can only track a limited number of concurrent gaps.
> +        * Currently that limit is 16.
> +        * We allocate the table on demand.  If there is insufficient
> +        * memory, then we probably cannot cache the file anyway
> +        * so there is no loss.
> +        */
> +       struct {
> +               int cnt;
> +               struct {
> +                       u64 start, end;
> +               } gap[16];
> +       } *ooo;
> +
>  #if IS_ENABLED(CONFIG_NFS_V4)
>         struct nfs4_cached_acl  *nfs4_acl;
>          /* NFSv4 state */
> @@ -616,6 +649,20 @@ nfs_fileid_to_ino_t(u64 fileid)
>         return ino;
>  }
>
> +static inline void nfs_ooo_clear(struct nfs_inode *nfsi)
> +{
> +       nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
> +       kfree(nfsi->ooo);
> +       nfsi->ooo = NULL;
> +}
> +
> +static inline bool nfs_ooo_test(struct nfs_inode *nfsi)
> +{
> +       return (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER) ||
> +               (nfsi->ooo && nfsi->ooo->cnt > 0);
> +
> +}
> +
>  #define NFS_JUKEBOX_RETRY_TIME (5 * HZ)
>
>  /* We need to block new opens while a file is being unlinked.
> --
> 2.39.1
>
>