Re: [PATCH v9 3/5] NFS: Convert buffered read paths to use netfs when fscache is enabled

David Wysochanski <dwysocha@xxxxxxxxxx> · Sat, 29 Oct 2022 12:46:29 -0400




On Fri, Oct 28, 2022 at 12:59 PM Trond Myklebust <trondmy@xxxxxxxxxx> wrote:
>
> On Fri, 2022-10-28 at 07:50 -0400, David Wysochanski wrote:
> > On Thu, Oct 27, 2022 at 3:16 PM Trond Myklebust <trondmy@xxxxxxxxxx>
> > wrote:
> > >
> > > On Mon, 2022-10-17 at 06:52 -0400, Dave Wysochanski wrote:
> > > > Convert the NFS buffered read code paths to corresponding netfs
> > > > APIs,
> > > > but only when fscache is configured and enabled.
> > > >
> > > > The netfs API defines struct netfs_request_ops which must be
> > > > filled
> > > > in by the network filesystem.  For NFS, we only need to define 5
> > > > of
> > > > the functions, the main one being the issue_read() function.
> > > > The issue_read() function is called by the netfs layer when a
> > > > read
> > > > cannot be fulfilled locally, and must be sent to the server
> > > > (either
> > > > the cache is not active, or it is active but the data is not
> > > > available).
> > > > Once the read from the server is complete, netfs requires a call
> > > > to
> > > > netfs_subreq_terminated() which conveys either how many bytes
> > > > were
> > > > read
> > > > successfully, or an error.  Note that issue_read() is called with
> > > > a
> > > > structure, netfs_io_subrequest, which defines the IO requested,
> > > > and
> > > > contains a start and a length (both in bytes), and assumes the
> > > > underlying
> > > > netfs will return a either an error on the whole region, or the
> > > > number
> > > > of bytes successfully read.
> > > >
> > > > The NFS IO path is page based and the main APIs are the pgio APIs
> > > > defined
> > > > in pagelist.c.  For the pgio APIs, there is no way for the caller
> > > > to
> > > > know how many RPCs will be sent and how the pages will be broken
> > > > up
> > > > into underlying RPCs, each of which will have their own
> > > > completion
> > > > and
> > > > return code.  In contrast, netfs is subrequest based, a single
> > > > subrequest may contain multiple pages, and a single subrequest is
> > > > initiated with issue_read() and terminated with
> > > > netfs_subreq_terminated().
> > > > Thus, to utilze the netfs APIs, NFS needs some way to accommodate
> > > > the netfs API requirement on the single response to the whole
> > > > subrequest, while also minimizing disruptive changes to the NFS
> > > > pgio layer.
> > > >
> > > > The approach taken with this patch is to allocate a small
> > > > structure
> > > > for each nfs_netfs_issue_read() call, store the final error and
> > > > number
> > > > of bytes successfully transferred in the structure, and update
> > > > these
> > > > values
> > > > as each RPC completes.  The refcount on the structure is used as
> > > > a
> > > > marker
> > > > for the last RPC completion, is incremented in
> > > > nfs_netfs_read_initiate(),
> > > > and decremented inside nfs_netfs_read_completion(), when a
> > > > nfs_pgio_header
> > > > contains a valid pointer to the data.  On the final put (which
> > > > signals
> > > > the final outstanding RPC is complete) in
> > > > nfs_netfs_read_completion(),
> > > > call netfs_subreq_terminated() with either the final error value
> > > > (if
> > > > one or more READs complete with an error) or the number of bytes
> > > > successfully transferred (if all RPCs complete successfully).
> > > > Note
> > > > that when all RPCs complete successfully, the number of bytes
> > > > transferred
> > > > is capped to the length of the subrequest.  Capping the
> > > > transferred
> > > > length
> > > > to the subrequest length prevents "Subreq overread" warnings from
> > > > netfs.
> > > > This is due to the "aligned_len" in nfs_pageio_add_page(), and
> > > > the
> > > > corner case where NFS requests a full page at the end of the
> > > > file,
> > > > even when i_size reflects only a partial page (NFS overread).
> > > >
> > > > Signed-off-by: Dave Wysochanski <dwysocha@xxxxxxxxxx>
> > > > Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > >
> > >
> > > This is not doing what I asked for, which was to separate out the
> > > fscache functionality, so that we can call that if and when it is
> > > available.
> > >
> > I must have misunderstood then.
> >
> > The last feedback I have from you was that you wanted it to be
> > an opt-in feature, and it was a comment on a previous patch
> > to Kconfig.  I was proceeding the best I knew how, but
> > let me try to get back on track.
> >
> > > Instead, it is just wrapping the NFS requests inside netfs
> > > requests. As
> > > it stands, that means it is just duplicating information, and
> > > adding
> > > unnecessary overhead to the standard I/O path (extra allocations,
> > > extra
> > > indirect calls, and extra bloat to the inode).
> > >
> > I think I understand what you're saying but I'm not sure.  Let me
> > ask some clarifying questions.
> >
> > Are you objecting to the code when CONFIG_NFS_FSCACHE is
> > configured?  Or when it is not?  Or both?  I think you're objecting
> > when it's configured, but not enabled (we mount without 'fsc').
> > Am I right?
> >
> > Also, are you objecting to the design that to use fcache we now
> > have to use netfs, specifically:
> > - call into netfs via either netfs_read_folio or netfs_readahead
> > - if fscache is enabled, then the IO can be satisfied from fscache
> > - if fscache is not enabled, or some of the IO cannot be satisfied
> > from the cache, then NFS is called back via netfs_issue_read
> > and we use the normal NFS read pageio interface.  This requires
> > we call netfs_subreq_terminated() when all the RPCs complete,
> > which is the reason for the small changes to pagelist.c
>
> I'm objecting to any middle layer "solution" that adds overhead to the
> NFS I/O paths.
>
Got it.

> I'm willing to consider solutions that are specific only to the fscache
> use case (i.e. when the 'fsc' mount option is specified). However when
> I perform a normal NFS mount, and do I/O, then I don't want to see
> extra memory allocations, extra indirect calls and larger inode
> footprints.
>
> IOW: I want the code to optimise for the case of standard NFS, not for
> the case of 'NFS with cachefs additions'.
>
I agree completely.  Are you seeing extra memory allocations
happen on mounts without 'fsc' or is it more a concern or how
some of the patches look?  We should not be calling any netfs or
fscache code if 'fsc' is not on the mount and I don't see any in my
testing. So either there's a misunderstanding here, or there's a
bug I'm missing.

If fscache is not configured, then nfs_netfs_read_folio() and
nfs_netfs_readahead() is a wrapper that returns -ENOBUFS.
If it's configured but not enabled, then the checks for
netfs_inode(inode)->cache should skip over any netfs code.
But maybe there's a non-obvious bug you're seeing and
somehow netfs is still getting called?  Because I cannot
see netfs getting called if 'fsc' is not on the mount in my
tests.

int nfs_netfs_read_folio(struct file *file, struct folio *folio)
{
        if (!netfs_inode(folio_inode(folio))->cache)
                return -ENOBUFS;

        return netfs_read_folio(file, folio);
}

int nfs_netfs_readahead(struct readahead_control *ractl)
{
        struct inode *inode = ractl->mapping->host;

        if (!netfs_inode(inode)->cache)
                return -ENOBUFS;

        netfs_readahead(ractl);
        return 0;
}


> >
> > Can you be more specific as to the portions of the patch you don't
> > like
> > so I can move it in the right direction?
> >
> > This is from patch #2 which you didn't comment on.  I'm not sure
> > you're
> > ok with it though, since you mention "extra bloat to the inode".
> > Do you object to this even though it's wrapped in an
> > #ifdef CONFIG_NFS_FSCACHE?  If so, do you require no
> > extra size be added to nfs_inode?
> >
> > @@ -204,9 +208,11 @@ struct nfs_inode {
> >         __u64 write_io;
> >         __u64 read_io;
> >  #ifdef CONFIG_NFS_FSCACHE
> > -       struct fscache_cookie   *fscache;
> > -#endif
> > +       struct netfs_inode      netfs; /* netfs context and VFS inode
> > */
> > +#else
> >         struct inode            vfs_inode;
> > +#endif
> > +
>
> Ideally, I'd prefer no extra size. I can live with it up to a certain
> point, however for now NFS is not unconditionally opting into the netfs
> project. If we're to ever do that, then I want to see streamlined code
> for the standard I/O case.
>
Ok and understood about standard I/O case.

I was thinking how we might not increase the size, but I don't think
I can make it work.

I thought we could change to something like the below, without an
embedded struct inode:

@@ -204,9 +208,11 @@ struct nfs_inode {
        __u64 write_io;
        __u64 read_io;
 #ifdef CONFIG_NFS_FSCACHE
-       struct fscache_cookie   *fscache;
-#endif
+       struct netfs_inode      *netfs; /* netfs context and VFS inode */
+#else
        struct inode            vfs_inode;
+#endif
+

Then I would need to alloc/free a netfs_inode at the time of
nfs_inode initiation.  Unfortunately this has the issue that the NFS_I()
macro cannot work, because it requires an embedded "struct inode"
due to "container_of" use:

+#ifdef CONFIG_NFS_FSCACHE
+static inline struct inode *VFS_I(struct nfs_inode *nfsi)
+{
+       return &nfsi->netfs.inode;
+}
+static inline struct nfs_inode *NFS_I(const struct inode *inode)
+{
+       return container_of(inode, struct nfs_inode, netfs.inode);
+}
+#else
+static inline struct inode *VFS_I(struct nfs_inode *nfsi)
+{
+       return &nfsi->vfs_inode;
+}
 static inline struct nfs_inode *NFS_I(const struct inode *inode)
 {
        return container_of(inode, struct nfs_inode, vfs_inode);
 }
+#endif



> >
> >
> > Are you ok with the stub functions which are placed in fscache.h, and
> > when CONFIG_NFS_FSCACHE is not set, become either a no-op
> > or a 1-liner (nfs_netfs_readpage_release)?
> >
> >  #else /* CONFIG_NFS_FSCACHE */
> > +static inline void nfs_netfs_inode_init(struct nfs_inode *nfsi) {}
> > +static inline void nfs_netfs_initiate_read(struct nfs_pgio_header
> > *hdr) {}
> > +static inline void nfs_netfs_read_completion(struct nfs_pgio_header
> > *hdr) {}
> > +static inline void nfs_netfs_readpage_release(struct nfs_page *req)
> > +{
> > +       unlock_page(req->wb_page);
> > +}
> >  static inline void nfs_fscache_release_super_cookie(struct
> > super_block *sb) {}
> >  static inline void nfs_fscache_init_inode(struct inode *inode) {}
> >
> >
> > Do you object to the below?  If so, then do you want
> > #ifdef CONFIG_NFS_FSCACHE here?
> >
> > -- a/fs/nfs/inode.c
> > +++ b/fs/nfs/inode.c
> > @@ -2249,6 +2249,8 @@ struct inode *nfs_alloc_inode(struct
> > super_block *sb)
> >  #ifdef CONFIG_NFS_V4_2
> >         nfsi->xattr_cache = NULL;
> >  #endif
> > +       nfs_netfs_inode_init(nfsi);
> > +
> >         return VFS_I(nfsi);
> >  }
> >  EXPORT_SYMBOL_GPL(nfs_alloc_i
> > node);
> >
> >
> > Do you object to the changes in fs/nfs/read.c?  Specifically,
> > how about the below calls to netfs from nfs_read_folio and
> > nfs_readahead into equivalent netfs calls?  So when
> > NFS_CONFIG_FSCACHE is set, but fscache is not enabled
> > ('fsc' not on mount), these netfs functions do immediately call
> > netfs_alloc_request().  But I wonder if we could simply add a
> > check to see if fscache is enabled on the mount, and skip
> > over to satisfy what you want.  Am I understanding what you
> > want?
>
> Quite frankly, I'd prefer that we just split out the functionality that
> is needed from the netfs code so that it can be optimised. However I'm
> not interested enough in the cachefs functionality to work on that
> myself. ...and as I indicated above, I might be OK with opting into the
> netfs project, once the overhead can be made to disappear.
>
Understood.

If you think it makes more sense, I can move some of the nfs_netfs_*
functions into a netfs.c file as a starting point.  Or that can maybe
be done in a future patchset?

For now I was equating netfs and fscache together so we can
move on from the much older and single-page limiting fscache
interface that is likely to go away soon.

> >
> > @@ -355,6 +343,10 @@ int nfs_read_folio(struct file *file, struct
> > folio *folio)
> >         if (NFS_STALE(inode))
> >                 goto out_unlock;
> >
> > +       ret = nfs_netfs_read_folio(file, folio);
> > +       if (!ret)
> > +               goto out;
> > +
> >
> > @@ -405,6 +399,10 @@ void nfs_readahead(struct readahead_control
> > *ractl)
> >         if (NFS_STALE(inode))
> >                 goto out;
> >
> > +       ret = nfs_netfs_readahead(ractl);
> > +       if (!ret)
> > +               goto out;
> > +
> >
The above wrappers should prevent any additional overhead when fscache
is not enabled.  As far as I know these work to avoid calling netfs
when 'fsc' is not on the mount.

> >
> > And how about these calls from different points in the read
> > path to the earlier mentioned stub functions?
> >
> > @@ -110,20 +110,13 @@ EXPORT_SYMBOL_GPL(nfs_pageio_reset_read_mds);
> >
> >  static void nfs_readpage_release(struct nfs_page *req, int error)
> >  {
> > -       struct inode *inode = d_inode(nfs_req_openctx(req)->dentry);
> >         struct page *page = req->wb_page;
> >
> > -       dprintk("NFS: read done (%s/%llu %d@%lld)\n", inode->i_sb-
> > >s_id,
> > -               (unsigned long long)NFS_FILEID(inode), req->wb_bytes,
> > -               (long long)req_offset(req));
> > -
> >         if (nfs_error_is_fatal_on_server(error) && error != -
> > ETIMEDOUT)
> >                 SetPageError(page);
> > -       if (nfs_page_group_sync_on_bit(req, PG_UNLOCKPAGE)) {
> > -               if (PageUptodate(page))
> > -                       nfs_fscache_write_page(inode, page);
> > -               unlock_page(page);
> > -       }
> > +       if (nfs_page_group_sync_on_bit(req, PG_UNLOCKPAGE))
> > +               nfs_netfs_readpage_release(req);
> > +
>
> I'm not seeing the value of wrapping unlock_page(), no... That code is
> going to need to change when we move it to use folios natively anyway.
>
Ok, how about I make it conditional on whether fscache is configured
and enabled then, similar to the nfs_netfs_read_folio() and
nfs_netfs_readahead()?  Below is what that would look like.
I could inline the code in nfs_netfs_readpage_release() if you
think it would be clearer.

static void nfs_readpage_release(struct nfs_page *req, int error)
{
        struct page *page = req->wb_page;

        if (nfs_error_is_fatal_on_server(error) && error != -ETIMEDOUT)
                SetPageError(page);
        if (nfs_page_group_sync_on_bit(req, PG_UNLOCKPAGE))
#ifndef CONFIG_NFS_FSCACHE
                unlock_page(req->wb_page);
#else
                nfs_netfs_readpage_release(req);
#endif
        nfs_release_request(req);
}


void nfs_netfs_readpage_release(struct nfs_page *req)
{
    struct inode *inode = d_inode(nfs_req_openctx(req)->dentry);

    /*
     * If fscache is enabled, netfs will unlock pages.
     */
    if (netfs_inode(inode)->cache)
        return;

    unlock_page(req->wb_page);
}


> >         nfs_release_request(req);
> >  }
> >
> > @@ -177,6 +170,8 @@ static void nfs_read_completion(struct
> > nfs_pgio_header *hdr)
> >                 nfs_list_remove_request(req);
> >                 nfs_readpage_release(req, error);
> >         }
> > +       nfs_netfs_read_completion(hdr);
> > +
> >  out:
> >         hdr->release(hdr);
> >  }
> > @@ -187,6 +182,7 @@ static void nfs_initiate_read(struct
> > nfs_pgio_header *hdr,
> >                               struct rpc_task_setup *task_setup_data,
> > int how)
> >  {
> >         rpc_ops->read_setup(hdr, msg);
> > +       nfs_netfs_initiate_read(hdr);
> >         trace_nfs_initiate_read(hdr);
> >  }
> >
> >
> > Are you ok with these additions?  Something like this would
> > be required in the case of fscache configured and enabled,
> > because we could have some of the data in a read in
> > fscache, and some not.  That is the reason for the netfs
> > design, and why we need to be able to call the normal
> > NFS read IO path (netfs calls into issue_read, and we call
> > back via netfs_subreq_terminated)?
> >
> > @@ -101,6 +101,9 @@ struct nfs_pageio_descriptor {
> >         struct pnfs_layout_segment *pg_lseg;
> >         struct nfs_io_completion *pg_io_completion;
> >         struct nfs_direct_req   *pg_dreq;
> > +#ifdef CONFIG_NFS_FSCACHE
> > +       void                    *pg_netfs;
> > +#endif
> >
> > @@ -1619,6 +1619,9 @@ struct nfs_pgio_header {
> >         const struct nfs_rw_ops *rw_ops;
> >         struct nfs_io_completion *io_completion;
> >         struct nfs_direct_req   *dreq;
> > +#ifdef CONFIG_NFS_FSCACHE
> > +       void                    *netfs;
> > +#endif
> >
> >
> > And these additions to pagelist.c?
> >
> > @@ -68,6 +69,10 @@ void nfs_pgheader_init(struct
> > nfs_pageio_descriptor *desc,
> >         hdr->good_bytes = mirror->pg_count;
> >         hdr->io_completion = desc->pg_io_completion;
> >         hdr->dreq = desc->pg_dreq;
> > +#ifdef CONFIG_NFS_FSCACHE
> > +       if (desc->pg_netfs)
> > +               hdr->netfs = desc->pg_netfs;
> > +#endif
>
> Why the conditional?
>
Not really needed and I was thinking of removing it, so I'll do that.

> >
> >
> > @@ -846,6 +851,9 @@ void nfs_pageio_init(struct nfs_pageio_descriptor
> > *desc,
> >         desc->pg_lseg = NULL;
> >         desc->pg_io_completion = NULL;
> >         desc->pg_dreq = NULL;
> > +#ifdef CONFIG_NFS_FSCACHE
> > +       desc->pg_netfs = NULL;
> > +#endif
> >
> >
> > @@ -1360,6 +1369,9 @@ int nfs_pageio_resend(struct
> > nfs_pageio_descriptor *desc,
> >
> >         desc->pg_io_completion = hdr->io_completion;
> >         desc->pg_dreq = hdr->dreq;
> > +#ifdef CONFIG_NFS_FSCACHE
> > +       desc->pg_netfs = hdr->netfs;
> > +#endif
>
> Those all need wrapper functions instead of embedding #ifdefs.
>
Ok.



> >
> >
> > > My expectation is that the standard I/O path should have minimal
> > > overhead, and should certainly not increase the overhead that we
> > > already have. Will this be addressed in future iterations of these
> > > patches?
> > >
> >
> > I will do what I can to satisfy what you want, either by fixing up
> > this patch or follow-on patches.  Hopefully the above questions
> > will clarify the next steps.
> >
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>