On Mon, Jul 01, 2024 at 08:22:56AM +1000, NeilBrown wrote: > On Sun, 30 Jun 2024, Chuck Lever wrote: > > Sorry, I guess I expected to have more time to learn about these > > patches before writing review comments. But if you want them to go > > in soon, I had better look more closely at them now. > > > > > > On Fri, Jun 28, 2024 at 05:10:59PM -0400, Mike Snitzer wrote: > > > Pass the stored cl_nfssvc_net from the client to the server as > > > > This is the only mention of cl_nfssvc_net I can find in this > > patch. I'm not sure what it is. Patch description should maybe > > provide some context. > > > > > > > first argument to nfsd_open_local_fh() to ensure the proper network > > > namespace is used for localio. > > > > Can the patch description say something about the distinct mount > > namespaces -- if the local application is running in a different > > container than the NFS server, are we using only the network > > namespaces for authorizing the file access? And is that OK to do? > > If yes, patch description should explain that NFS local I/O ignores > > the boundaries of mount namespaces and why that is OK to do. > > > > > > > Signed-off-by: Weston Andros Adamson <dros@xxxxxxxxxxxxxxx> > > > Signed-off-by: Peng Tao <tao.peng@xxxxxxxxxxxxxxx> > > > Signed-off-by: Lance Shelton <lance.shelton@xxxxxxxxxxxxxxx> > > > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> > > > Signed-off-by: Mike Snitzer <snitzer@xxxxxxxxxx> > > > --- > > > fs/nfsd/Makefile | 1 + > > > fs/nfsd/filecache.c | 2 +- > > > fs/nfsd/localio.c | 239 ++++++++++++++++++++++++++++++++++++++++++++ > > > fs/nfsd/nfssvc.c | 1 + > > > fs/nfsd/trace.h | 3 +- > > > fs/nfsd/vfs.h | 9 ++ > > > 6 files changed, 253 insertions(+), 2 deletions(-) > > > create mode 100644 fs/nfsd/localio.c > > > > > > diff --git a/fs/nfsd/Makefile b/fs/nfsd/Makefile > > > index b8736a82e57c..78b421778a79 100644 > > > --- a/fs/nfsd/Makefile > > > +++ b/fs/nfsd/Makefile > > > @@ -23,3 +23,4 @@ nfsd-$(CONFIG_NFSD_PNFS) += nfs4layouts.o > > > nfsd-$(CONFIG_NFSD_BLOCKLAYOUT) += blocklayout.o blocklayoutxdr.o > > > nfsd-$(CONFIG_NFSD_SCSILAYOUT) += blocklayout.o blocklayoutxdr.o > > > nfsd-$(CONFIG_NFSD_FLEXFILELAYOUT) += flexfilelayout.o flexfilelayoutxdr.o > > > +nfsd-$(CONFIG_NFSD_LOCALIO) += localio.o > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index ad9083ca144b..99631fa56662 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -52,7 +52,7 @@ > > > #define NFSD_FILE_CACHE_UP (0) > > > > > > /* We only care about NFSD_MAY_READ/WRITE for this cache */ > > > -#define NFSD_FILE_MAY_MASK (NFSD_MAY_READ|NFSD_MAY_WRITE) > > > +#define NFSD_FILE_MAY_MASK (NFSD_MAY_READ|NFSD_MAY_WRITE|NFSD_MAY_LOCALIO) > > > > > > static DEFINE_PER_CPU(unsigned long, nfsd_file_cache_hits); > > > static DEFINE_PER_CPU(unsigned long, nfsd_file_acquisitions); > > > diff --git a/fs/nfsd/localio.c b/fs/nfsd/localio.c > > > new file mode 100644 > > > index 000000000000..759a5cb79652 > > > --- /dev/null > > > +++ b/fs/nfsd/localio.c > > > @@ -0,0 +1,239 @@ > > > +// SPDX-License-Identifier: GPL-2.0-only > > > +/* > > > + * NFS server support for local clients to bypass network stack > > > + * > > > + * Copyright (C) 2014 Weston Andros Adamson <dros@xxxxxxxxxxxxxxx> > > > + * Copyright (C) 2019 Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> > > > + * Copyright (C) 2024 Mike Snitzer <snitzer@xxxxxxxxxxxxxxx> > > > + */ > > > + > > > +#include <linux/exportfs.h> > > > +#include <linux/sunrpc/svcauth_gss.h> > > > +#include <linux/sunrpc/clnt.h> > > > +#include <linux/nfs.h> > > > +#include <linux/string.h> > > > + > > > +#include "nfsd.h" > > > +#include "vfs.h" > > > +#include "netns.h" > > > +#include "filecache.h" > > > + > > > +#define NFSDDBG_FACILITY NFSDDBG_FH > > > > With no more dprintk() call sites in this patch, you no longer need > > this macro definition. > > > > > > > +/* > > > + * We need to translate between nfs status return values and > > > + * the local errno values which may not be the same. > > > + * - duplicated from fs/nfs/nfs2xdr.c to avoid needless bloat of > > > + * all compiled nfs objects if it were in include/linux/nfs.h > > > + */ > > > +static const struct { > > > + int stat; > > > + int errno; > > > +} nfs_common_errtbl[] = { > > > + { NFS_OK, 0 }, > > > + { NFSERR_PERM, -EPERM }, > > > + { NFSERR_NOENT, -ENOENT }, > > > + { NFSERR_IO, -EIO }, > > > + { NFSERR_NXIO, -ENXIO }, > > > +/* { NFSERR_EAGAIN, -EAGAIN }, */ > > > + { NFSERR_ACCES, -EACCES }, > > > + { NFSERR_EXIST, -EEXIST }, > > > + { NFSERR_XDEV, -EXDEV }, > > > + { NFSERR_NODEV, -ENODEV }, > > > + { NFSERR_NOTDIR, -ENOTDIR }, > > > + { NFSERR_ISDIR, -EISDIR }, > > > + { NFSERR_INVAL, -EINVAL }, > > > + { NFSERR_FBIG, -EFBIG }, > > > + { NFSERR_NOSPC, -ENOSPC }, > > > + { NFSERR_ROFS, -EROFS }, > > > + { NFSERR_MLINK, -EMLINK }, > > > + { NFSERR_NAMETOOLONG, -ENAMETOOLONG }, > > > + { NFSERR_NOTEMPTY, -ENOTEMPTY }, > > > + { NFSERR_DQUOT, -EDQUOT }, > > > + { NFSERR_STALE, -ESTALE }, > > > + { NFSERR_REMOTE, -EREMOTE }, > > > +#ifdef EWFLUSH > > > + { NFSERR_WFLUSH, -EWFLUSH }, > > > +#endif > > > + { NFSERR_BADHANDLE, -EBADHANDLE }, > > > + { NFSERR_NOT_SYNC, -ENOTSYNC }, > > > + { NFSERR_BAD_COOKIE, -EBADCOOKIE }, > > > + { NFSERR_NOTSUPP, -ENOTSUPP }, > > > + { NFSERR_TOOSMALL, -ETOOSMALL }, > > > + { NFSERR_SERVERFAULT, -EREMOTEIO }, > > > + { NFSERR_BADTYPE, -EBADTYPE }, > > > + { NFSERR_JUKEBOX, -EJUKEBOX }, > > > + { -1, -EIO } > > > +}; > > > + > > > +/** > > > + * nfs_stat_to_errno - convert an NFS status code to a local errno > > > + * @status: NFS status code to convert > > > + * > > > + * Returns a local errno value, or -EIO if the NFS status code is > > > + * not recognized. nfsd_file_acquire() returns an nfsstat that > > > + * needs to be translated to an errno before being returned to a > > > + * local client application. > > > + */ > > > +static int nfs_stat_to_errno(enum nfs_stat status) > > > +{ > > > + int i; > > > + > > > + for (i = 0; nfs_common_errtbl[i].stat != -1; i++) { > > > + if (nfs_common_errtbl[i].stat == (int)status) > > > + return nfs_common_errtbl[i].errno; > > > + } > > > + return nfs_common_errtbl[i].errno; > > > +} > > > + > > > +static void > > > +nfsd_local_fakerqst_destroy(struct svc_rqst *rqstp) > > > +{ > > > + if (rqstp->rq_client) > > > + auth_domain_put(rqstp->rq_client); > > > + if (rqstp->rq_cred.cr_group_info) > > > + put_group_info(rqstp->rq_cred.cr_group_info); > > > + /* rpcauth_map_to_svc_cred_local() clears cr_principal */ > > > + WARN_ON_ONCE(rqstp->rq_cred.cr_principal != NULL); > > > + kfree(rqstp->rq_xprt); > > > + kfree(rqstp); > > > +} > > > + > > > +static struct svc_rqst * > > > +nfsd_local_fakerqst_create(struct net *net, struct rpc_clnt *rpc_clnt, > > > + const struct cred *cred) > > > +{ > > > + struct svc_rqst *rqstp; > > > + struct nfsd_net *nn = net_generic(net, nfsd_net_id); > > > + int status; > > > + > > > + /* FIXME: not running in nfsd context, must get reference on nfsd_serv */ > > > + if (unlikely(!READ_ONCE(nn->nfsd_serv))) > > > + return ERR_PTR(-ENXIO); > > > + > > > + rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL); > > > + if (!rqstp) > > > + return ERR_PTR(-ENOMEM); > > > + > > > + rqstp->rq_xprt = kzalloc(sizeof(*rqstp->rq_xprt), GFP_KERNEL); > > > + if (!rqstp->rq_xprt) { > > > + status = -ENOMEM; > > > + goto out_err; > > > + } > > > > struct svc_rqst is pretty big (like, bigger than a couple of pages). > > What happens if this allocation fails? > > > > And how often does it occur -- does that add significant overhead? > > > > > > > + > > > + rqstp->rq_xprt->xpt_net = net; > > > + __set_bit(RQ_SECURE, &rqstp->rq_flags); > > > + rqstp->rq_proc = 1; > > > + rqstp->rq_vers = 3; > > > > IMO these need to be symbolic constants, not integers. Or, at least > > there needs to be some documenting comments that explain these are > > fake and why that's OK to do. Or, are there better choices? > > > > > > > + rqstp->rq_prot = IPPROTO_TCP; > > > + rqstp->rq_server = nn->nfsd_serv; > > > + > > > + /* Note: we're connecting to ourself, so source addr == peer addr */ > > > + rqstp->rq_addrlen = rpc_peeraddr(rpc_clnt, > > > + (struct sockaddr *)&rqstp->rq_addr, > > > + sizeof(rqstp->rq_addr)); > > > + > > > + rpcauth_map_to_svc_cred_local(rpc_clnt->cl_auth, cred, &rqstp->rq_cred); > > > + > > > + /* > > > + * set up enough for svcauth_unix_set_client to be able to wait > > > + * for the cache downcall. Note that we do _not_ want to allow the > > > + * request to be deferred for later revisit since this rqst and xprt > > > + * are not set up to run inside of the normal svc_rqst engine. > > > + */ > > > + INIT_LIST_HEAD(&rqstp->rq_xprt->xpt_deferred); > > > + kref_init(&rqstp->rq_xprt->xpt_ref); > > > + spin_lock_init(&rqstp->rq_xprt->xpt_lock); > > > + rqstp->rq_chandle.thread_wait = 5 * HZ; > > > + > > > + status = svcauth_unix_set_client(rqstp); > > > + switch (status) { > > > + case SVC_OK: > > > + break; > > > + case SVC_DENIED: > > > + status = -ENXIO; > > > + goto out_err; > > > + default: > > > + status = -ETIMEDOUT; > > > + goto out_err; > > > + } > > > > Interesting. Why would svcauth_unix_set_client fail for a local I/O > > request? Wouldn't it only be because the local application is trying > > to open a file it doesn't have permission to? > > > > I'm beginning to think this section of code is the of the sort where you > need to be twice as clever when debugging as you where when writing. It > is trying to get the client to use interfaces written for server-side > actions, and it isn't a good fit. > > I think that instead we should modify fh_verify() so that it takes > explicit net, rq_vers, rq_cred, rq_client as well as the rqstp, and > the localio client passes in a NULL rqstp. Nit: I'd rather provide a new fh_verify-like API -- changing the synopsis of fh_verify() itself will result in a lot of code churn for only a single call site. > Getting the rq_client is an interesting challenge. > The above code (if I'm reading it correctly) gets the server-side > address of the IP connection, and passes that through to the sunrpc code > as though it is the client address. So as long as the server is > exporting to itself, and as long as no address translation is happening > on the path, this works. It feels messy though - and fragile. > > I would rather we had some rq_client (struct auth_domain) that was > dedicated to localio. The client should be able to access it based on > the fact that it could rather the server UUID using the LOCALIO RPC > protocol. > > I'm not sure what exactly this would look like, but the > 'struct auth_domain *' should be something that can be accessed > directly, not looked up in a cache. I'd like to mitigate the possibility of having to wait for a possible cache upcall, and reduce or remove the need for a phony svc_rqst. It sounds like you are on that path. Further, this needs to be clearly documented -- it's bypassing (or perhaps augmenting) the export's usual IP address-based authorization mechanism, so there are security considerations. > I can try to knock up a patch to allow fh_verify (and nfsd_file_acquire) > without an rqstp. I won't try the auth_domain change until I hear what > others think. -- Chuck Lever