> On Mar 23, 2018, at 5:55 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>> On Mar 23, 2018, at 5:32 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>
>> If I understand correctly, in the v4 case you change the way the page
>> data is passed by using rq_arg.pages instead of a list of pages for each
>> write.
>>
>> I don't think that handles the case of compounds with multiple writes.
>>
>> If all the writes are contained in a page, then you may be OK, since the
>> wr_head iovec is all the information you need to pass the write data
>> from the xdr code to the proc code.
>>
>> But if there are multiple larger writes, you also need to know where the
>> page data is. You're depending on rq_arg.pages for that, but there's
>> only one of those per compound.
>
> I thought we were going to handle that case by chaining xdr_bufs --
> one per NFSv4 WRITE operation. There has to be some structured way
> to pass distinct write payloads up to the NFS server, otherwise
> direct placement is impossible.

Thinking on this a little more, I believe you are saying that the new
shared decoder is adequate for NFSv2 and NFSv3, but not for NFSv4.

Would you accept a patch that kept the NFSv2 and NFSv3 parts of this
patch but dropped the NFSv4-related hunks? Likewise for the symlink
decoder patch?

We can still use a transport call-out in the XDR decoders, just as we
have discussed. (I think I've misremembered our discussion about
chaining xdr_bufs.) It's just that the XDR decoder can't be shared
between NFSv4 and the NFSv2/NFSv3 code paths.

>> To catch the regression I think we'd need a test that sends a compound
>> with two greater-than-one-page writes and distinct write data and then
>> reads back the results to check they're correct. I don't think we have
>> one currently, though it wouldn't be hard to do in pynfs.
>>
>> It's honestly not an important case, but I'd rather not break it.
>
> I believe Solaris handles direct writev() using multiple WRITE
> operations per COMPOUND, but I could be wrong.
>
>> --b.
>>
>> On Mon, Mar 19, 2018 at 02:56:48PM -0400, Chuck Lever wrote:
>>> Move common code in NFSD's write arg decoders into a helper. The
>>> immediate benefit is reduction of code duplication and some nice
>>> micro-optimizations (see below).
>>>
>>> However, in the long term, this helper could perform a per-transport
>>> call-out to fill the rq_vec (say, using RDMA Reads).
>>>
>>> The legacy WRITE decoders and procs are changed to work like NFSv4,
>>> which constructs the rq_vec just before it is about to call
>>> vfs_writev.
>>>
>>> Why? Calling a transport call-out from the proc instead of the XDR
>>> decoder means that the incoming FH can be resolved to a particular
>>> filesystem and file. This would allow pages from the backing file to
>>> be presented to the transport to be filled, rather than presenting
>>> anonymous pages and copying or swapping them into the file's page
>>> cache later.
>>>
>>> I also prefer using the pages in rq_arg.pages, instead of pulling
>>> the data pages directly out of the rqstp::rq_pages array. This is
>>> currently the way the NFSv3 write decoder works, but the other two
>>> do not seem to take this approach. Fixing this removes the only
>>> reference to rq_pages found in NFSD, eliminating an NFSD assumption
>>> about how transports use the pages in rq_pages.
>>>
>>> Lastly, avoid setting up the first element of rq_vec as a zero-
>>> length buffer. This happens with an RDMA transport when a normal
>>> Read chunk is present because the data payload is in rq_arg's
>>> page list (none of it is in the head buffer).
>>>
>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> ---
>>> fs/nfsd/nfs3proc.c         |    8 ++++++--
>>> fs/nfsd/nfs3xdr.c          |   16 ++++------------
>>> fs/nfsd/nfs4proc.c         |   30 ++++++------------------------
>>> fs/nfsd/nfs4xdr.c          |    1 -
>>> fs/nfsd/nfsproc.c          |    9 +++++++--
>>> fs/nfsd/nfsxdr.c           |   14 ++------------
>>> fs/nfsd/xdr.h              |    2 +-
>>> fs/nfsd/xdr3.h             |    2 +-
>>> fs/nfsd/xdr4.h             |    1 -
>>> include/linux/sunrpc/svc.h |    2 ++
>>> net/sunrpc/svc.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
>>> 11 files changed, 71 insertions(+), 56 deletions(-)
>>>
>>> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
>>> index 1d0ce3c..2dd95eb 100644
>>> --- a/fs/nfsd/nfs3proc.c
>>> +++ b/fs/nfsd/nfs3proc.c
>>> @@ -192,6 +192,7 @@
>>> 	struct nfsd3_writeres *resp = rqstp->rq_resp;
>>> 	__be32	nfserr;
>>> 	unsigned long cnt = argp->len;
>>> +	unsigned int nvecs;
>>>
>>> 	dprintk("nfsd: WRITE(3)    %s %d bytes at %Lu%s\n",
>>> 				SVCFH_fmt(&argp->fh),
>>> @@ -201,9 +202,12 @@
>>>
>>> 	fh_copy(&resp->fh, &argp->fh);
>>> 	resp->committed = argp->stable;
>>> +	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
>>> +	if (!nvecs)
>>> +		RETURN_STATUS(nfserr_io);
>>> 	nfserr = nfsd_write(rqstp, &resp->fh, argp->offset,
>>> -			    rqstp->rq_vec, argp->vlen,
>>> -			    &cnt, resp->committed);
>>> +			    rqstp->rq_vec, nvecs, &cnt,
>>> +			    resp->committed);
>>> 	resp->count = cnt;
>>> 	RETURN_STATUS(nfserr);
>>> }
>>> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
>>> index 1a70581..e19fc5d 100644
>>> --- a/fs/nfsd/nfs3xdr.c
>>> +++ b/fs/nfsd/nfs3xdr.c
>>> @@ -391,7 +391,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>>> nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p)
>>> {
>>> 	struct nfsd3_writeargs *args = rqstp->rq_argp;
>>> -	unsigned int len, v, hdr, dlen;
>>> +	unsigned int len, hdr, dlen;
>>> 	u32 max_blocksize = svc_max_payload(rqstp);
>>> 	struct kvec *head = rqstp->rq_arg.head;
>>> 	struct kvec *tail = rqstp->rq_arg.tail;
>>> @@ -433,17 +433,9 @@ void fill_post_wcc(struct svc_fh *fhp)
>>> 		args->count = max_blocksize;
>>> 		len = args->len = max_blocksize;
>>> 	}
>>> -	rqstp->rq_vec[0].iov_base = (void*)p;
>>> -	rqstp->rq_vec[0].iov_len = head->iov_len - hdr;
>>> -	v = 0;
>>> -	while (len > rqstp->rq_vec[v].iov_len) {
>>> -		len -= rqstp->rq_vec[v].iov_len;
>>> -		v++;
>>> -		rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
>>> -		rqstp->rq_vec[v].iov_len = PAGE_SIZE;
>>> -	}
>>> -	rqstp->rq_vec[v].iov_len = len;
>>> -	args->vlen = v + 1;
>>> +
>>> +	args->first.iov_base = (void *)p;
>>> +	args->first.iov_len = head->iov_len - hdr;
>>> 	return 1;
>>> }
>>>
>>> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
>>> index 0df37e0..cf30544 100644
>>> --- a/fs/nfsd/nfs4proc.c
>>> +++ b/fs/nfsd/nfs4proc.c
>>> @@ -974,24 +974,6 @@ static __be32 nfsd4_do_lookupp(struct svc_rqst *rqstp, struct svc_fh *fh)
>>> 	return status;
>>> }
>>>
>>> -static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
>>> -{
>>> -	int i = 1;
>>> -	int buflen = write->wr_buflen;
>>> -
>>> -	vec[0].iov_base = write->wr_head.iov_base;
>>> -	vec[0].iov_len = min_t(int, buflen, write->wr_head.iov_len);
>>> -	buflen -= vec[0].iov_len;
>>> -
>>> -	while (buflen) {
>>> -		vec[i].iov_base = page_address(write->wr_pagelist[i - 1]);
>>> -		vec[i].iov_len = min_t(int, PAGE_SIZE, buflen);
>>> -		buflen -= vec[i].iov_len;
>>> -		i++;
>>> -	}
>>> -	return i;
>>> -}
>>> -
>>> static __be32
>>> nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>>> 	    union nfsd4_op_u *u)
>>> @@ -1000,8 +982,8 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
>>> 	stateid_t *stateid = &write->wr_stateid;
>>> 	struct file *filp = NULL;
>>> 	__be32 status = nfs_ok;
>>> +	unsigned int nvecs;
>>> 	unsigned long cnt;
>>> -	int nvecs;
>>>
>>> 	if (write->wr_offset >= OFFSET_MAX)
>>> 		return nfserr_inval;
>>> @@ -1019,12 +1001,12 @@ static int fill_in_write_vector(struct kvec *vec, struct nfsd4_write *write)
>>> 	write->wr_how_written = write->wr_stable_how;
>>> 	gen_boot_verifier(&write->wr_verifier, SVC_NET(rqstp));
>>>
>>> -	nvecs = fill_in_write_vector(rqstp->rq_vec, write);
>>> -	WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
>>> -
>>> +	nvecs = svc_fill_write_vector(rqstp, &write->wr_head, cnt);
>>> +	if (!nvecs)
>>> +		return nfserr_io;
>>> 	status = nfsd_vfs_write(rqstp, &cstate->current_fh, filp,
>>> -				write->wr_offset, rqstp->rq_vec, nvecs, &cnt,
>>> -				write->wr_how_written);
>>> +				write->wr_offset, rqstp->rq_vec, nvecs,
>>> +				&cnt, write->wr_how_written);
>>> 	fput(filp);
>>>
>>> 	write->wr_bytes_written = cnt;
>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>> index f75cc7d..d33c8aa 100644
>>> --- a/fs/nfsd/nfs4xdr.c
>>> +++ b/fs/nfsd/nfs4xdr.c
>>> @@ -1286,7 +1286,6 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compoundargs *argp, struct xdr_ne
>>> 	}
>>> 	write->wr_head.iov_base = p;
>>> 	write->wr_head.iov_len = avail;
>>> -	write->wr_pagelist = argp->pagelist;
>>>
>>> 	len = XDR_QUADLEN(write->wr_buflen) << 2;
>>> 	if (len >= avail) {
>>> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
>>> index 43c0419..1995ea6 100644
>>> --- a/fs/nfsd/nfsproc.c
>>> +++ b/fs/nfsd/nfsproc.c
>>> @@ -212,13 +212,18 @@
>>> 	struct nfsd_attrstat *resp = rqstp->rq_resp;
>>> 	__be32	nfserr;
>>> 	unsigned long cnt = argp->len;
>>> +	unsigned int nvecs;
>>>
>>> 	dprintk("nfsd: WRITE    %s %d bytes at %d\n",
>>> 		SVCFH_fmt(&argp->fh),
>>> 		argp->len, argp->offset);
>>>
>>> -	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh), argp->offset,
>>> -			    rqstp->rq_vec, argp->vlen, &cnt, NFS_DATA_SYNC);
>>> +	nvecs = svc_fill_write_vector(rqstp, &argp->first, cnt);
>>> +	if (!nvecs)
>>> +		return nfserr_io;
>>> +	nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
>>> +			    argp->offset, rqstp->rq_vec, nvecs,
>>> +			    &cnt, NFS_DATA_SYNC);
>>> 	return nfsd_return_attrs(nfserr, resp);
>>> }
>>>
>>> diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
>>> index 79b6064..db24ae8 100644
>>> --- a/fs/nfsd/nfsxdr.c
>>> +++ b/fs/nfsd/nfsxdr.c
@@ -287,7 +287,6 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f >>> struct nfsd_writeargs *args = rqstp->rq_argp; >>> unsigned int len, hdr, dlen; >>> struct kvec *head = rqstp->rq_arg.head; >>> - int v; >>> >>> p = decode_fh(p, &args->fh); >>> if (!p) >>> @@ -323,17 +322,8 @@ __be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *f >>> if (dlen < XDR_QUADLEN(len)*4) >>> return 0; >>> >>> - rqstp->rq_vec[0].iov_base = (void*)p; >>> - rqstp->rq_vec[0].iov_len = head->iov_len - hdr; >>> - v = 0; >>> - while (len > rqstp->rq_vec[v].iov_len) { >>> - len -= rqstp->rq_vec[v].iov_len; >>> - v++; >>> - rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]); >>> - rqstp->rq_vec[v].iov_len = PAGE_SIZE; >>> - } >>> - rqstp->rq_vec[v].iov_len = len; >>> - args->vlen = v + 1; >>> + args->first.iov_base = (void *)p; >>> + args->first.iov_len = head->iov_len - hdr; >>> return 1; >>> } >>> >>> diff --git a/fs/nfsd/xdr.h b/fs/nfsd/xdr.h >>> index 2f4f22e..a765c41 100644 >>> --- a/fs/nfsd/xdr.h >>> +++ b/fs/nfsd/xdr.h >>> @@ -34,7 +34,7 @@ struct nfsd_writeargs { >>> svc_fh fh; >>> __u32 offset; >>> int len; >>> - int vlen; >>> + struct kvec first; >>> }; >>> >>> struct nfsd_createargs { >>> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h >>> index 056bf8a..deccf7f 100644 >>> --- a/fs/nfsd/xdr3.h >>> +++ b/fs/nfsd/xdr3.h >>> @@ -41,7 +41,7 @@ struct nfsd3_writeargs { >>> __u32 count; >>> int stable; >>> __u32 len; >>> - int vlen; >>> + struct kvec first; >>> }; >>> >>> struct nfsd3_createargs { >>> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h >>> index bc29511..d56219d 100644 >>> --- a/fs/nfsd/xdr4.h >>> +++ b/fs/nfsd/xdr4.h >>> @@ -390,7 +390,6 @@ struct nfsd4_write { >>> u32 wr_stable_how; /* request */ >>> u32 wr_buflen; /* request */ >>> struct kvec wr_head; >>> - struct page ** wr_pagelist; /* request */ >>> >>> u32 wr_bytes_written; /* response */ >>> u32 wr_how_written; /* response */ >>> diff --git 
a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h >>> index 1939709..6dc7879 100644 >>> --- a/include/linux/sunrpc/svc.h >>> +++ b/include/linux/sunrpc/svc.h >>> @@ -494,6 +494,8 @@ int svc_register(const struct svc_serv *, struct net *, const int, >>> void svc_reserve(struct svc_rqst *rqstp, int space); >>> struct svc_pool * svc_pool_for_cpu(struct svc_serv *serv, int cpu); >>> char * svc_print_addr(struct svc_rqst *, char *, size_t); >>> +unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, >>> + struct kvec *first, size_t total); >>> >>> #define RPC_MAX_ADDRBUFLEN (63U) >>> >>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c >>> index f19987f..a155e2d 100644 >>> --- a/net/sunrpc/svc.c >>> +++ b/net/sunrpc/svc.c >>> @@ -1533,3 +1533,45 @@ u32 svc_max_payload(const struct svc_rqst *rqstp) >>> return max; >>> } >>> EXPORT_SYMBOL_GPL(svc_max_payload); >>> + >>> +/** >>> + * svc_fill_write_vector - Construct data argument for VFS write call >>> + * @rqstp: svc_rqst to operate on >>> + * @first: buffer containing first section of write payload >>> + * @total: total number of bytes of write payload >>> + * >>> + * Returns the number of elements populated in the data argument array. >>> + */ >>> +unsigned int svc_fill_write_vector(struct svc_rqst *rqstp, struct kvec *first, >>> + size_t total) >>> +{ >>> + struct kvec *vec = rqstp->rq_vec; >>> + struct page **pages; >>> + unsigned int i; >>> + >>> + /* Some types of transport can present the write payload >>> + * entirely in rq_arg.pages. In this case, @first is empty. 
>>> +	 */
>>> +	i = 0;
>>> +	if (first->iov_len) {
>>> +		vec[i].iov_base = first->iov_base;
>>> +		vec[i].iov_len = min_t(size_t, total, first->iov_len);
>>> +		total -= vec[i].iov_len;
>>> +		++i;
>>> +	}
>>> +
>>> +	WARN_ON_ONCE(rqstp->rq_arg.page_base != 0);
>>> +	pages = rqstp->rq_arg.pages;
>>> +	while (total) {
>>> +		vec[i].iov_base = page_address(*pages);
>>> +		vec[i].iov_len = min_t(size_t, total, PAGE_SIZE);
>>> +		total -= vec[i].iov_len;
>>> +		++i;
>>> +
>>> +		++pages;
>>> +	}
>>> +
>>> +	WARN_ON_ONCE(i > ARRAY_SIZE(rqstp->rq_vec));
>>> +	return i;
>>> +}
>>> +EXPORT_SYMBOL_GPL(svc_fill_write_vector);
>
> --
> Chuck Lever

--
Chuck Lever

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html