> On Sep 7, 2022, at 6:33 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > On Wed, 2022-09-07 at 20:51 +0000, Chuck Lever III wrote: >> >> >>> On Sep 7, 2022, at 4:37 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote: >>> >>> On Wed, Sep 7, 2022 at 4:29 PM Chuck Lever III >>> <chuck.lever@xxxxxxxxxx> wrote: >>>> >>>> Be sure to Cc: Jeff on these. Thanks! >>>> >>>> >>>>> On Sep 7, 2022, at 3:52 PM, Anna Schumaker <anna@xxxxxxxxxx> >>>>> wrote: >>>>> >>>>> From: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx> >>>>> >>>>> Chuck had suggested reverting READ_PLUS so it returns a single >>>>> DATA >>>>> segment covering the requested read range. This prepares the >>>>> server for >>>>> a future "sparse read" function so support can easily be added >>>>> without >>>>> needing to rip out the old READ_PLUS code at the same time. >>>>> >>>>> Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx> >>>>> --- >>>>> fs/nfsd/nfs4xdr.c | 139 +++++++++++---------------------------- >>>>> ------- >>>>> 1 file changed, 32 insertions(+), 107 deletions(-) >>>>> >>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>> index 1e9690a061ec..bcc8c385faf2 100644 >>>>> --- a/fs/nfsd/nfs4xdr.c >>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>> @@ -4731,79 +4731,37 @@ nfsd4_encode_offload_status(struct >>>>> nfsd4_compoundres *resp, __be32 nfserr, >>>>> >>>>> static __be32 >>>>> nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp, >>>>> - struct nfsd4_read *read, >>>>> - unsigned long *maxcount, u32 *eof, >>>>> - loff_t *pos) >>>>> + struct nfsd4_read *read) >>>>> { >>>>> - struct xdr_stream *xdr = resp->xdr; >>>>> + bool splice_ok = test_bit(RQ_SPLICE_OK, &resp->rqstp- >>>>>> rq_flags); >>>>> struct file *file = read->rd_nf->nf_file; >>>>> - int starting_len = xdr->buf->len; >>>>> - loff_t hole_pos; >>>>> - __be32 nfserr; >>>>> - __be32 *p, tmp; >>>>> - __be64 tmp64; >>>>> - >>>>> - hole_pos = pos ? *pos : vfs_llseek(file, read->rd_offset, >>>>> SEEK_HOLE); >>>>> - if (hole_pos > read->rd_offset) >>>>> - *maxcount = min_t(unsigned long, *maxcount, >>>>> hole_pos - read->rd_offset); >>>>> - *maxcount = min_t(unsigned long, *maxcount, (xdr->buf- >>>>>> buflen - xdr->buf->len)); >>>>> + struct xdr_stream *xdr = resp->xdr; >>>>> + unsigned long maxcount; >>>>> + __be32 nfserr, *p; >>>>> >>>>> /* Content type, offset, byte count */ >>>>> p = xdr_reserve_space(xdr, 4 + 8 + 4); >>>>> if (!p) >>>>> - return nfserr_resource; >>>>> + return nfserr_io; >>>> >>>> Wouldn't nfserr_rep_too_big be a more appropriate status for >>>> running >>>> off the end of the send buffer? I'm not 100% sure, but I would >>>> expect >>>> that exhausting send buffer space would imply the reply has grown >>>> too >>>> large. >>> >>> I can switch it to that, no problem. >> >> I would like to hear opinions from protocol experts before we go >> with that choice. > > I'd agree that NFS4ERR_REP_TOO_BIG is correct if you're not able to > even return a short read. However if you can return a short read, then > that's better than returning an error. Many thanks for reviewing! In general, I might want to convert all NFSv4 encoders to return either RESOURCE or REP_TOO_BIG when xdr_reserve_space() fails. I can post a patch or two that makes that attempt so the special cases can be sorted on the mailing list. > It looks to me as if this function can bit hit in both cases, so > perhaps some care is in order. Intriguing idea. For READ, if xdr_reserve_space() fails, that's all she wrote. I think all these calls happen before the data payload is encoded. For READ_PLUS, if xdr_reserve_space() fails, it might be possible to use xdr_truncate_encode() or simply start over with a limit on the number of segments to be encoded. We're not really there yet, as currently we want to return just a single segment at this point. I feel the question is whether it's practical or a frequent enough occurrence to bother with the special cases. Encoding a READ_PLUS response is already challenging. >>>>> + if (resp->xdr->buf->page_len && splice_ok) { >>>>> + WARN_ON_ONCE(splice_ok); >>>>> + return nfserr_io; >>>>> + } >>>> >>>> I wish I understood why this test was needed. It seems to have >>>> been >>>> copied and pasted from historic code into nfsd4_encode_read(), >>>> and >>>> there have been recent mechanical changes to it, but there's no >>>> comment explaining it there... >>> >>> Yeah, I saw this was in the read code and assumed it was an >>> important >>> check so I added it here too. >>>> >>>> In any event, this seems to be checking for a server software >>>> bug, >>>> so maybe this should return nfserr_serverfault. Oddly that status >>>> code isn't defined yet. >>> >>> Do you want me to add that code and return it in this patch? >> >> Sure. Make that a predecessor patch and fix up the code in >> nfsd4_encode_read() before using it for READ_PLUS in this patch. >> >> Suggested patch description: >> >> RESOURCE is the wrong status code here because >> >> a) This encoder is shared between NFSv4.0 and NFSv4.1+, and >> NFSv4.1+ doesn't support RESOURCE, and > > Is it? I thought it was READ_PLUS only. That's NFSv4.2 or higher. For the initial patch that adds nfserr_serverfault, I was speaking only about nfsd4_encode_read(), which is shared amongst all minor versions. That's where the page_len && splice_ok test originally came from. Sorry that wasn't clear. > If the encoder is to be shared with both READ and READ_PLUS, then > perhaps it might be wiser to choose a name other than > nfsd4_encode_read_plus_data()? Although the encoded streams are very similar, there are still separate encoders for a READ_PLUS data segment and a full READ response. The common pieces for both of those operations are nfsd4_encode_readv() and nfsd4_encode_splice_read(). IIUC nfsd4_encode_read_plus_data() has to deal with encoding the the NFS4_CONTENT_DATA enumerator -- it handles a single READ_PLUS segment. nfsd4_encode_read() encodes one complete READ response. I'm comfortable with the current function names. >> b) That status code might cause the client to retry, in which >> case it will get the same failure because this check seems >> to be looking for a server-side coding mistake. >> >> >>>> Do you have some performance results for v2? >>> >>> Not yet, I have it running now so hopefully I'll have something >>> ready >>> by tomorrow morning. >> >> Great, thanks! >> >> >>> Anna >>>> >>>> >>>>> - read->rd_vlen = xdr_reserve_space_vec(xdr, resp->rqstp- >>>>>> rq_vec, *maxcount); >>>>> - if (read->rd_vlen < 0) >>>>> - return nfserr_resource; >>>>> + maxcount = min_t(unsigned long, read->rd_length, >>>>> + (xdr->buf->buflen - xdr->buf->len)); >>>>> >>>>> - nfserr = nfsd_readv(resp->rqstp, read->rd_fhp, file, >>>>> read->rd_offset, >>>>> - resp->rqstp->rq_vec, read->rd_vlen, >>>>> maxcount, eof); >>>>> + if (file->f_op->splice_read && splice_ok) >>>>> + nfserr = nfsd4_encode_splice_read(resp, read, >>>>> file, maxcount); >>>>> + else >>>>> + nfserr = nfsd4_encode_readv(resp, read, file, >>>>> maxcount); >>>>> if (nfserr) >>>>> return nfserr; >>>>> - xdr_truncate_encode(xdr, starting_len + 16 + >>>>> xdr_align_size(*maxcount)); >>>>> >>>>> - tmp = htonl(NFS4_CONTENT_DATA); >>>>> - write_bytes_to_xdr_buf(xdr->buf, starting_len, >>>>> &tmp, 4); >>>>> - tmp64 = cpu_to_be64(read->rd_offset); >>>>> - write_bytes_to_xdr_buf(xdr->buf, starting_len + 4, >>>>> &tmp64, 8); >>>>> - tmp = htonl(*maxcount); >>>>> - write_bytes_to_xdr_buf(xdr->buf, starting_len + 12, >>>>> &tmp, 4); >>>>> - >>>>> - tmp = xdr_zero; >>>>> - write_bytes_to_xdr_buf(xdr->buf, starting_len + 16 + >>>>> *maxcount, &tmp, >>>>> - xdr_pad_size(*maxcount)); >>>>> - return nfs_ok; >>>>> -} >>>>> - >>>>> -static __be32 >>>>> -nfsd4_encode_read_plus_hole(struct nfsd4_compoundres *resp, >>>>> - struct nfsd4_read *read, >>>>> - unsigned long *maxcount, u32 *eof) >>>>> -{ >>>>> - struct file *file = read->rd_nf->nf_file; >>>>> - loff_t data_pos = vfs_llseek(file, read->rd_offset, >>>>> SEEK_DATA); >>>>> - loff_t f_size = i_size_read(file_inode(file)); >>>>> - unsigned long count; >>>>> - __be32 *p; >>>>> - >>>>> - if (data_pos == -ENXIO) >>>>> - data_pos = f_size; >>>>> - else if (data_pos <= read->rd_offset || (data_pos < >>>>> f_size && data_pos % PAGE_SIZE)) >>>>> - return nfsd4_encode_read_plus_data(resp, read, >>>>> maxcount, eof, &f_size); >>>>> - count = data_pos - read->rd_offset; >>>>> - >>>>> - /* Content type, offset, byte count */ >>>>> - p = xdr_reserve_space(resp->xdr, 4 + 8 + 8); >>>>> - if (!p) >>>>> - return nfserr_resource; >>>>> - >>>>> - *p++ = htonl(NFS4_CONTENT_HOLE); >>>>> + *p++ = cpu_to_be32(NFS4_CONTENT_DATA); >>>>> p = xdr_encode_hyper(p, read->rd_offset); >>>>> - p = xdr_encode_hyper(p, count); >>>>> + *p = cpu_to_be32(read->rd_length); >>>>> >>>>> - *eof = (read->rd_offset + count) >= f_size; >>>>> - *maxcount = min_t(unsigned long, count, *maxcount); >>>>> return nfs_ok; >>>>> } >>>>> >>>>> @@ -4811,69 +4769,36 @@ static __be32 >>>>> nfsd4_encode_read_plus(struct nfsd4_compoundres *resp, __be32 >>>>> nfserr, >>>>> struct nfsd4_read *read) >>>>> { >>>>> - unsigned long maxcount, count; >>>>> + struct file *file = read->rd_nf->nf_file; >>>>> struct xdr_stream *xdr = resp->xdr; >>>>> - struct file *file; >>>>> int starting_len = xdr->buf->len; >>>>> - int last_segment = xdr->buf->len; >>>>> - int segments = 0; >>>>> - __be32 *p, tmp; >>>>> - bool is_data; >>>>> - loff_t pos; >>>>> - u32 eof; >>>>> + u32 segments = 0; >>>>> + __be32 *p; >>>>> >>>>> if (nfserr) >>>>> return nfserr; >>>>> - file = read->rd_nf->nf_file; >>>>> >>>>> /* eof flag, segment count */ >>>>> p = xdr_reserve_space(xdr, 4 + 4); >>>>> if (!p) >>>>> - return nfserr_resource; >>>>> + return nfserr_io; >>>>> xdr_commit_encode(xdr); >>>>> >>>>> - maxcount = min_t(unsigned long, read->rd_length, >>>>> - (xdr->buf->buflen - xdr->buf->len)); >>>>> - count = maxcount; >>>>> - >>>>> - eof = read->rd_offset >= i_size_read(file_inode(file)); >>>>> - if (eof) >>>>> + read->rd_eof = read->rd_offset >= >>>>> i_size_read(file_inode(file)); >>>>> + if (read->rd_eof) >>>>> goto out; >>>>> >>>>> - pos = vfs_llseek(file, read->rd_offset, SEEK_HOLE); >>>>> - is_data = pos > read->rd_offset; >>>>> - >>>>> - while (count > 0 && !eof) { >>>>> - maxcount = count; >>>>> - if (is_data) >>>>> - nfserr = >>>>> nfsd4_encode_read_plus_data(resp, read, &maxcount, &eof, >>>>> - segments == 0 ? >>>>> &pos : NULL); >>>>> - else >>>>> - nfserr = >>>>> nfsd4_encode_read_plus_hole(resp, read, &maxcount, &eof); >>>>> - if (nfserr) >>>>> - goto out; >>>>> - count -= maxcount; >>>>> - read->rd_offset += maxcount; >>>>> - is_data = !is_data; >>>>> - last_segment = xdr->buf->len; >>>>> - segments++; >>>>> - } >>>>> - >>>>> -out: >>>>> - if (nfserr && segments == 0) >>>>> + nfserr = nfsd4_encode_read_plus_data(resp, read); >>>>> + if (nfserr) { >>>>> xdr_truncate_encode(xdr, starting_len); >>>>> - else { >>>>> - if (nfserr) { >>>>> - xdr_truncate_encode(xdr, last_segment); >>>>> - nfserr = nfs_ok; >>>>> - eof = 0; >>>>> - } >>>>> - tmp = htonl(eof); >>>>> - write_bytes_to_xdr_buf(xdr->buf, >>>>> starting_len, &tmp, 4); >>>>> - tmp = htonl(segments); >>>>> - write_bytes_to_xdr_buf(xdr->buf, starting_len + >>>>> 4, &tmp, 4); >>>>> + return nfserr; >>>>> } >>>>> >>>>> + segments++; >>>>> + >>>>> +out: >>>>> + p = xdr_encode_bool(p, read->rd_eof); >>>>> + *p = cpu_to_be32(segments); >>>>> return nfserr; >>>>> } >>>>> >>>>> -- >>>>> 2.37.2 >>>>> >>>> >>>> -- >>>> Chuck Lever >> >> -- >> Chuck Lever >> >> >> > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@xxxxxxxxxxxxxxx -- Chuck Lever