Re: [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation

> On Jul 19, 2022, at 4:24 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote:
> 
> On Tue, Jul 19, 2022 at 1:21 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>>> On Jul 17, 2022, at 9:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
>>>>> On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote:
>>>>> 
>>>>> +nfsd4_encode_read_plus_segments(struct nfsd4_compoundres *resp,
>>>>> + struct nfsd4_read *read,
>>>>> + unsigned int *segments, u32 *eof)
>>>>> {
>>>>> - struct file *file = read->rd_nf->nf_file;
>>>>> - loff_t data_pos = vfs_llseek(file, read->rd_offset, SEEK_DATA);
>>>>> - loff_t f_size = i_size_read(file_inode(file));
>>>>> - unsigned long count;
>>>>> - __be32 *p;
>>>>> + struct xdr_stream *xdr = resp->xdr;
>>>>> + unsigned int bufpos = xdr->buf->len;
>>>>> + u64 offset = read->rd_offset;
>>>>> + struct read_plus_segment segment;
>>>>> + enum data_content4 pagetype;
>>>>> + unsigned long maxcount;
>>>>> + unsigned int pagenum = 0;
>>>>> + unsigned int pagelen;
>>>>> + char *vpage, *p;
>>>>> + __be32 nfserr;
>>>>> 
>>>>> - if (data_pos == -ENXIO)
>>>>> - data_pos = f_size;
>>>>> - else if (data_pos <= read->rd_offset || (data_pos < f_size && data_pos % PAGE_SIZE))
>>>>> - return nfsd4_encode_read_plus_data(resp, read, maxcount, eof, &f_size);
>>>>> - count = data_pos - read->rd_offset;
>>>>> -
>>>>> - /* Content type, offset, byte count */
>>>>> - p = xdr_reserve_space(resp->xdr, 4 + 8 + 8);
>>>>> - if (!p)
>>>>> + /* enough space for a HOLE segment before we switch to the pages */
>>>>> + if (!xdr_reserve_space(xdr, 5 * XDR_UNIT))
>>>>> return nfserr_resource;
>>>>> + xdr_commit_encode(xdr);
>>>>> 
>>>>> - *p++ = htonl(NFS4_CONTENT_HOLE);
>>>>> - p = xdr_encode_hyper(p, read->rd_offset);
>>>>> - p = xdr_encode_hyper(p, count);
>>>>> + maxcount = min_t(unsigned long, read->rd_length,
>>>>> + (xdr->buf->buflen - xdr->buf->len));
>>>>> 
>>>>> - *eof = (read->rd_offset + count) >= f_size;
>>>>> - *maxcount = min_t(unsigned long, count, *maxcount);
>>>>> + nfserr = nfsd4_read_plus_readv(resp, read, &maxcount, eof);
>>>>> + if (nfserr)
>>>>> + return nfserr;
>>>>> +
>>>>> + while (maxcount > 0) {
>>>>> + vpage = xdr_buf_nth_page_address(xdr->buf, pagenum, &pagelen);
>>>>> + pagelen = min_t(unsigned int, pagelen, maxcount);
>>>>> + if (!vpage || pagelen == 0)
>>>>> + break;
>>>>> + p = memchr_inv(vpage, 0, pagelen);
>>>> 
>>>> I'm still not happy about touching every byte in each READ_PLUS
>>>> payload. I think even though the rest of this work is merge-ready,
>>>> this is a brute-force mechanism that's OK for a proof of concept
>>>> but not appropriate for production-ready code.
>>> 
>>> Seems like a step backwards as it defeats the benefit zero-copy read
>>> IO paths on the server side....
>> 
>> Tom Haynes' vision for READ_PLUS was to eventually replace the
>> legacy READ operation. That means READ_PLUS(CONTENT_DATA) needs
>> to be as fast and efficient as plain READ. (It would be great
>> to use splice reads for CONTENT_DATA if we can!)
> 
> I remember Bruce thinking we could only use splice reads for the very
> first segment if it's data, but that was a few years ago so I don't
> know if anything has changed that would allow spliced reads for all
> data.

That might be a limitation of the current NFSv4 READ implementation
rather than something we always have to live with. But yes, I think
it's true that splice read only works for the first READ in a
COMPOUND, and READ_PLUS CONTENT_DATA segments are a similar
situation.

Even so, IMO enabling splice read on the first segment would cover
important use cases, in particular the case when there are no holes.
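
The kind of gating I have in mind is roughly this (the helper name and its
placement are invented for the sake of discussion, not a finished patch):

/*
 * Hypothetical helper: splice is only safe for the very first
 * segment of the reply, before anything has been encoded into
 * the xdr_buf's page array.
 */
static bool read_plus_can_splice(const struct xdr_stream *xdr,
				 unsigned int segments)
{
	return segments == 0 && xdr->buf->page_len == 0;
}

i.e. fall back to the readv path for everything after the first data
segment.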

'Twould be interesting to add a metric to see how often READ_PLUS
returns a hole. Just a thought.
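
Nothing fancy -- even a bare counter bumped where the HOLE segment gets
encoded would tell us something. The name below is made up purely for
illustration:

#include <linux/atomic.h>

/* Illustration only: count encoded CONTENT_HOLE segments */
static atomic_long_t nfsd4_read_plus_holes = ATOMIC_LONG_INIT(0);

static inline void nfsd4_count_hole_segment(void)
{
	atomic_long_inc(&nfsd4_read_plus_holes);
}

How the counter gets surfaced to user space is a separate question.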


>> But I also thought the purpose of READ_PLUS was to help clients
>> preserve unallocated extents in files during copy operations.
>> An unallocated extent is not the same as an allocated extent
>> that has zeroes written into it. IIUC this new logic does not
>> distinguish between those two cases at all. (And please correct
>> me if this is really not the goal of READ_PLUS).
> 
> I wasn't aware of this as a goal of READ_PLUS. As of right now, Linux
> doesn't really have a way to punch holes into pagecache data, so we
> end up needing to zero-fill on the client side during decoding.

Again, that might not always be the case? But OK, I'll table that.


>> I would like to retain precise detection of unallocated extents
>> in files. Maybe SEEK_HOLE/SEEK_DATA is not how to do that, but
>> my feeling is that checking for zero bytes is definitely not
>> the way to do it.
> 
> Ok.
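
To be concrete about what "precise detection" means to me, the check
would look something like this rough sketch (the helper is invented;
only vfs_llseek() and SEEK_DATA are real):

#include <linux/fs.h>

/*
 * Sketch only: ask the filesystem whether [offset, offset + len)
 * contains any data. An allocated extent that happens to be full
 * of zeroes is normally still reported as data by SEEK_DATA,
 * which is exactly the distinction a memchr_inv() scan can't make.
 */
static bool read_plus_range_is_hole(struct file *file, loff_t offset,
				    loff_t len)
{
	loff_t data = vfs_llseek(file, offset, SEEK_DATA);

	if (data == -ENXIO)		/* no data at or past offset */
		return true;
	if (data < 0)			/* error, or not supported */
		return false;
	return data >= offset + len;
}

Whether SEEK_HOLE/SEEK_DATA is the right interface for the encoder is a
separate argument, but that's the semantic I'd like to keep.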
>> 
>> 
>>>> I've cc'd linux-fsdevel to see if we can get some more ideas going
>>>> and move this forward.
>>>> 
>>>> Another thought I had was to support returning only one or two
>>>> segments per reply. One CONTENT segment, one HOLE segment, or one
>>>> of each. Would that be enough to prevent the issues around file
>>>> updates racing with the construction of the reply?
>>> 
>>> Before I can make any sort of useful suggestion, I need to have it
>>> explained to me why we care if the underlying file mapping has
>>> changed between the read of the data and the SEEK_HOLE trim check,
>>> because it's not at all clear to me what problem this change is
>>> actually solving.
>> 
>> The cover letter for this series calls out generic/091 and
>> generic/263 -- it mentioned both are failing with NFSv4.2. I've
>> tried to reproduce failures here, but both passed.
> 
> Did you check that CONFIG_NFS_V4_2_READ_PLUS=y on the client? We had
> it disabled due to these tests and the very, very long time servers
> exporting btrfs take to read already-cached data (see btrfs sections
> in the wiki page linked in the cover letter).

Confirmed: that config option is set in my client's booted kernel.


> There is also this bugzilla documenting the problem:
> https://bugzilla.kernel.org/show_bug.cgi?id=215673

OK, then let's add

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215673

to this patch when you resubmit. (I used to use BugLink:
but I gather Linus doesn't like that tag name).


>> Anna, can you provide a root-cause analysis of what is failing
>> in your testing? Maybe a reproducer for us to kick around?
> 
> The only reproducers I have are the xfstests mentioned. They fail
> pretty reliably on my setup (linux nfsd exporting xfs).

Then, the first order of business is to understand why you
can reproduce the failure easily but I cannot. (Hrm, I think
I did try to reproduce on xfs, but I'll have to check).

I'd like to see READ_PLUS working well in NFSD, so I'll
try to be as helpful as I can. There will be a learning
curve ;-)


> Anna
> 
>> I'm guessing you might be encountering races because your
>> usual test environment is virtual, so we need to understand
>> how timing affects the results.
>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever