On Tue, Jul 19, 2022 at 1:21 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>
>
> > On Jul 17, 2022, at 9:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
> >>> On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote:
> >>>
> >>> +nfsd4_encode_read_plus_segments(struct nfsd4_compoundres *resp,
> >>> +                                struct nfsd4_read *read,
> >>> +                                unsigned int *segments, u32 *eof)
> >>> {
> >>> -        struct file *file = read->rd_nf->nf_file;
> >>> -        loff_t data_pos = vfs_llseek(file, read->rd_offset, SEEK_DATA);
> >>> -        loff_t f_size = i_size_read(file_inode(file));
> >>> -        unsigned long count;
> >>> -        __be32 *p;
> >>> +        struct xdr_stream *xdr = resp->xdr;
> >>> +        unsigned int bufpos = xdr->buf->len;
> >>> +        u64 offset = read->rd_offset;
> >>> +        struct read_plus_segment segment;
> >>> +        enum data_content4 pagetype;
> >>> +        unsigned long maxcount;
> >>> +        unsigned int pagenum = 0;
> >>> +        unsigned int pagelen;
> >>> +        char *vpage, *p;
> >>> +        __be32 nfserr;
> >>>
> >>> -        if (data_pos == -ENXIO)
> >>> -                data_pos = f_size;
> >>> -        else if (data_pos <= read->rd_offset || (data_pos < f_size && data_pos % PAGE_SIZE))
> >>> -                return nfsd4_encode_read_plus_data(resp, read, maxcount, eof, &f_size);
> >>> -        count = data_pos - read->rd_offset;
> >>> -
> >>> -        /* Content type, offset, byte count */
> >>> -        p = xdr_reserve_space(resp->xdr, 4 + 8 + 8);
> >>> -        if (!p)
> >>> +        /* enough space for a HOLE segment before we switch to the pages */
> >>> +        if (!xdr_reserve_space(xdr, 5 * XDR_UNIT))
> >>>                  return nfserr_resource;
> >>> +        xdr_commit_encode(xdr);
> >>>
> >>> -        *p++ = htonl(NFS4_CONTENT_HOLE);
> >>> -        p = xdr_encode_hyper(p, read->rd_offset);
> >>> -        p = xdr_encode_hyper(p, count);
> >>> +        maxcount = min_t(unsigned long, read->rd_length,
> >>> +                         (xdr->buf->buflen - xdr->buf->len));
> >>>
> >>> -        *eof = (read->rd_offset + count) >= f_size;
> >>> -        *maxcount = min_t(unsigned long, count, *maxcount);
> >>> +        nfserr = nfsd4_read_plus_readv(resp, read, &maxcount, eof);
> >>> +        if (nfserr)
> >>> +                return nfserr;
> >>> +
> >>> +        while (maxcount > 0) {
> >>> +                vpage = xdr_buf_nth_page_address(xdr->buf, pagenum, &pagelen);
> >>> +                pagelen = min_t(unsigned int, pagelen, maxcount);
> >>> +                if (!vpage || pagelen == 0)
> >>> +                        break;
> >>> +                p = memchr_inv(vpage, 0, pagelen);
> >>
> >> I'm still not happy about touching every byte in each READ_PLUS
> >> payload. I think even though the rest of this work is merge-ready,
> >> this is a brute-force mechanism that's OK for a proof of concept
> >> but not appropriate for production-ready code.
> >
> > Seems like a step backwards as it defeats the benefit of zero-copy
> > read IO paths on the server side....
>
> Tom Haynes' vision for READ_PLUS was to eventually replace the
> legacy READ operation. That means READ_PLUS(CONTENT_DATA) needs
> to be as fast and efficient as plain READ. (It would be great
> to use splice reads for CONTENT_DATA if we can!)

I remember Bruce thinking we could only use splice reads for the very
first segment if it's data, but that was a few years ago, so I don't
know if anything has changed that would allow spliced reads for all
data.

>
> But I also thought the purpose of READ_PLUS was to help clients
> preserve unallocated extents in files during copy operations.
> An unallocated extent is not the same as an allocated extent
> that has zeroes written into it. IIUC this new logic does not
> distinguish between those two cases at all. (And please correct
> me if this is really not the goal of READ_PLUS).

I wasn't aware of this as a goal of READ_PLUS. As of right now, Linux
doesn't really have a way to punch holes into pagecache data, so we
end up needing to zero-fill on the client side during decoding.

>
> I would like to retain precise detection of unallocated extents
> in files. Maybe SEEK_HOLE/SEEK_DATA is not how to do that, but
> my feeling is that checking for zero bytes is definitely not
> the way to do it.

Ok.

>
>
> >> I've cc'd linux-fsdevel to see if we can get some more ideas going
> >> and move this forward.
> >>
> >> Another thought I had was to support returning only one or two
> >> segments per reply. One CONTENT segment, one HOLE segment, or one
> >> of each. Would that be enough to prevent the issues around file
> >> updates racing with the construction of the reply?
> >
> > Before I can make any sort of useful suggestion, I need to have it
> > explained to me why we care if the underlying file mapping has
> > changed between the read of the data and the SEEK_HOLE trim check,
> > because it's not at all clear to me what problem this change is
> > actually solving.
>
> The cover letter for this series calls out generic/091 and
> generic/263 -- it mentioned both are failing with NFSv4.2. I've
> tried to reproduce failures here, but both passed.

Did you check that CONFIG_NFS_V4_2_READ_PLUS=y is set on the client?
We had it disabled due to these tests, and because of the very, very
long time servers exporting btrfs take to read already-cached data
(see the btrfs sections in the wiki page linked in the cover letter).
There is also this bugzilla documenting the problem:
https://bugzilla.kernel.org/show_bug.cgi?id=215673

>
> Anna, can you provide a root-cause analysis of what is failing
> in your testing? Maybe a reproducer for us to kick around?

The only reproducers I have are the xfstests mentioned. They fail
pretty reliably on my setup (Linux nfsd exporting xfs).

Anna

> I'm guessing you might be encountering races because your
> usual test environment is virtual, so we need to understand
> how timing affects the results.
>
>
> --
> Chuck Lever
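P.S. For anyone who wants to see the distinction Chuck is drawing
between an unallocated extent and an allocated extent that just has
zeroes written into it, below is a quick userspace sketch -- not nfsd
code, and the file name, sizes, and offsets are made up for
illustration. It walks a file with lseek(SEEK_DATA)/lseek(SEEK_HOLE):
the block of zeroes that was explicitly written comes back as DATA,
while the range that was never written comes back as HOLE (on
filesystems that report holes at all). A byte-scanning check like
memchr_inv() can't tell those two cases apart, which is the
information loss being discussed above.

/*
 * Quick demo, not nfsd code: walk a file with SEEK_DATA/SEEK_HOLE and
 * print the segments.  The file name and offsets are arbitrary; what
 * gets reported as a hole depends on the underlying filesystem.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        int fd = open("sparse-demo.bin", O_CREAT | O_TRUNC | O_RDWR, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Error checking of the writes is omitted for brevity. */
        memset(buf, 'A', sizeof(buf));
        pwrite(fd, buf, sizeof(buf), 0);        /* real data */

        memset(buf, 0, sizeof(buf));
        pwrite(fd, buf, sizeof(buf), 4096);     /* zeroes, but written/allocated */

        pwrite(fd, "Z", 1, 196608);             /* leaves 8K-192K unwritten: a hole */

        fsync(fd);                              /* settle the extent map first */

        off_t end = lseek(fd, 0, SEEK_END);
        off_t pos = 0;

        while (pos < end) {
                off_t data = lseek(fd, pos, SEEK_DATA);
                if (data < 0)
                        break;                  /* nothing but holes remain */
                off_t hole = lseek(fd, data, SEEK_HOLE);

                if (data > pos)
                        printf("HOLE %10lld - %10lld\n",
                               (long long)pos, (long long)data);
                printf("DATA %10lld - %10lld\n",
                       (long long)data, (long long)hole);
                pos = hole;
        }

        close(fd);
        return 0;
}

On a filesystem that doesn't report holes, everything comes back as a
single DATA segment, so this is only ever best-effort -- but at least
it preserves the allocated-vs-unallocated distinction that a zero scan
throws away.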