On Tue, Jul 19, 2022 at 1:21 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>
>
> > On Jul 17, 2022, at 9:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
> >>> On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote:
> >>>
> >>> +nfsd4_encode_read_plus_segments(struct nfsd4_compoundres *resp,
> >>> +                                struct nfsd4_read *read,
> >>> +                                unsigned int *segments, u32 *eof)
> >>> {
> >>> -        struct file *file = read->rd_nf->nf_file;
> >>> -        loff_t data_pos = vfs_llseek(file, read->rd_offset, SEEK_DATA);
> >>> -        loff_t f_size = i_size_read(file_inode(file));
> >>> -        unsigned long count;
> >>> -        __be32 *p;
> >>> +        struct xdr_stream *xdr = resp->xdr;
> >>> +        unsigned int bufpos = xdr->buf->len;
> >>> +        u64 offset = read->rd_offset;
> >>> +        struct read_plus_segment segment;
> >>> +        enum data_content4 pagetype;
> >>> +        unsigned long maxcount;
> >>> +        unsigned int pagenum = 0;
> >>> +        unsigned int pagelen;
> >>> +        char *vpage, *p;
> >>> +        __be32 nfserr;
> >>>
> >>> -        if (data_pos == -ENXIO)
> >>> -                data_pos = f_size;
> >>> -        else if (data_pos <= read->rd_offset || (data_pos < f_size && data_pos % PAGE_SIZE))
> >>> -                return nfsd4_encode_read_plus_data(resp, read, maxcount, eof, &f_size);
> >>> -        count = data_pos - read->rd_offset;
> >>> -
> >>> -        /* Content type, offset, byte count */
> >>> -        p = xdr_reserve_space(resp->xdr, 4 + 8 + 8);
> >>> -        if (!p)
> >>> +        /* enough space for a HOLE segment before we switch to the pages */
> >>> +        if (!xdr_reserve_space(xdr, 5 * XDR_UNIT))
> >>>                  return nfserr_resource;
> >>> +        xdr_commit_encode(xdr);
> >>>
> >>> -        *p++ = htonl(NFS4_CONTENT_HOLE);
> >>> -        p = xdr_encode_hyper(p, read->rd_offset);
> >>> -        p = xdr_encode_hyper(p, count);
> >>> +        maxcount = min_t(unsigned long, read->rd_length,
> >>> +                         (xdr->buf->buflen - xdr->buf->len));
> >>>
> >>> -        *eof = (read->rd_offset + count) >= f_size;
> >>> -        *maxcount = min_t(unsigned long, count, *maxcount);
> >>> +        nfserr = nfsd4_read_plus_readv(resp, read, &maxcount, eof);
> >>> +        if (nfserr)
> >>> +                return nfserr;
> >>> +
> >>> +        while (maxcount > 0) {
> >>> +                vpage = xdr_buf_nth_page_address(xdr->buf, pagenum, &pagelen);
> >>> +                pagelen = min_t(unsigned int, pagelen, maxcount);
> >>> +                if (!vpage || pagelen == 0)
> >>> +                        break;
> >>> +                p = memchr_inv(vpage, 0, pagelen);
> >>
> >> I'm still not happy about touching every byte in each READ_PLUS
> >> payload. I think even though the rest of this work is merge-ready,
> >> this is a brute-force mechanism that's OK for a proof of concept
> >> but not appropriate for production-ready code.
> >
> > Seems like a step backwards as it defeats the benefit of zero-copy
> > read IO paths on the server side....
>
> Tom Haynes' vision for READ_PLUS was to eventually replace the
> legacy READ operation. That means READ_PLUS(CONTENT_DATA) needs
> to be as fast and efficient as plain READ. (It would be great
> to use splice reads for CONTENT_DATA if we can!)

I remember Bruce thinking we could only use splice reads for the very
first segment if it's data, but that was a few years ago, so I don't
know if anything has changed that would allow spliced reads for all
data.

>
> But I also thought the purpose of READ_PLUS was to help clients
> preserve unallocated extents in files during copy operations.
> An unallocated extent is not the same as an allocated extent
> that has zeroes written into it. IIUC this new logic does not
> distinguish between those two cases at all. (And please correct
> me if this is really not the goal of READ_PLUS).

I wasn't aware of this as a goal of READ_PLUS. As of right now, Linux
doesn't really have a way to punch holes into pagecache data, so we
end up needing to zero-fill on the client side during decoding.

>
> I would like to retain precise detection of unallocated extents
> in files. Maybe SEEK_HOLE/SEEK_DATA is not how to do that, but
> my feeling is that checking for zero bytes is definitely not
> the way to do it.

Ok.

>
>
> >> I've cc'd linux-fsdevel to see if we can get some more ideas going
> >> and move this forward.
> >>
> >> Another thought I had was to support returning only one or two
> >> segments per reply. One CONTENT segment, one HOLE segment, or one
> >> of each. Would that be enough to prevent the issues around file
> >> updates racing with the construction of the reply?
> >
> > Before I can make any sort of useful suggestion, I need to have it
> > explained to me why we care if the underlying file mapping has
> > changed between the read of the data and the SEEK_HOLE trim check,
> > because it's not at all clear to me what problem this change is
> > actually solving.
>
> The cover letter for this series calls out generic/091 and
> generic/263 -- it mentioned both are failing with NFSv4.2. I've
> tried to reproduce failures here, but both passed.

Did you check that CONFIG_NFS_V4_2_READ_PLUS=y is set on the client?
We had it disabled due to these tests, and because of the very, very
long time servers exporting btrfs take to read already-cached data
(see the btrfs sections in the wiki page linked in the cover letter).
There is also this bugzilla documenting the problem:
https://bugzilla.kernel.org/show_bug.cgi?id=215673

>
> Anna, can you provide a root-cause analysis of what is failing
> in your testing? Maybe a reproducer for us to kick around?

The only reproducers I have are the xfstests mentioned. They fail
pretty reliably on my setup (Linux nfsd exporting xfs).

Anna

> I'm guessing you might be encountering races because your
> usual test environment is virtual, so we need to understand
> how timing affects the results.
>
>
> --
> Chuck Lever
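P.S. For anyone who wants to see the distinction Chuck is drawing
between an unallocated extent and an allocated extent that just has
zeroes written into it, below is a quick userspace sketch -- not nfsd
code, and the file name, sizes, and offsets are made up for
illustration. It walks a file with lseek(SEEK_DATA)/lseek(SEEK_HOLE):
the block of zeroes that was explicitly written comes back as DATA,
while the range that was never written comes back as HOLE (on
filesystems that report holes at all). A byte-scanning check like
memchr_inv() can't tell those two cases apart, which is the
information loss being discussed above.

/*
 * Quick demo, not nfsd code: walk a file with SEEK_DATA/SEEK_HOLE and
 * print the segments.  The file name and offsets are arbitrary; what
 * gets reported as a hole depends on the underlying filesystem.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        int fd = open("sparse-demo.bin", O_CREAT | O_TRUNC | O_RDWR, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Error checking of the writes is omitted for brevity. */
        memset(buf, 'A', sizeof(buf));
        pwrite(fd, buf, sizeof(buf), 0);        /* real data */

        memset(buf, 0, sizeof(buf));
        pwrite(fd, buf, sizeof(buf), 4096);     /* zeroes, but written/allocated */

        pwrite(fd, "Z", 1, 196608);             /* leaves 8K-192K unwritten: a hole */

        fsync(fd);                              /* settle the extent map first */

        off_t end = lseek(fd, 0, SEEK_END);
        off_t pos = 0;

        while (pos < end) {
                off_t data = lseek(fd, pos, SEEK_DATA);
                if (data < 0)
                        break;                  /* nothing but holes remain */
                off_t hole = lseek(fd, data, SEEK_HOLE);

                if (data > pos)
                        printf("HOLE %10lld - %10lld\n",
                               (long long)pos, (long long)data);
                printf("DATA %10lld - %10lld\n",
                       (long long)data, (long long)hole);
                pos = hole;
        }

        close(fd);
        return 0;
}

On a filesystem that doesn't report holes, everything comes back as a
single DATA segment, so this is only ever best-effort -- but at least
it preserves the allocated-vs-unallocated distinction that a zero scan
throws away.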