> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> 
> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>>> 
>>> But I can sit here and try to guess. I notice fs/nfsd checks
>>> file->f_op->splice_read, and employs fallback if not available:
>>> if you have time, please try rerunning those xfstests on an -rc1
>>> kernel, but with mm/shmem.c's .splice_read line commented out.
>>> My guess is that will then pass the tests, and we shall know more.
>> 
>> This seemed like the most probative next step, so I commented out
>> the .splice_read call-out in mm/shmem.c and ran the tests again.
>> Yes, that change enables the fsx-related tests to pass as expected.
> 
> Great, thank you for trying that.
> 
>> 
>>> What could be going wrong there? I've thought of two possibilities.
>>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
>>> trouble with seeing the same page twice in a row: since tmpfs is
>>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
>>> caught sight of code which looks to see if the latest page is the
>>> same as the one before. It's easy to imagine that might go wrong.
>> 
>> Are you referring to this function in fs/nfsd/vfs.c ?
> 
> I think that was it, didn't pay much attention.

This code seems to have been the issue. I added a little test to see
if @page pointed to ZERO_PAGE(0), and now the tests pass as expected.

>> 847 static int
>> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> 849                   struct splice_desc *sd)
>> 850 {
>> 851         struct svc_rqst *rqstp = sd->u.data;
>> 852         struct page **pp = rqstp->rq_next_page;
>> 853         struct page *page = buf->page;
>> 854 
>> 855         if (rqstp->rq_res.page_len == 0) {
>> 856                 svc_rqst_replace_page(rqstp, page);
>> 857                 rqstp->rq_res.page_base = buf->offset;
>> 858         } else if (page != pp[-1]) {
>> 859                 svc_rqst_replace_page(rqstp, page);
>> 860         }
>> 861         rqstp->rq_res.page_len += sd->len;
>> 862 
>> 863         return sd->len;
>> 864 }
>> 
>> rq_next_page should point to the first unused element of
>> rqstp->rq_pages, so IIUC that check is looking for the
>> final page that is part of the READ payload.
>> 
>> But that does suggest that if page -> ZERO_PAGE and so does
>> pp[-1], then svc_rqst_replace_page() would not be invoked.
> 
> I still haven't studied the logic there: Mark's input made it clear
> that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> there could be expectations of uniqueness in other places too.

I can't really speak to Mark's comment, but after studying
nfsd_splice_actor() I can't see any reason except cleverness and
technical debt for this particular check. I'm now testing a patch
that removes the check and simplifies this function -- it seems to
be a reasonable clean-up whether you keep 56a8c8eb1eaf or choose to
revert it. (A rough sketch is inline further down.)

>>> A more difficult issue would be, if fsx is racing writes and reads,
>>> in a way that it can guarantee the correct result, but that correct
>>> result is no longer delivered: because the writes go into freshly
>>> allocated tmpfs cache pages, while reads are still delivering
>>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
>>> 
>>> But unless someone has time to help out, we're heading for a revert.
> 
> We might be able to avoid that revert, and go the whole way to using
> iov_iter_zero() instead. But the significant slowness of clear_user()
> relative to copy to user, on x86 at least, does ask for a hybrid.
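Coming back to the nfsd_splice_actor() clean-up I mentioned above,
here is roughly the shape of what I'm testing -- a sketch only, not
the final patch. It drops the pp[-1] de-duplication entirely and
calls svc_rqst_replace_page() for every page the pipe hands us, so a
repeated ZERO_PAGE can no longer be silently coalesced:

static int
nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
                  struct splice_desc *sd)
{
        struct svc_rqst *rqstp = sd->u.data;

        /*
         * Consume one rq_pages slot per pipe buffer, even when the
         * same page (e.g. ZERO_PAGE(0)) shows up twice in a row.
         */
        svc_rqst_replace_page(rqstp, buf->page);

        /* page_len == 0 means this is the first page of the payload. */
        if (rqstp->rq_res.page_len == 0)
                rqstp->rq_res.page_base = buf->offset;
        rqstp->rq_res.page_len += sd->len;
        return sd->len;
}

If the pipe never actually hands us the same page twice in a row,
this should behave just as before, only with less to reason about.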
> 
> Suggested patch below, on top of 5.18-rc1, passes my own testing:
> but will it pass yours? It seems to me safe, and as fast as before,
> but we don't know yet if this iov_iter_zero() works right for you.
> Chuck, please give it a go and let us know.
> 
> (Don't forget to restore mm/shmem.c's .splice_read first! And if
> this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> in the same patch, fixing the other regression, without recourse to
> #ifdefs or arch mods.)

Sure, I will try this out first thing tomorrow. One thing that occurs
to me is that for NFS/RDMA, having a page full of zeroes that is
already DMA-mapped would be a nice optimization on the sender side
(on the client for an NFS WRITE and on the server for an NFS READ).
The transport would have to set up a scatter-gather list containing
a bunch of entries that reference the same page...

</musing>

> Thanks!
> Hugh
> 
> --- 5.18-rc1/mm/shmem.c
> +++ linux/mm/shmem.c
> @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>                  pgoff_t end_index;
>                  unsigned long nr, ret;
>                  loff_t i_size = i_size_read(inode);
> -                bool got_page;
>  
>                  end_index = i_size >> PAGE_SHIFT;
>                  if (index > end_index)
> @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>                           */
>                          if (!offset)
>                                  mark_page_accessed(page);
> -                        got_page = true;
> +                        /*
> +                         * Ok, we have the page, and it's up-to-date, so
> +                         * now we can copy it to user space...
> +                         */
> +                        ret = copy_page_to_iter(page, offset, nr, to);
> +                        put_page(page);
> +
> +                } else if (iter_is_iovec(to)) {
> +                        /*
> +                         * Copy to user tends to be so well optimized, but
> +                         * clear_user() not so much, that it is noticeably
> +                         * faster to copy the zero page instead of clearing.
> +                         */
> +                        ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
>                  } else {
> -                        page = ZERO_PAGE(0);
> -                        got_page = false;
> +                        /*
> +                         * But submitting the same page twice in a row to
> +                         * splice() - or others? - can result in confusion:
> +                         * so don't attempt that optimization on pipes etc.
> +                         */
> +                        ret = iov_iter_zero(nr, to);
>                  }
>  
> -                /*
> -                 * Ok, we have the page, and it's up-to-date, so
> -                 * now we can copy it to user space...
> -                 */
> -                ret = copy_page_to_iter(page, offset, nr, to);
>                  retval += ret;
>                  offset += ret;
>                  index += offset >> PAGE_SHIFT;
>                  offset &= ~PAGE_MASK;
>  
> -                if (got_page)
> -                        put_page(page);
>                  if (!iov_iter_count(to))
>                          break;
>                  if (ret < nr) {

--
Chuck Lever