> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>
> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>>>
>>> But I can sit here and try to guess. I notice fs/nfsd checks
>>> file->f_op->splice_read, and employs a fallback if not available:
>>> if you have time, please try rerunning those xfstests on an -rc1
>>> kernel, but with mm/shmem.c's .splice_read line commented out.
>>> My guess is that will then pass the tests, and we shall know more.
>>
>> This seemed like the most probative next step, so I commented
>> out the .splice_read call-out in mm/shmem.c and ran the tests
>> again. Yes, that change enables the fsx-related tests to pass
>> as expected.
>
> Great, thank you for trying that.
>
>>
>>> What could be going wrong there? I've thought of two possibilities.
>>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
>>> trouble with seeing the same page twice in a row: since tmpfs is
>>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
>>> caught sight of code which looks to see if the latest page is the
>>> same as the one before. It's easy to imagine that might go wrong.
>>
>> Are you referring to this function in fs/nfsd/vfs.c?
>
> I think that was it, didn't pay much attention.
>
>>
>> static int
>> nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> 		  struct splice_desc *sd)
>> {
>> 	struct svc_rqst *rqstp = sd->u.data;
>> 	struct page **pp = rqstp->rq_next_page;
>> 	struct page *page = buf->page;
>>
>> 	if (rqstp->rq_res.page_len == 0) {
>> 		svc_rqst_replace_page(rqstp, page);
>> 		rqstp->rq_res.page_base = buf->offset;
>> 	} else if (page != pp[-1]) {
>> 		svc_rqst_replace_page(rqstp, page);
>> 	}
>> 	rqstp->rq_res.page_len += sd->len;
>>
>> 	return sd->len;
>> }
>>
>> rq_next_page should point to the first unused element of
>> rqstp->rq_pages, so IIUC that check is looking for the
>> final page that is part of the READ payload.
>>
>> But that does suggest that if page -> ZERO_PAGE and so does
>> pp[-1], then svc_rqst_replace_page() would not be invoked.

To put a little more color on this, I think the idea here is to
prevent releasing the same page twice. It might be possible for
NFSD to add the same page to the rq_pages array more than once,
and we don't want to do a double put_page(). The only case I can
think of where this might happen is if the READ payload is
partially contained in the page that holds the NFS header. I'm
not sure that can ever happen these days.

> I still haven't studied the logic there: Mark's input made it clear
> that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> there could be expectations of uniqueness in other places too.

So far I haven't seen an issue with skb_can_coalesce(). I will
keep an eye out for that.

>>> A more difficult issue would be, if fsx is racing writes and reads,
>>> in a way that it can guarantee the correct result, but that correct
>>> result is no longer delivered: because the writes go into freshly
>>> allocated tmpfs cache pages, while reads are still delivering
>>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
>>>
>>> But unless someone has time to help out, we're heading for a revert.
>
> We might be able to avoid that revert, and go the whole way to using
> iov_iter_zero() instead. But the significant slowness of clear_user()
> relative to copy to user, on x86 at least, does ask for a hybrid.
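Out of curiosity, here is a rough, untested userspace sketch of how
one might compare those two paths (my own illustration, not part of
the patch: it assumes tmpfs is mounted at /dev/shm and picks an
arbitrary test filename). Reading /dev/zero goes through
iov_iter_zero(), hence clear_user() for a user buffer, while reading
a hole in a tmpfs file copies ZERO_PAGE(0) with copy_page_to_iter():

/*
 * Compare the clear_user() path (read of /dev/zero) against the
 * copy-the-zero-page path (read of a hole in a tmpfs file).
 * Assumes a tmpfs mount at /dev/shm; the filename is arbitrary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define TOTAL	(1UL << 30)	/* bytes to read from each source */
#define BUFSZ	(1UL << 16)

static double time_reads(const char *path)
{
	struct timespec t0, t1;
	char *buf = malloc(BUFSZ);
	size_t done = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0 || !buf) {
		perror(path);
		exit(1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (done < TOTAL) {
		ssize_t n = read(fd, buf, BUFSZ);

		if (n <= 0)
			break;
		done += n;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);
	free(buf);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	const char *hole = "/dev/shm/zeropage-test";
	int fd = open(hole, O_CREAT | O_TRUNC | O_WRONLY, 0600);

	/* All hole, no data: tmpfs reads it back from the zero page. */
	if (fd < 0 || ftruncate(fd, TOTAL) < 0) {
		perror(hole);
		return 1;
	}
	close(fd);

	printf("/dev/zero  (iov_iter_zero):     %.2fs\n",
	       time_reads("/dev/zero"));
	printf("tmpfs hole (copy_page_to_iter): %.2fs\n",
	       time_reads(hole));
	unlink(hole);
	return 0;
}

If clear_user() is the laggard Hugh describes, the /dev/zero pass
should come out measurably slower on x86. That asymmetry is exactly
what the hybrid below is aiming at: copy the zero page where it is
fast and safe (ordinary read(2) into user buffers), and fall back
to iov_iter_zero() for splice and friends.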
>
> Suggested patch below, on top of 5.18-rc1, passes my own testing:
> but will it pass yours? It seems to me safe, and as fast as before,
> but we don't know yet if this iov_iter_zero() works right for you.
> Chuck, please give it a go and let us know.

Applied to stock v5.18-rc1. The tests pass as expected.

> (Don't forget to restore mm/shmem.c's .splice_read first! And if
> this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> in the same patch, fixing the other regression, without recourse to
> #ifdefs or arch mods.)
>
> Thanks!
> Hugh
>
> --- 5.18-rc1/mm/shmem.c
> +++ linux/mm/shmem.c
> @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  		pgoff_t end_index;
>  		unsigned long nr, ret;
>  		loff_t i_size = i_size_read(inode);
> -		bool got_page;
>
>  		end_index = i_size >> PAGE_SHIFT;
>  		if (index > end_index)
> @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  			 */
>  			if (!offset)
>  				mark_page_accessed(page);
> -			got_page = true;
> +			/*
> +			 * Ok, we have the page, and it's up-to-date, so
> +			 * now we can copy it to user space...
> +			 */
> +			ret = copy_page_to_iter(page, offset, nr, to);
> +			put_page(page);
> +
> +		} else if (iter_is_iovec(to)) {
> +			/*
> +			 * Copy to user tends to be so well optimized, but
> +			 * clear_user() not so much, that it is noticeably
> +			 * faster to copy the zero page instead of clearing.
> +			 */
> +			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
>  		} else {
> -			page = ZERO_PAGE(0);
> -			got_page = false;
> +			/*
> +			 * But submitting the same page twice in a row to
> +			 * splice() - or others? - can result in confusion:
> +			 * so don't attempt that optimization on pipes etc.
> +			 */
> +			ret = iov_iter_zero(nr, to);
>  		}
>
> -		/*
> -		 * Ok, we have the page, and it's up-to-date, so
> -		 * now we can copy it to user space...
> -		 */
> -		ret = copy_page_to_iter(page, offset, nr, to);
>  		retval += ret;
>  		offset += ret;
>  		index += offset >> PAGE_SHIFT;
>  		offset &= ~PAGE_MASK;
>
> -		if (got_page)
> -			put_page(page);
>  		if (!iov_iter_count(to))
>  			break;
>  		if (ret < nr) {

--
Chuck Lever