Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 21 May 2024 20:00:20 -0700

On Tue, May 21, 2024 at 12:38:30PM +1000, Dave Chinner wrote:
> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > 
> > When truncating a realtime file unaligned to a shorter size,
> > xfs_setattr_size() only flush the EOF page before zeroing out, and
> > xfs_truncate_page() also only zeros the EOF block. This could expose
> > stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> > a write operation").
> > 
> > If the sb_rextsize is bigger than one block, and we have a realtime
> > inode that contains a long enough written extent. If we unaligned
> > truncate into the middle of this extent, xfs_itruncate_extents() could
> > split the extent and align the it's tail to sb_rextsize, there maybe
> > have more than one blocks more between the end of the file. Since
> > xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> > value, so it may leftover some blocks contains stale data that could be
> > exposed if we append write it over a long enough distance later.
> > 
> > xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> > and make sure the entire zeroed range have been flushed to disk before
> > updating the inode size.
> > 
> > Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> > Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
> > Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx
> > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
> >  fs/xfs/xfs_iops.c  | 10 ----------
> >  2 files changed, 31 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 4958cc3337bc..fc379450fe74 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1466,12 +1466,39 @@ xfs_truncate_page(
> >  	loff_t			pos,
> >  	bool			*did_zero)
> >  {
> > +	struct xfs_mount	*mp = ip->i_mount;
> >  	struct inode		*inode = VFS_I(ip);
> >  	unsigned int		blocksize = i_blocksize(inode);
> > +	int			error;
> > +
> > +	if (XFS_IS_REALTIME_INODE(ip))
> > +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
> > +
> > +	/*
> > +	 * iomap won't detect a dirty page over an unwritten block (or a
> > +	 * cow block over a hole) and subsequently skips zeroing the
> > +	 * newly post-EOF portion of the page. Flush the new EOF to
> > +	 * convert the block before the pagecache truncate.
> > +	 */
> > +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> > +					     roundup_64(pos, blocksize));
> > +	if (error)
> > +		return error;
> >  
> >  	if (IS_DAX(inode))
> > -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> > -					&xfs_dax_write_iomap_ops);
> > -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> > -				   &xfs_buffered_write_iomap_ops);
> > +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> > +					  &xfs_dax_write_iomap_ops);
> > +	else
> > +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> > +					    &xfs_buffered_write_iomap_ops);
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Write back path won't write dirty blocks post EOF folio,
> > +	 * flush the entire zeroed range before updating the inode
> > +	 * size.
> > +	 */
> > +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> > +					    roundup_64(pos, blocksize));
> >  }
> 
> Ok, this means we do -three- blocking writebacks through this path
> instead of one or maybe two.
> 
> We already know that this existing blocking writeback case for dirty
> pages over unwritten extents is a significant performance issue for
> some workloads. I have a fix in progress for iomap to handle this
> case without requiring blocking writeback to be done to convert the
> extent to written before we do the truncate.
> 
> Regardless, I think this whole "truncate is allocation unit size
> aware" algorithm is largely unworkable without a rewrite. What XFS
> needs to do on truncate *down* before we start the truncate
> transaction is pretty simple:
> 
> 	- ensure that the new EOF extent tail contains zeroes
> 	- ensure that the range from the existing ip->i_disk_size to
> 	  the new EOF is on disk so data vs metadata ordering is
> 	  correct for crash recovery purposes.
> 
> What this patch does to acheive that is:
> 
> 	1. blocking writeback to clean dirty unwritten/cow blocks at
> 	the new EOF.
> 	2. iomap_truncate_page() writes zeroes into the page cache,
> 	which dirties the pages we just cleaned at the new EOF.
> 	3. blocking writeback to clean the dirty blocks at the new
> 	EOF.
> 	4. truncate_setsize() then writes zeros to partial folios at
> 	the new EOF, dirtying the EOF page again.
> 	5. blocking writeback to clean dirty blocks from the current
> 	on-disk size to the new EOF.
> 
> This is pretty crazy when you stop and think about it. We're writing
> the same EOF block -three- times. The first data write gets
> overwritten by zeroes on the second write, and the third write
> writes the same zeroes as the second write. There are two redundant
> *blocking* writes in this process.
> 
> We can do all this with a single writeback operation if we are a
> little bit smarter about the order of operations we perform and we
> are a little bit smarter in iomap about zeroing dirty pages in the
> page cache:
> 
> 	1. change iomap_zero_range() to do the right thing with
> 	dirty unwritten and cow extents (the patch I've been working
> 	on).
> 
> 	2. pass the range to be zeroed into iomap_truncate_page()
> 	(the fundamental change being made here).
> 
> 	3. zero the required range *through the page cache*
> 	(iomap_zero_range() already does this).
> 
> 	4. write back the XFS inode from ip->i_disk_size to the end
> 	of the range zeroed by iomap_truncate_page()
> 	(xfs_setattr_size() already does this).
> 
> 	5. i_size_write(newsize);
> 
> 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
> 	the page cache beyond the new EOF without doing any zeroing
> 	as we've already done all the zeroing needed to the page
> 	cache through iomap_truncate_page().
> 
> 
> The patch I'm working on for step 1 is below. It still needs to be
> extended to handle the cow case, but I'm unclear on how to exercise
> that case so I haven't written the code to do it. The rest of it is
> just rearranging the code that we already use just to get the order
> of operations right. The only notable change in behaviour is using
> invalidate_inode_pages2_range() instead of truncate_pagecache(),
> because we don't want the EOF page to be dirtied again once we've
> already written zeroes to disk....
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> 
> [RFC] iomap: zeroing needs to be pagecache aware
> 
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.
> 
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.

Hmm.  If memory serves, we probably need to adapt the
xfs_buffered/direct_write_iomap_begin functions to return the hole in
srcmap and the cow mapping in the iomap.  RN I think it just returns the
hole.

--D

> Before:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m14.103s
> user    0m0.015s
> sys     0m0.020s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  85.90    0.847616          16     50000           ftruncate
>  14.01    0.138229           2     50000           pwrite64
> ....
> 
> After:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m0.144s
> user    0m0.021s
> sys     0m0.012s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.86    0.505964          10     50000           ftruncate
>  46.12    0.433251           8     50000           pwrite64
> ....
> 
> Yup, we get back all the performance.
> 
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
> 
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@xxxxxxxxxx/
> 
> With this command:
> 
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
> 
> Before:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec
> 
> After:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)
> 
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
> 
> This has passed through most of fstests on a couple of test VMs
> without issues at the moment, so I think this approach to fixing the
> issue is going to be solid once we've worked out how to detect the
> COW-hole mapping case.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_iops.c      | 12 +-----------
>  2 files changed, 41 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4e8e41c8b3c0..6877474de0c9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -583,11 +583,23 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>   *
>   * Returns a locked reference to the folio at @pos, or an error pointer if the
>   * folio could not be obtained.
> + *
> + * Note: when zeroing unwritten extents, we might have data in the page cache
> + * over an unwritten extent. In this case, we want to do a pure lookup on the
> + * page cache and not create a new folio as we don't need to perform zeroing on
> + * unwritten extents if there is no cached data over the given range.
>   */
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  
> +	if (iter->flags & IOMAP_ZERO) {
> +		const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +
> +		if (srcmap->type == IOMAP_UNWRITTEN)
> +			fgp &= ~FGP_CREAT;
> +	}
> +
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
>  	fgp |= fgf_set_order(len);
> @@ -1375,7 +1387,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	loff_t written = 0;
>  
>  	/* already zeroed?  we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	if (srcmap->type == IOMAP_HOLE)
>  		return length;
>  
>  	do {
> @@ -1385,8 +1397,22 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		size_t bytes = min_t(u64, SIZE_MAX, length);
>  
>  		status = iomap_write_begin(iter, pos, bytes, &folio);
> -		if (status)
> +		if (status) {
> +			if (status == -ENOENT) {
> +				/*
> +				 * Unwritten extents need to have page cache
> +				 * lookups done to determine if they have data
> +				 * over them that needs zeroing. If there is no
> +				 * data, we'll get -ENOENT returned here, so we
> +				 * can just skip over this index.
> +				 */
> +				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> +				if (bytes > PAGE_SIZE - offset_in_page(pos))
> +					bytes = PAGE_SIZE - offset_in_page(pos);
> +				goto loop_continue;
> +			}
>  			return status;
> +		}
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> @@ -1394,6 +1420,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (bytes > folio_size(folio) - offset)
>  			bytes = folio_size(folio) - offset;
>  
> +		/*
> +		 * If the folio over an unwritten extent is clean (i.e. because
> +		 * it has been read from), then it already contains zeros. Hence
> +		 * we can just skip it.
> +		 */
> +		if (srcmap->type == IOMAP_UNWRITTEN &&
> +		    !folio_test_dirty(folio)) {
> +			folio_unlock(folio);
> +			goto loop_continue;
> +		}
> +
>  		folio_zero_range(folio, offset, bytes);
>  		folio_mark_accessed(folio);
>  
> @@ -1401,6 +1438,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (WARN_ON_ONCE(bytes == 0))
>  			return -EIO;
>  
> +loop_continue:
>  		pos += bytes;
>  		length -= bytes;
>  		written += bytes;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 8a145ca7d380..e8c9f3018c80 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -838,17 +838,7 @@ xfs_setattr_size(
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
> -	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -						     newsize);
> -		if (error)
> -			return error;
> +	} else if (newsize != oldsize) {
>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
>