[PATCH] xfs: redirty eof folio on truncate to avoid filemap flush

Brian Foster <bfoster@xxxxxxxxxx> · Fri, 28 Oct 2022 09:11:09 -0400

Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
---

Here's a quick prototype of "option 3" described in my previous mail.
This has been spot tested and confirmed to prevent the original stale
data exposure problem. More thorough regression testing is still
required. Barring unforeseen issues with that, however, I think this is
tentatively my new preferred option. The primary reason for that is it
avoids looking at extent state and is more in line with what iomap based
zeroing should be doing more generically.

Because of that, I think this provides a bit more opportunity for follow
on fixes (there are other truncate/zeroing problems I've come across
during this investigation that still need fixing), cleanup and
consolidation of the zeroing code. For example, I think the trajectory
of this could look something like:

- Genericize a bit more to handle all truncates.
- Repurpose iomap_truncate_page() (currently only used by XFS) into a
  unique implementation from zero range that does explicit zeroing
  instead of relying on pagecache truncate.
- Refactor XFS ranged zeroing to an abstraction that uses a combination
  of iomap_zero_range() and the new iomap_truncate_page().

>From there we'd hopefully have predictable and functionally correct
zeroing in the filesystem. The next step would probably be to see if/how
the truncate page and zero range implementations could combine into a
single zero range implementation. I have vague thoughts on that, but at
this stage I'm not going too deep into how that should look without some
sort of functional implementation to base it on.

Brian

 fs/xfs/xfs_iops.c | 49 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..1679fafaec6f 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -784,11 +784,13 @@ xfs_setattr_size(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
+	struct folio		*eof_folio = NULL;
 	xfs_off_t		oldsize, newsize;
 	struct xfs_trans	*tp;
 	int			error;
 	uint			lock_flags = 0;
 	bool			did_zeroing = false;
+	bool			eof_dirty;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
@@ -841,20 +843,40 @@ xfs_setattr_size(
 				&did_zeroing);
 	} else {
 		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
+		 * iomap won't detect a dirty folio over an unwritten block (or
+		 * a cow block over a hole) and subsequently skips zeroing the
+		 * newly post-EOF portion of the folio. Doing a flush here (i.e.
+		 * as is done for fallocate ZERO_RANGE) updates extent state for
+		 * iomap, but has too much overhead for the truncate path.
+		 *
+		 * Instead, check whether the new EOF is dirty in pagecache. If
+		 * so, hold a reference across the pagecache truncate and dirty
+		 * the folio. This ensures that partial folio zeroing from the
+		 * truncate makes it to disk in the rare event that iomap skips
+		 * zeroing and writeback happens to complete before the
+		 * pagecache truncate. Note that this really should be handled
+		 * properly by iomap zero range.
 		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
+		eof_folio = filemap_lock_folio(inode->i_mapping,
+					(newsize - 1) >> PAGE_SHIFT);
+		if (eof_folio) {
+			if (folio_test_dirty(eof_folio) ||
+			    folio_test_writeback(eof_folio))
+				eof_dirty = true;
+			folio_unlock(eof_folio);
+			if (!eof_dirty) {
+				folio_put(eof_folio);
+				eof_folio = NULL;
+			}
+		}
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-	if (error)
+	if (error) {
+		if (eof_folio)
+			folio_put(eof_folio);
 		return error;
+	}
 
 	/*
 	 * We've already locked out new page faults, so now we can safely remove
@@ -878,6 +900,15 @@ xfs_setattr_size(
 	 * guaranteed not to write stale data past the new EOF on truncate down.
 	 */
 	truncate_setsize(inode, newsize);
+	if (eof_folio) {
+		trace_printk("%d: ino 0x%llx newsize 0x%llx folio idx 0x%lx did_zeroing %d\n",
+			__LINE__, ip->i_ino, newsize, folio_index(eof_folio), did_zeroing);
+		if (!did_zeroing) {
+			filemap_dirty_folio(inode->i_mapping, eof_folio);
+			did_zeroing = true;
+		}
+		folio_put(eof_folio);
+	}
 
 	/*
 	 * We are going to log the inode size change in this transaction so
-- 
2.37.3