Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> --- Here's a quick prototype of "option 3" described in my previous mail. This has been spot tested and confirmed to prevent the original stale data exposure problem. More thorough regression testing is still required. Barring unforeseen issues with that, however, I think this is tentatively my new preferred option. The primary reason for that is it avoids looking at extent state and is more in line with what iomap based zeroing should be doing more generically. Because of that, I think this provides a bit more opportunity for follow on fixes (there are other truncate/zeroing problems I've come across during this investigation that still need fixing), cleanup and consolidation of the zeroing code. For example, I think the trajectory of this could look something like: - Genericize a bit more to handle all truncates. - Repurpose iomap_truncate_page() (currently only used by XFS) into a unique implementation from zero range that does explicit zeroing instead of relying on pagecache truncate. - Refactor XFS ranged zeroing to an abstraction that uses a combination of iomap_zero_range() and the new iomap_truncate_page(). >From there we'd hopefully have predictable and functionally correct zeroing in the filesystem. The next step would probably be to see if/how the truncate page and zero range implementations could combine into a single zero range implementation. I have vague thoughts on that, but at this stage I'm not going too deep into how that should look without some sort of functional implementation to base it on. Brian fs/xfs/xfs_iops.c | 49 ++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 40 insertions(+), 9 deletions(-) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 2e10e1c66ad6..1679fafaec6f 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -784,11 +784,13 @@ xfs_setattr_size( { struct xfs_mount *mp = ip->i_mount; struct inode *inode = VFS_I(ip); + struct folio *eof_folio = NULL; xfs_off_t oldsize, newsize; struct xfs_trans *tp; int error; uint lock_flags = 0; bool did_zeroing = false; + bool eof_dirty; ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL)); @@ -841,20 +843,40 @@ xfs_setattr_size( &did_zeroing); } else { /* - * iomap won't detect a dirty page over an unwritten block (or a - * cow block over a hole) and subsequently skips zeroing the - * newly post-EOF portion of the page. Flush the new EOF to - * convert the block before the pagecache truncate. + * iomap won't detect a dirty folio over an unwritten block (or + * a cow block over a hole) and subsequently skips zeroing the + * newly post-EOF portion of the folio. Doing a flush here (i.e. + * as is done for fallocate ZERO_RANGE) updates extent state for + * iomap, but has too much overhead for the truncate path. + * + * Instead, check whether the new EOF is dirty in pagecache. If + * so, hold a reference across the pagecache truncate and dirty + * the folio. This ensures that partial folio zeroing from the + * truncate makes it to disk in the rare event that iomap skips + * zeroing and writeback happens to complete before the + * pagecache truncate. Note that this really should be handled + * properly by iomap zero range. */ - error = filemap_write_and_wait_range(inode->i_mapping, newsize, - newsize); - if (error) - return error; + eof_folio = filemap_lock_folio(inode->i_mapping, + (newsize - 1) >> PAGE_SHIFT); + if (eof_folio) { + if (folio_test_dirty(eof_folio) || + folio_test_writeback(eof_folio)) + eof_dirty = true; + folio_unlock(eof_folio); + if (!eof_dirty) { + folio_put(eof_folio); + eof_folio = NULL; + } + } error = xfs_truncate_page(ip, newsize, &did_zeroing); } - if (error) + if (error) { + if (eof_folio) + folio_put(eof_folio); return error; + } /* * We've already locked out new page faults, so now we can safely remove @@ -878,6 +900,15 @@ xfs_setattr_size( * guaranteed not to write stale data past the new EOF on truncate down. */ truncate_setsize(inode, newsize); + if (eof_folio) { + trace_printk("%d: ino 0x%llx newsize 0x%llx folio idx 0x%lx did_zeroing %d\n", + __LINE__, ip->i_ino, newsize, folio_index(eof_folio), did_zeroing); + if (!did_zeroing) { + filemap_dirty_folio(inode->i_mapping, eof_folio); + did_zeroing = true; + } + folio_put(eof_folio); + } /* * We are going to log the inode size change in this transaction so -- 2.37.3