On 2024/5/21 10:38, Dave Chinner wrote:
> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> When truncating a realtime file unaligned to a shorter size,
>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>> xfs_truncate_page() also only zeros the EOF block. This could expose
>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>> a write operation").
>>
>> If the sb_rextsize is bigger than one block, and we have a realtime
>> inode that contains a long enough written extent. If we unaligned
>> truncate into the middle of this extent, xfs_itruncate_extents() could
>> split the extent and align the it's tail to sb_rextsize, there maybe
>> have more than one blocks more between the end of the file. Since
>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>> value, so it may leftover some blocks contains stale data that could be
>> exposed if we append write it over a long enough distance later.
>>
>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>> and make sure the entire zeroed range have been flushed to disk before
>> updating the inode size.
>>
>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>> Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> ---
>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>  fs/xfs/xfs_iops.c | 10 ----------
>>  2 files changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 4958cc3337bc..fc379450fe74 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>  	loff_t pos,
>>  	bool *did_zero)
>>  {
>> +	struct xfs_mount *mp = ip->i_mount;
>>  	struct inode *inode = VFS_I(ip);
>>  	unsigned int blocksize = i_blocksize(inode);
>> +	int error;
>> +
>> +	if (XFS_IS_REALTIME_INODE(ip))
>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>> +
>> +	/*
>> +	 * iomap won't detect a dirty page over an unwritten block (or a
>> +	 * cow block over a hole) and subsequently skips zeroing the
>> +	 * newly post-EOF portion of the page. Flush the new EOF to
>> +	 * convert the block before the pagecache truncate.
>> +	 */
>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
>> +			roundup_64(pos, blocksize));
>> +	if (error)
>> +		return error;
>>
>>  	if (IS_DAX(inode))
>> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
>> -				&xfs_dax_write_iomap_ops);
>> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
>> -			&xfs_buffered_write_iomap_ops);
>> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
>> +				&xfs_dax_write_iomap_ops);
>> +	else
>> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
>> +				&xfs_buffered_write_iomap_ops);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Write back path won't write dirty blocks post EOF folio,
>> +	 * flush the entire zeroed range before updating the inode
>> +	 * size.
>> +	 */
>> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
>> +			roundup_64(pos, blocksize));
>>  }
>
> Ok, this means we do -three- blocking writebacks through this path
> instead of one or maybe two.
>
> We already know that this existing blocking writeback case for dirty
> pages over unwritten extents is a significant performance issue for
> some workloads. I have a fix in progress for iomap to handle this
> case without requiring blocking writeback to be done to convert the
> extent to written before we do the truncate.
>
> Regardless, I think this whole "truncate is allocation unit size
> aware" algorithm is largely unworkable without a rewrite. What XFS
> needs to do on truncate *down* before we start the truncate
> transaction is pretty simple:
>
> 	- ensure that the new EOF extent tail contains zeroes
> 	- ensure that the range from the existing ip->i_disk_size to
> 	  the new EOF is on disk so data vs metadata ordering is
> 	  correct for crash recovery purposes.
>
> What this patch does to acheive that is:
>
> 	1. blocking writeback to clean dirty unwritten/cow blocks at
> 	the new EOF.
> 	2. iomap_truncate_page() writes zeroes into the page cache,
> 	which dirties the pages we just cleaned at the new EOF.
> 	3. blocking writeback to clean the dirty blocks at the new
> 	EOF.
> 	4. truncate_setsize() then writes zeros to partial folios at
> 	the new EOF, dirtying the EOF page again.
> 	5. blocking writeback to clean dirty blocks from the current
> 	on-disk size to the new EOF.
>
> This is pretty crazy when you stop and think about it. We're writing
> the same EOF block -three- times. The first data write gets
> overwritten by zeroes on the second write, and the third write
> writes the same zeroes as the second write. There are two redundant
> *blocking* writes in this process.

Yes, this is indeed a performance disaster, and iomap_zero_range()
should be aware of dirty pages. I ran into the same problem when
developing the buffered iomap conversion for ext4.
>
> We can do all this with a single writeback operation if we are a
> little bit smarter about the order of operations we perform and we
> are a little bit smarter in iomap about zeroing dirty pages in the
> page cache:
>
> 	1. change iomap_zero_range() to do the right thing with
> 	dirty unwritten and cow extents (the patch I've been working
> 	on).
>
> 	2. pass the range to be zeroed into iomap_truncate_page()
> 	(the fundamental change being made here).
>
> 	3. zero the required range *through the page cache*
> 	(iomap_zero_range() already does this).
>
> 	4. write back the XFS inode from ip->i_disk_size to the end
> 	of the range zeroed by iomap_truncate_page()
> 	(xfs_setattr_size() already does this).
>
> 	5. i_size_write(newsize);
>
> 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
> 	the page cache beyond the new EOF without doing any zeroing
> 	as we've already done all the zeroing needed to the page
> 	cache through iomap_truncate_page().
>
>
> The patch I'm working on for step 1 is below. It still needs to be
> extended to handle the cow case, but I'm unclear on how to exercise
> that case so I haven't written the code to do it. The rest of it is
> just rearranging the code that we already use just to get the order
> of operations right. The only notable change in behaviour is using
> invalidate_inode_pages2_range() instead of truncate_pagecache(),
> because we don't want the EOF page to be dirtied again once we've
> already written zeroes to disk....
>

Indeed, this sounds like the best solution. Since Darrick recommended
that we could fix the stale data exposure issue on realtime inodes by
converting the tail extent to unwritten, I suppose we could do this
after fixing the problem.

Thanks,
Yi.
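
For illustration only, the stale range described in the commit message
can be worked through with a small userspace sketch. It is not kernel
code: roundup_u64() is a stand-in for roundup_64(), and
unzeroed_tail_bytes() is a hypothetical helper. It assumes the realtime
extent tail ends at the next multiple of sb_rextsize blocks from file
offset zero, and that only the EOF block has been zeroed.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for roundup_64(): round x up to a multiple of y. */
static uint64_t roundup_u64(uint64_t x, uint64_t y)
{
	assert(y != 0);
	return ((x + y - 1) / y) * y;
}

/*
 * Hypothetical helper: bytes left unzeroed between the end of the EOF
 * block and the end of the rtextsize-aligned extent tail, when only
 * the EOF block is zeroed on truncate to pos.
 *
 * blocksize is the filesystem block size in bytes, rextsize_blocks is
 * the realtime extent size in blocks (sb_rextsize).
 */
static uint64_t unzeroed_tail_bytes(uint64_t pos, uint64_t blocksize,
				    uint64_t rextsize_blocks)
{
	uint64_t zeroed_end = roundup_u64(pos, blocksize);
	uint64_t extent_end = roundup_u64(pos, blocksize * rextsize_blocks);

	return extent_end - zeroed_end;
}
```

With an assumed 4096-byte block size and an rextsize of 4 blocks,
truncating to offset 5000 gives unzeroed_tail_bytes(5000, 4096, 4) ==
8192, i.e. two whole blocks past the zeroed EOF block that could still
hold stale data. With an rextsize of 1 block the result is 0, which is
the non-realtime case.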