On Tue, May 21, 2024 at 12:38:30PM +1000, Dave Chinner wrote: > On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote: > > From: Zhang Yi <yi.zhang@xxxxxxxxxx> > > > > When truncating a realtime file unaligned to a shorter size, > > xfs_setattr_size() only flush the EOF page before zeroing out, and > > xfs_truncate_page() also only zeros the EOF block. This could expose > > stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not > > a write operation"). > > > > If the sb_rextsize is bigger than one block, and we have a realtime > > inode that contains a long enough written extent. If we unaligned > > truncate into the middle of this extent, xfs_itruncate_extents() could > > split the extent and align the it's tail to sb_rextsize, there maybe > > have more than one blocks more between the end of the file. Since > > xfs_truncate_page() only zeros the trailing portion of the i_blocksize() > > value, so it may leftover some blocks contains stale data that could be > > exposed if we append write it over a long enough distance later. > > > > xfs_truncate_page() should flush, zeros out the entire rtextsize range, > > and make sure the entire zeroed range have been flushed to disk before > > updating the inode size. > > > > Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation") > > Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx> > > Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx > > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> > > --- > > fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++---- > > fs/xfs/xfs_iops.c | 10 ---------- > > 2 files changed, 31 insertions(+), 14 deletions(-) > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > index 4958cc3337bc..fc379450fe74 100644 > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c > > @@ -1466,12 +1466,39 @@ xfs_truncate_page( > > loff_t pos, > > bool *did_zero) > > { > > + struct xfs_mount *mp = ip->i_mount; > > struct inode *inode = VFS_I(ip); > > unsigned int blocksize = i_blocksize(inode); > > + int error; > > + > > + if (XFS_IS_REALTIME_INODE(ip)) > > + blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize); > > + > > + /* > > + * iomap won't detect a dirty page over an unwritten block (or a > > + * cow block over a hole) and subsequently skips zeroing the > > + * newly post-EOF portion of the page. Flush the new EOF to > > + * convert the block before the pagecache truncate. > > + */ > > + error = filemap_write_and_wait_range(inode->i_mapping, pos, > > + roundup_64(pos, blocksize)); > > + if (error) > > + return error; > > > > if (IS_DAX(inode)) > > - return dax_truncate_page(inode, pos, blocksize, did_zero, > > - &xfs_dax_write_iomap_ops); > > - return iomap_truncate_page(inode, pos, blocksize, did_zero, > > - &xfs_buffered_write_iomap_ops); > > + error = dax_truncate_page(inode, pos, blocksize, did_zero, > > + &xfs_dax_write_iomap_ops); > > + else > > + error = iomap_truncate_page(inode, pos, blocksize, did_zero, > > + &xfs_buffered_write_iomap_ops); > > + if (error) > > + return error; > > + > > + /* > > + * Write back path won't write dirty blocks post EOF folio, > > + * flush the entire zeroed range before updating the inode > > + * size. > > + */ > > + return filemap_write_and_wait_range(inode->i_mapping, pos, > > + roundup_64(pos, blocksize)); > > } > > Ok, this means we do -three- blocking writebacks through this path > instead of one or maybe two. > > We already know that this existing blocking writeback case for dirty > pages over unwritten extents is a significant performance issue for > some workloads. I have a fix in progress for iomap to handle this > case without requiring blocking writeback to be done to convert the > extent to written before we do the truncate. > > Regardless, I think this whole "truncate is allocation unit size > aware" algorithm is largely unworkable without a rewrite. What XFS > needs to do on truncate *down* before we start the truncate > transaction is pretty simple: > > - ensure that the new EOF extent tail contains zeroes > - ensure that the range from the existing ip->i_disk_size to > the new EOF is on disk so data vs metadata ordering is > correct for crash recovery purposes. > > What this patch does to acheive that is: > > 1. blocking writeback to clean dirty unwritten/cow blocks at > the new EOF. > 2. iomap_truncate_page() writes zeroes into the page cache, > which dirties the pages we just cleaned at the new EOF. > 3. blocking writeback to clean the dirty blocks at the new > EOF. > 4. truncate_setsize() then writes zeros to partial folios at > the new EOF, dirtying the EOF page again. > 5. blocking writeback to clean dirty blocks from the current > on-disk size to the new EOF. > > This is pretty crazy when you stop and think about it. We're writing > the same EOF block -three- times. The first data write gets > overwritten by zeroes on the second write, and the third write > writes the same zeroes as the second write. There are two redundant > *blocking* writes in this process. > > We can do all this with a single writeback operation if we are a > little bit smarter about the order of operations we perform and we > are a little bit smarter in iomap about zeroing dirty pages in the > page cache: > > 1. change iomap_zero_range() to do the right thing with > dirty unwritten and cow extents (the patch I've been working > on). > > 2. pass the range to be zeroed into iomap_truncate_page() > (the fundamental change being made here). > > 3. zero the required range *through the page cache* > (iomap_zero_range() already does this). > > 4. write back the XFS inode from ip->i_disk_size to the end > of the range zeroed by iomap_truncate_page() > (xfs_setattr_size() already does this). > > 5. i_size_write(newsize); > > 6. invalidate_inode_pages2_range(newsize, -1) to trash all > the page cache beyond the new EOF without doing any zeroing > as we've already done all the zeroing needed to the page > cache through iomap_truncate_page(). > > > The patch I'm working on for step 1 is below. It still needs to be > extended to handle the cow case, but I'm unclear on how to exercise > that case so I haven't written the code to do it. The rest of it is > just rearranging the code that we already use just to get the order > of operations right. The only notable change in behaviour is using > invalidate_inode_pages2_range() instead of truncate_pagecache(), > because we don't want the EOF page to be dirtied again once we've > already written zeroes to disk.... > > -- > Dave Chinner > david@xxxxxxxxxxxxx > > > [RFC] iomap: zeroing needs to be pagecache aware > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > Unwritten extents can have page cache data over the range being > zeroed so we can't just skip them entirely. Fix this by checking for > an existing dirty folio over the unwritten range we are zeroing > and only performing zeroing if the folio is already dirty. > > XXX: how do we detect a iomap containing a cow mapping over a hole > in iomap_zero_iter()? The XFS code implies this case also needs to > zero the page cache if there is data present, so trigger for page > cache lookup only in iomap_zero_iter() needs to handle this case as > well. Hmm. If memory serves, we probably need to adapt the xfs_buffered/direct_write_iomap_begin functions to return the hole in srcmap and the cow mapping in the iomap. RN I think it just returns the hole. --D > Before: > > $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000 > path /mnt/scratch/foo, 50000 iters > > real 0m14.103s > user 0m0.015s > sys 0m0.020s > > $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000 > path /mnt/scratch/foo, 50000 iters > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 85.90 0.847616 16 50000 ftruncate > 14.01 0.138229 2 50000 pwrite64 > .... > > After: > > $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000 > path /mnt/scratch/foo, 50000 iters > > real 0m0.144s > user 0m0.021s > sys 0m0.012s > > $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000 > path /mnt/scratch/foo, 50000 iters > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 53.86 0.505964 10 50000 ftruncate > 46.12 0.433251 8 50000 pwrite64 > .... > > Yup, we get back all the performance. > > As for the "mmap write beyond EOF" data exposure aspect > documented here: > > https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@xxxxxxxxxx/ > > With this command: > > $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \ > -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \ > -c fsync -c "pread -v 3k 32" /mnt/scratch/foo > > Before: > > wrote 1024/1024 bytes at offset 0 > 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec) > wrote 4096/4096 bytes at offset 32768 > 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec) > 00000c00: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX > 00000c10: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX > read 32/32 bytes at offset 3072 > 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec > > After: > > wrote 1024/1024 bytes at offset 0 > 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec) > wrote 4096/4096 bytes at offset 32768 > 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec) > 00000c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000c10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > read 32/32 bytes at offset 3072 > 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec) > > We see that this post-eof unwritten extent dirty page zeroing is > working correctly. > > This has passed through most of fstests on a couple of test VMs > without issues at the moment, so I think this approach to fixing the > issue is going to be solid once we've worked out how to detect the > COW-hole mapping case. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++++++++++++++++-- > fs/xfs/xfs_iops.c | 12 +----------- > 2 files changed, 41 insertions(+), 13 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 4e8e41c8b3c0..6877474de0c9 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -583,11 +583,23 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); > * > * Returns a locked reference to the folio at @pos, or an error pointer if the > * folio could not be obtained. > + * > + * Note: when zeroing unwritten extents, we might have data in the page cache > + * over an unwritten extent. In this case, we want to do a pure lookup on the > + * page cache and not create a new folio as we don't need to perform zeroing on > + * unwritten extents if there is no cached data over the given range. > */ > struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) > { > fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS; > > + if (iter->flags & IOMAP_ZERO) { > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > + > + if (srcmap->type == IOMAP_UNWRITTEN) > + fgp &= ~FGP_CREAT; > + } > + > if (iter->flags & IOMAP_NOWAIT) > fgp |= FGP_NOWAIT; > fgp |= fgf_set_order(len); > @@ -1375,7 +1387,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > loff_t written = 0; > > /* already zeroed? we're done. */ > - if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) > + if (srcmap->type == IOMAP_HOLE) > return length; > > do { > @@ -1385,8 +1397,22 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > size_t bytes = min_t(u64, SIZE_MAX, length); > > status = iomap_write_begin(iter, pos, bytes, &folio); > - if (status) > + if (status) { > + if (status == -ENOENT) { > + /* > + * Unwritten extents need to have page cache > + * lookups done to determine if they have data > + * over them that needs zeroing. If there is no > + * data, we'll get -ENOENT returned here, so we > + * can just skip over this index. > + */ > + WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN); > + if (bytes > PAGE_SIZE - offset_in_page(pos)) > + bytes = PAGE_SIZE - offset_in_page(pos); > + goto loop_continue; > + } > return status; > + } > if (iter->iomap.flags & IOMAP_F_STALE) > break; > > @@ -1394,6 +1420,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > if (bytes > folio_size(folio) - offset) > bytes = folio_size(folio) - offset; > > + /* > + * If the folio over an unwritten extent is clean (i.e. because > + * it has been read from), then it already contains zeros. Hence > + * we can just skip it. > + */ > + if (srcmap->type == IOMAP_UNWRITTEN && > + !folio_test_dirty(folio)) { > + folio_unlock(folio); > + goto loop_continue; > + } > + > folio_zero_range(folio, offset, bytes); > folio_mark_accessed(folio); > > @@ -1401,6 +1438,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > if (WARN_ON_ONCE(bytes == 0)) > return -EIO; > > +loop_continue: > pos += bytes; > length -= bytes; > written += bytes; > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index 8a145ca7d380..e8c9f3018c80 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -838,17 +838,7 @@ xfs_setattr_size( > trace_xfs_zero_eof(ip, oldsize, newsize - oldsize); > error = xfs_zero_range(ip, oldsize, newsize - oldsize, > &did_zeroing); > - } else { > - /* > - * iomap won't detect a dirty page over an unwritten block (or a > - * cow block over a hole) and subsequently skips zeroing the > - * newly post-EOF portion of the page. Flush the new EOF to > - * convert the block before the pagecache truncate. > - */ > - error = filemap_write_and_wait_range(inode->i_mapping, newsize, > - newsize); > - if (error) > - return error; > + } else if (newsize != oldsize) { > error = xfs_truncate_page(ip, newsize, &did_zeroing); > } > >