On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@xxxxxxxxxx> > > When truncating a realtime file unaligned to a shorter size, > xfs_setattr_size() only flush the EOF page before zeroing out, and > xfs_truncate_page() also only zeros the EOF block. This could expose > stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not > a write operation"). > > If the sb_rextsize is bigger than one block, and we have a realtime > inode that contains a long enough written extent. If we unaligned > truncate into the middle of this extent, xfs_itruncate_extents() could > split the extent and align the it's tail to sb_rextsize, there maybe > have more than one blocks more between the end of the file. Since > xfs_truncate_page() only zeros the trailing portion of the i_blocksize() > value, so it may leftover some blocks contains stale data that could be > exposed if we append write it over a long enough distance later. IOWs, any time we truncate down, we need to zero every byte from the new EOF all the way to the end of the allocation unit, correct? Maybe pictures would be easier to reason with. Say you have rextsize=30 and a partially written rtextent; each 'W' is a written fsblock and 'u' is an unwritten fsblock: WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu ^ old EOF Now you want to truncate down: WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu ^ new EOF ^ old EOF Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize, so the truncate leaves the file in this state: WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu ^ new EOF ^ old EOF (where 'z' is a written block with zeroes after EOF) This is bad because the "W"s between the new and old EOF still contain old credit card info or whatever. Now if we mmap the file or whatever, we can access those old contents. So your new patch amends iomap_truncate_page so that it'll zero all the way to the end of the @blocksize parameter. That fixes the exposure by writing zeroes to the pagecache before we truncate down: WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu ^ new EOF ^ old EOF Is that correct? If so, then why don't we make xfs_truncate_page convert the post-eof rtextent blocks back to unwritten status: WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu ^ new EOF ^ old EOF If we can do that, then do we need the changes to iomap_truncate_page? Converting the mapping should be much faster than dirtying potentially a lot of data (rt extents can be 1GB in size). > xfs_truncate_page() should flush, zeros out the entire rtextsize range, > and make sure the entire zeroed range have been flushed to disk before > updating the inode size. > > Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation") > Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx> > Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> > --- > fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++---- > fs/xfs/xfs_iops.c | 10 ---------- > 2 files changed, 31 insertions(+), 14 deletions(-) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 4958cc3337bc..fc379450fe74 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -1466,12 +1466,39 @@ xfs_truncate_page( > loff_t pos, > bool *did_zero) > { > + struct xfs_mount *mp = ip->i_mount; > struct inode *inode = VFS_I(ip); > unsigned int blocksize = i_blocksize(inode); > + int error; > + > + if (XFS_IS_REALTIME_INODE(ip)) > + blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize); Don't opencode xfs_inode_alloc_unitsize, please. > + > + /* > + * iomap won't detect a dirty page over an unwritten block (or a > + * cow block over a hole) and subsequently skips zeroing the > + * newly post-EOF portion of the page. Flush the new EOF to > + * convert the block before the pagecache truncate. > + */ > + error = filemap_write_and_wait_range(inode->i_mapping, pos, > + roundup_64(pos, blocksize)); > + if (error) > + return error;pos_in_block Ok so this is hoisting the filemap_write_and_wait_range call from xfs_setattr_size. It's curious that we need to need to twiddle anything other than the EOF block itself though? > > if (IS_DAX(inode)) > - return dax_truncate_page(inode, pos, blocksize, did_zero, > - &xfs_dax_write_iomap_ops); > - return iomap_truncate_page(inode, pos, blocksize, did_zero, > - &xfs_buffered_write_iomap_ops); > + error = dax_truncate_page(inode, pos, blocksize, did_zero, > + &xfs_dax_write_iomap_ops); > + else > + error = iomap_truncate_page(inode, pos, blocksize, did_zero, > + &xfs_buffered_write_iomap_ops); > + if (error) > + return error; > + > + /* > + * Write back path won't write dirty blocks post EOF folio, > + * flush the entire zeroed range before updating the inode > + * size. > + */ > + return filemap_write_and_wait_range(inode->i_mapping, pos, > + roundup_64(pos, blocksize)); ...but what is the purpose of the second filemap_write_and_wait_range call? Is that to flush the bytes between new and old EOF to disk before truncate_setsize invalidates the (zeroed) pagecache? --D > } > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index 66f8c47642e8..baeeddf4a6bb 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -845,16 +845,6 @@ xfs_setattr_size( > error = xfs_zero_range(ip, oldsize, newsize - oldsize, > &did_zeroing); > } else { > - /* > - * iomap won't detect a dirty page over an unwritten block (or a > - * cow block over a hole) and subsequently skips zeroing the > - * newly post-EOF portion of the page. Flush the new EOF to > - * convert the block before the pagecache truncate. > - */ > - error = filemap_write_and_wait_range(inode->i_mapping, newsize, > - newsize); > - if (error) > - return error; > error = xfs_truncate_page(ip, newsize, &did_zeroing); > } > > -- > 2.39.2 > >