Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Fri, 17 May 2024 10:59:00 -0700

On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> 
> When truncating a realtime file unaligned to a shorter size,
> xfs_setattr_size() only flush the EOF page before zeroing out, and
> xfs_truncate_page() also only zeros the EOF block. This could expose
> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> a write operation").
> 
> If the sb_rextsize is bigger than one block, and we have a realtime
> inode that contains a long enough written extent. If we unaligned
> truncate into the middle of this extent, xfs_itruncate_extents() could
> split the extent and align the it's tail to sb_rextsize, there maybe
> have more than one blocks more between the end of the file. Since
> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> value, so it may leftover some blocks contains stale data that could be
> exposed if we append write it over a long enough distance later.

IOWs, any time we truncate down, we need to zero every byte from the new
EOF all the way to the end of the allocation unit, correct?

Maybe pictures would be easier to reason with.  Say you have
rextsize=30 and a partially written rtextent; each 'W' is a written
fsblock and 'u' is an unwritten fsblock:

WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
                    ^ old EOF

Now you want to truncate down:

WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
     ^ new EOF      ^ old EOF

Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
so the truncate leaves the file in this state:

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
     ^ new EOF      ^ old EOF

(where 'z' is a written block with zeroes after EOF)

This is bad because the "W"s between the new and old EOF still contain
old credit card info or whatever.  Now if we mmap the file or whatever,
we can access those old contents.

So your new patch amends iomap_truncate_page so that it'll zero all the
way to the end of the @blocksize parameter.  That fixes the exposure by 
writing zeroes to the pagecache before we truncate down:

WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
     ^ new EOF      ^ old EOF

Is that correct?

If so, then why don't we make xfs_truncate_page convert the post-eof
rtextent blocks back to unwritten status:

WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
     ^ new EOF      ^ old EOF

If we can do that, then do we need the changes to iomap_truncate_page?
Converting the mapping should be much faster than dirtying potentially
a lot of data (rt extents can be 1GB in size).

> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> and make sure the entire zeroed range have been flushed to disk before
> updating the inode size.
> 
> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_iops.c  | 10 ----------
>  2 files changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 4958cc3337bc..fc379450fe74 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>  	loff_t			pos,
>  	bool			*did_zero)
>  {
> +	struct xfs_mount	*mp = ip->i_mount;
>  	struct inode		*inode = VFS_I(ip);
>  	unsigned int		blocksize = i_blocksize(inode);
> +	int			error;
> +
> +	if (XFS_IS_REALTIME_INODE(ip))
> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);

Don't opencode xfs_inode_alloc_unitsize, please.

> +
> +	/*
> +	 * iomap won't detect a dirty page over an unwritten block (or a
> +	 * cow block over a hole) and subsequently skips zeroing the
> +	 * newly post-EOF portion of the page. Flush the new EOF to
> +	 * convert the block before the pagecache truncate.
> +	 */
> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> +					     roundup_64(pos, blocksize));
> +	if (error)
> +		return error;pos_in_block

Ok so this is hoisting the filemap_write_and_wait_range call from
xfs_setattr_size.  It's curious that we need to need to twiddle anything
other than the EOF block itself though?

>  
>  	if (IS_DAX(inode))
> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> -					&xfs_dax_write_iomap_ops);
> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> -				   &xfs_buffered_write_iomap_ops);
> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> +					  &xfs_dax_write_iomap_ops);
> +	else
> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> +					    &xfs_buffered_write_iomap_ops);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Write back path won't write dirty blocks post EOF folio,
> +	 * flush the entire zeroed range before updating the inode
> +	 * size.
> +	 */
> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> +					    roundup_64(pos, blocksize));

...but what is the purpose of the second filemap_write_and_wait_range
call?  Is that to flush the bytes between new and old EOF to disk before
truncate_setsize invalidates the (zeroed) pagecache?

--D

>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 66f8c47642e8..baeeddf4a6bb 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -845,16 +845,6 @@ xfs_setattr_size(
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
>  	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -						     newsize);
> -		if (error)
> -			return error;
>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
> -- 
> 2.39.2
> 
>