Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Sat, 18 May 2024 12:26:02 -0700

On Sat, May 18, 2024 at 02:35:02PM +0800, Zhang Yi wrote:
> On 2024/5/18 1:59, Darrick J. Wong wrote:
> > On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>
> >> When truncating a realtime file unaligned to a shorter size,
> >> xfs_setattr_size() only flush the EOF page before zeroing out, and
> >> xfs_truncate_page() also only zeros the EOF block. This could expose
> >> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> >> a write operation").
> >>
> >> If the sb_rextsize is bigger than one block, and we have a realtime
> >> inode that contains a long enough written extent. If we unaligned
> >> truncate into the middle of this extent, xfs_itruncate_extents() could
> >> split the extent and align the it's tail to sb_rextsize, there maybe
> >> have more than one blocks more between the end of the file. Since
> >> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> >> value, so it may leftover some blocks contains stale data that could be
> >> exposed if we append write it over a long enough distance later.

Hum.  Is this an appending write into the next rtextent?  For example,
if you start with a file like this:

WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
                    ^ old EOF

Then truncate it improperly like this:

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
     ^ new EOF               

Then do an extending write like this:

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
     ^ EOF                    ^ next rtx        ^ append here

And now the problem is that we've exposed stale data that should be
zeroes?

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
      ^^^^^^^^^^^^^^^                             ^ new EOF
      should be zeroed

> > 
> > IOWs, any time we truncate down, we need to zero every byte from the new
> > EOF all the way to the end of the allocation unit, correct?
> 
> Yeah.
> 
> > 
> > Maybe pictures would be easier to reason with.  Say you have
> > rextsize=30 and a partially written rtextent; each 'W' is a written
> > fsblock and 'u' is an unwritten fsblock:
> > 
> > WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> >                     ^ old EOF
> > 
> > Now you want to truncate down:
> > 
> > WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> >      ^ new EOF      ^ old EOF
> > 
> > Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
> > so the truncate leaves the file in this state:
> > 
> > WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
> >      ^ new EOF      ^ old EOF
> > 
> > (where 'z' is a written block with zeroes after EOF)
> > 
> > This is bad because the "W"s between the new and old EOF still contain
> > old credit card info or whatever.  Now if we mmap the file or whatever,
> > we can access those old contents.
> > 
> > So your new patch amends iomap_truncate_page so that it'll zero all the
> > way to the end of the @blocksize parameter.  That fixes the exposure by 
> > writing zeroes to the pagecache before we truncate down:
> > 
> > WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
> >      ^ new EOF      ^ old EOF
> > 
> > Is that correct?
> > 
> 
> Yes, it's correct. However, not only write zeros to the pagecache, but
> also flush to disk, please see below for details.

<nod> iomap_truncate_page writes zeroes to any part of the pagecache
backed by written extents, and then xfs must call
filemap_write_and_wait_range to write the dirty (zeroed) cache out to
disk.

> > If so, then why don't we make xfs_truncate_page convert the post-eof
> > rtextent blocks back to unwritten status:
> > 
> > WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
> >      ^ new EOF      ^ old EOF
> > 
> > If we can do that, then do we need the changes to iomap_truncate_page?
> > Converting the mapping should be much faster than dirtying potentially
> > a lot of data (rt extents can be 1GB in size).
> 
> Now that the exposed stale data range (should be zeroed) is only one
> rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
> it breaks the alignment of rtextent and the definition of "extsize is used
> to specify the size of the blocks in the real-time section of the
> filesystem", is it fine?

A written -> unwritten extent conversion doesn't change which physical
space extent is mapped to the file data extent; it merely marks the
mapping as unwritten.

For example, if you start with this mapping:

{startoff = 8, startblock 256, blockcount = 8, state = written}

and then convert blocks 13-15 to unwritten, you get:

{startoff = 8, startblock 256, blockcount = 5, state = written}
{startoff = 13, startblock 261, blockcount = 3, state = unwritten}

File blocks 8-15 still map to physical space 256-263.

In xfs, the entire allocation unit is /always/ mapped to the file, even
if parts of it have to be unwritten.  Hole punching on rt, for example,
converts the punched region to unwritten.  This is (iirc) the key
difference between xfs rt and ext4 bigalloc.  xfs doesn't have or need
(or want) the implied cluster allocation code that ext4 has.

I can't tell if there's something that you see that I don't see such
that we really /do/ need to actually write zeroes to the entire tail of
the rtextent; or if you weren't sure that forcing all the post-eof
fsblocks in the rtextent to unwritten (and zapping the pagecache) would
actually preserve the rtextsize alignment.

(Or if there's something else?)

>                          And IIUC, the upcoming xfs force alignment
> extent feature seems also need to follow this alignment, right?

Yes.

> > 
> >> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> >> and make sure the entire zeroed range have been flushed to disk before
> >> updating the inode size.
> >>
> >> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> >> Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
> >> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx
> >> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >> ---
> >>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
> >>  fs/xfs/xfs_iops.c  | 10 ----------
> >>  2 files changed, 31 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >> index 4958cc3337bc..fc379450fe74 100644
> >> --- a/fs/xfs/xfs_iomap.c
> >> +++ b/fs/xfs/xfs_iomap.c
> >> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
> >>  	loff_t			pos,
> >>  	bool			*did_zero)
> >>  {
> >> +	struct xfs_mount	*mp = ip->i_mount;
> >>  	struct inode		*inode = VFS_I(ip);
> >>  	unsigned int		blocksize = i_blocksize(inode);
> >> +	int			error;
> >> +
> >> +	if (XFS_IS_REALTIME_INODE(ip))
> >> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
> > 
> > Don't opencode xfs_inode_alloc_unitsize, please.
> 
> Ha, I missed the latest added helper, thanks for pointing this out.
> 
> > 
> >> +
> >> +	/*
> >> +	 * iomap won't detect a dirty page over an unwritten block (or a
> >> +	 * cow block over a hole) and subsequently skips zeroing the
> >> +	 * newly post-EOF portion of the page. Flush the new EOF to
> >> +	 * convert the block before the pagecache truncate.
> >> +	 */
> >> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> >> +					     roundup_64(pos, blocksize));
> >> +	if (error)
> >> +		return error;pos_in_block
> > 
> > Ok so this is hoisting the filemap_write_and_wait_range call from
> > xfs_setattr_size.  It's curious that we need to need to twiddle anything
> > other than the EOF block itself though?
> 
> Since we planed to zero out the dirtied range which is ailgned to the
> extsize instead of the blocksize, ensure one block is not unwritten is
> not enough, we should also make sure that the range which is going to
> zero out is not unwritten, or else the iomap_zero_iter() will skip
> zeroing out the extra blocks.
> 
> For example:
> 
> before zeroing:
>            |<-    extszie   ->|
>         ...dddddddddddddddddddd
>         ...UUUUUUUUUUUUUUUUUUUU
>            ^                  ^
>         new EOF             old EOF    (where 'd' means the pagecache is dirty)
> 
> if we only flush the new EOF block, the result becomes:
> 
>            |<-    extszie   ->|
>            zddddddddddddddddddd
>            ZUUUUUUUUUUUUUUUUUUU
>            ^                  ^
>         new EOF             old EOF
> 
> 
> then the dirty extent range that between new EOF block and the old EOF
> block can't be zeroed sine it's still unwritten. So we have to flush the
> whole range before zeroing out.

"Z" on the second line of the second diagram is a written fsblock with
the tail zeroed, correct?

truncate_setsize -> truncate_pagecache unmaps all the pagecache after
the eof folio and unconditionally zeroes the tail of the eof folio
without regard to the mappings.  Doesn't that cover us here?  After the
truncate_setsize finishes, won't we end up in this state:

           |<-   rextsize   ->|
           zzzzzzzz               
           ZUUUUUUUUUUUUUUUUUUU
           ^      ^           ^
        new EOF   |         old EOF
                  folio boundary

> > 
> >>  
> >>  	if (IS_DAX(inode))
> >> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> >> -					&xfs_dax_write_iomap_ops);
> >> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> >> -				   &xfs_buffered_write_iomap_ops);
> >> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> >> +					  &xfs_dax_write_iomap_ops);
> >> +	else
> >> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> >> +					    &xfs_buffered_write_iomap_ops);
> >> +	if (error)
> >> +		return error;
> >> +
> >> +	/*
> >> +	 * Write back path won't write dirty blocks post EOF folio,
> >> +	 * flush the entire zeroed range before updating the inode
> >> +	 * size.
> >> +	 */
> >> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> >> +					    roundup_64(pos, blocksize));
> > 
> > ...but what is the purpose of the second filemap_write_and_wait_range
> > call?  Is that to flush the bytes between new and old EOF to disk before
> > truncate_setsize invalidates the (zeroed) pagecache?
> > 
> 
> The second filemap_write_and_wait_range() call is used to make sure that
> the zeroed data be flushed to disk before we updating i_size. If we don't
> add this one, once the i_size is been changed, the zeroed data which
> beyond the new EOF folio(block) couldn't be write back, because
> iomap_writepage_map()->iomap_writepage_handle_eof() skip that range, so
> the stale data problem is still there.
> 
> For example:
> 
> before zeroing:
>            |<-    extszie   ->|
>            wwwwwwwwwwwwwwwwwwww (pagecache)
>         ...WWWWWWWWWWWWWWWWWWWW (disk)
>            ^                  ^
>         new EOF               EOF   (where 'w' means the pagecache contains data)
> 
> then iomap_truncate_page() zeroing out the pagecache:
> 
>            |<-    extszie   ->|
>            zzzzzzzzzzzzzzzzzzzz (pagecache)
>            WWWWWWWWWWWWWWWWWWWW (disk)
>            ^                  ^
>         new EOF               EOF
> 
> then update i_size, sync and drop cache:
> 
>            |<-    extszie   ->|
>            ZWWWWWWWWWWWWWWWWWWW (disk)
>            ^
>            EOF

<nod> Ok, so this second call to filemap_write_and_wait_range flushes
the newly written pagecache to disk.  If it doesn't work to
force-convert the tail fsblocks of the rtextent to unwritten status,
then I suppose this is necessary if @blocksize != mp->m_sb.blocksize.

--D

> Thanks,
> Yi.
> 
> > 
> >>  }
> >> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> >> index 66f8c47642e8..baeeddf4a6bb 100644
> >> --- a/fs/xfs/xfs_iops.c
> >> +++ b/fs/xfs/xfs_iops.c
> >> @@ -845,16 +845,6 @@ xfs_setattr_size(
> >>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> >>  				&did_zeroing);
> >>  	} else {
> >> -		/*
> >> -		 * iomap won't detect a dirty page over an unwritten block (or a
> >> -		 * cow block over a hole) and subsequently skips zeroing the
> >> -		 * newly post-EOF portion of the page. Flush the new EOF to
> >> -		 * convert the block before the pagecache truncate.
> >> -		 */
> >> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> >> -						     newsize);
> >> -		if (error)
> >> -			return error;
> >>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
> >>  	}
> >>  
> >> -- 
> >> 2.39.2
> >>
> >>
> 
>