Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Mon, 20 May 2024 15:11:32 +0800



On 2024/5/20 14:56, Zhang Yi wrote:
> On 2024/5/19 3:26, Darrick J. Wong wrote:
>> On Sat, May 18, 2024 at 02:35:02PM +0800, Zhang Yi wrote:
>>> On 2024/5/18 1:59, Darrick J. Wong wrote:
>>>> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>>
>>>>> When truncating a realtime file unaligned to a shorter size,
>>>>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>>>>> xfs_truncate_page() also only zeros the EOF block. This could expose
>>>>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>>>>> a write operation").
>>>>>
>>>>> If the sb_rextsize is bigger than one block, and we have a realtime
>>>>> inode that contains a long enough written extent. If we unaligned
>>>>> truncate into the middle of this extent, xfs_itruncate_extents() could
>>>>> split the extent and align the it's tail to sb_rextsize, there maybe
>>>>> have more than one blocks more between the end of the file. Since
>>>>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>>>>> value, so it may leftover some blocks contains stale data that could be
>>>>> exposed if we append write it over a long enough distance later.
>>
>> Hum.  Is this an appending write into the next rtextent?  For example,
>> if you start with a file like this:
>>
>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>                     ^ old EOF
>>
>> Then truncate it improperly like this:
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>>      ^ new EOF               
>>
>> Then do an extending write like this:
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>>      ^ EOF                    ^ next rtx        ^ append here
>>
>> And now the problem is that we've exposed stale data that should be
>> zeroes?
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>>       ^^^^^^^^^^^^^^^                             ^ new EOF
>>       should be zeroed
>>
> 
> Yeah.
> 
>>>>
>>>> IOWs, any time we truncate down, we need to zero every byte from the new
>>>> EOF all the way to the end of the allocation unit, correct?
>>>
>>> Yeah.
>>>
>>>>
>>>> Maybe pictures would be easier to reason with.  Say you have
>>>> rextsize=30 and a partially written rtextent; each 'W' is a written
>>>> fsblock and 'u' is an unwritten fsblock:
>>>>
>>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>>                     ^ old EOF
>>>>
>>>> Now you want to truncate down:
>>>>
>>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>>      ^ new EOF      ^ old EOF
>>>>
>>>> Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
>>>> so the truncate leaves the file in this state:
>>>>
>>>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>>>>      ^ new EOF      ^ old EOF
>>>>
>>>> (where 'z' is a written block with zeroes after EOF)
>>>>
>>>> This is bad because the "W"s between the new and old EOF still contain
>>>> old credit card info or whatever.  Now if we mmap the file or whatever,
>>>> we can access those old contents.
>>>>
>>>> So your new patch amends iomap_truncate_page so that it'll zero all the
>>>> way to the end of the @blocksize parameter.  That fixes the exposure by 
>>>> writing zeroes to the pagecache before we truncate down:
>>>>
>>>> WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
>>>>      ^ new EOF      ^ old EOF
>>>>
>>>> Is that correct?
>>>>
>>>
>>> Yes, it's correct. However, not only write zeros to the pagecache, but
>>> also flush to disk, please see below for details.
>>
>> <nod> iomap_truncate_page writes zeroes to any part of the pagecache
>> backed by written extents, and then xfs must call
>> filemap_write_and_wait_range to write the dirty (zeroed) cache out to
>> disk.
>>
>>>> If so, then why don't we make xfs_truncate_page convert the post-eof
>>>> rtextent blocks back to unwritten status:
>>>>
>>>> WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
>>>>      ^ new EOF      ^ old EOF
>>>>
>>>> If we can do that, then do we need the changes to iomap_truncate_page?
>>>> Converting the mapping should be much faster than dirtying potentially
>>>> a lot of data (rt extents can be 1GB in size).
>>>
>>> Now that the exposed stale data range (should be zeroed) is only one
>>> rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
>>> it breaks the alignment of rtextent and the definition of "extsize is used
>>> to specify the size of the blocks in the real-time section of the
>>> filesystem", is it fine?
>>
>> A written -> unwritten extent conversion doesn't change which physical
>> space extent is mapped to the file data extent; it merely marks the
>> mapping as unwritten.
>>
>> For example, if you start with this mapping:
>>
>> {startoff = 8, startblock 256, blockcount = 8, state = written}
>>
>> and then convert blocks 13-15 to unwritten, you get:
>>
>> {startoff = 8, startblock 256, blockcount = 5, state = written}
>> {startoff = 13, startblock 261, blockcount = 3, state = unwritten}
>>
>> File blocks 8-15 still map to physical space 256-263.
> 
> Yeah, indeed.
> 
>>
>> In xfs, the entire allocation unit is /always/ mapped to the file, even
>> if parts of it have to be unwritten.  Hole punching on rt, for example,
>> converts the punched region to unwritten.  This is (iirc) the key
>> difference between xfs rt and ext4 bigalloc.  xfs doesn't have or need
>> (or want) the implied cluster allocation code that ext4 has.
>>
> 
> I checked the xfs_file_fallocate(), it looks like hole punching on realtime
> inode is still follow the rtextsize alignment, i.e. if we punch hole on a
> file that only contains one written extent, it doesn't split it and convet
> the punched range to unwritten. Please take a look at
> xfs_file_fallocate()->xfs_free_file_space(), it aligned the freeing range
> and zeroing out the whole unligned range in one reextsize unit, and
> FALLOC_FL_ZERO_RANGE is the same.
> 
>  836	/* We can only free complete realtime extents. */
>  837	if (xfs_inode_has_bigrtalloc(ip)) {
>  838		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
>  839 		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
>  840	}
> ...
>  864	error = xfs_zero_range(ip, offset, len, NULL);
> 
> And I tested it on my machine, it's true that it doesn't do the convertion.
> 
>   # mkfs.xfs -f -rrtdev=/dev/nvme0n1 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/pmem2s
>   # mount -ortdev=/dev/nvme0n1 /dev/pmem2s /mnt/scratch
>   # xfs_io -f -c "pwrite 0 28k" -c "fsync" /mnt/scratch/foo
>   # xfs_io -c "fpunch 4k 24k" /mnt/scratch/foo
>   # umount /mnt/scratch
> 
>   # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
>    u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
>    0:[0,0,7,0]
> 
> Am I missed something?
> 
>> I can't tell if there's something that you see that I don't see such
>> that we really /do/ need to actually write zeroes to the entire tail of
>> the rtextent; or if you weren't sure that forcing all the post-eof
>> fsblocks in the rtextent to unwritten (and zapping the pagecache) would
>> actually preserve the rtextsize alignment.
> 
> I haven't found any restrictions yet, and I also noticed that a simple
> write is not guaranteed to align the extent to rtextsize, since the write
> back path doesn't zeroing out the extra blocks that align to the
> rtextsize.
> 
>   # #extsize=28k
>   # xfs_io -d -f -c  "pwrite 0 4k" -c "fsync" /mnt/scratch/foo
>   # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
>      u3.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
>      0:[0,0,1,0]
>      1:[1,1,6,1]
> 
> So I guess convert the tail fsblocks of the rtextent to unwritten status
> could work. However, I'm a little confused, besides the write operation,
> other operations like above punch hold and zero range, they seem to be
> doing their best to follow the alignment rule since commit fe341eb151ec
> ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned
> to rt extent size") [1], it looks like this commit is to fix some issues,
> so I'm not sure that converting to unwritten would always preserve the
> rtextsize alignment.
> 
> [1]. https://lore.kernel.org/linux-xfs/159950168085.582172.4254559621934598919.stgit@magnolia/
> 
>>
>> (Or if there's something else?)
>>
>>>                          And IIUC, the upcoming xfs force alignment
>>> extent feature seems also need to follow this alignment, right?
>>
>> Yes.
>>
>>>>
>>>>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>>>>> and make sure the entire zeroed range have been flushed to disk before
>>>>> updating the inode size.
>>>>>
>>>>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>>>>> Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
>>>>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx
>>>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>> ---
>>>>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>>>>  fs/xfs/xfs_iops.c  | 10 ----------
>>>>>  2 files changed, 31 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>>> index 4958cc3337bc..fc379450fe74 100644
>>>>> --- a/fs/xfs/xfs_iomap.c
>>>>> +++ b/fs/xfs/xfs_iomap.c
>>>>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>>>>  	loff_t			pos,
>>>>>  	bool			*did_zero)
>>>>>  {
>>>>> +	struct xfs_mount	*mp = ip->i_mount;
>>>>>  	struct inode		*inode = VFS_I(ip);
>>>>>  	unsigned int		blocksize = i_blocksize(inode);
>>>>> +	int			error;
>>>>> +
>>>>> +	if (XFS_IS_REALTIME_INODE(ip))
>>>>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>>>>
>>>> Don't opencode xfs_inode_alloc_unitsize, please.
>>>
>>> Ha, I missed the latest added helper, thanks for pointing this out.
>>>
>>>>
>>>>> +
>>>>> +	/*
>>>>> +	 * iomap won't detect a dirty page over an unwritten block (or a
>>>>> +	 * cow block over a hole) and subsequently skips zeroing the
>>>>> +	 * newly post-EOF portion of the page. Flush the new EOF to
>>>>> +	 * convert the block before the pagecache truncate.
>>>>> +	 */
>>>>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
>>>>> +					     roundup_64(pos, blocksize));
>>>>> +	if (error)
>>>>> +		return error;pos_in_block
>>>>
>>>> Ok so this is hoisting the filemap_write_and_wait_range call from
>>>> xfs_setattr_size.  It's curious that we need to need to twiddle anything
>>>> other than the EOF block itself though?
>>>
>>> Since we planed to zero out the dirtied range which is ailgned to the
>>> extsize instead of the blocksize, ensure one block is not unwritten is
>>> not enough, we should also make sure that the range which is going to
>>> zero out is not unwritten, or else the iomap_zero_iter() will skip
>>> zeroing out the extra blocks.
>>>
>>> For example:
>>>
>>> before zeroing:
>>>            |<-    extszie   ->|
>>>         ...dddddddddddddddddddd
>>>         ...UUUUUUUUUUUUUUUUUUUU
>>>            ^                  ^
>>>         new EOF             old EOF    (where 'd' means the pagecache is dirty)
>>>
>>> if we only flush the new EOF block, the result becomes:
>>>
>>>            |<-    extszie   ->|
>>>            zddddddddddddddddddd
>>>            ZUUUUUUUUUUUUUUUUUUU
>>>            ^                  ^
>>>         new EOF             old EOF
>>>
>>>
>>> then the dirty extent range that between new EOF block and the old EOF
>>> block can't be zeroed sine it's still unwritten. So we have to flush the
>>> whole range before zeroing out.
>>
>> "Z" on the second line of the second diagram is a written fsblock with
>> the tail zeroed, correct?
> 
> Yeah,
> 
>>
>> truncate_setsize -> truncate_pagecache unmaps all the pagecache after
>> the eof folio and unconditionally zeroes the tail of the eof folio
>> without regard to the mappings.  Doesn't that cover us here?  After the
>> truncate_setsize finishes, won't we end up in this state:
>>
>>            |<-   rextsize   ->|
>>            zzzzzzzz               
>>            ZUUUUUUUUUUUUUUUUUUU
>>            ^      ^           ^
>>         new EOF   |         old EOF
>>                   folio boundary
>>
> 
> Yeah, this case is fine, but the below case is not fine.
> 
> truncate                          write back
> xfs_setattr_size()
>  xfs_truncate_page()
>   filemap_write_and_wait_range(newsize, newsize) <- A
>   iomap_zero_range() <- B
>                                   flush dirty pages <- C
>  truncate_setsize() <- D
> 
> Please assume if a concurrent write back happenes just before
> truncate_setsize(), the state of the file changes as below:
> 
> A:
>               |<-    extszie   ->|
>               wddddddddddddddddddd (pagecache)
>               WUUUUUUUUUUUUUUUUUUU (disk)
>               ^                  ^
>            (new EOF)           old EOF  (where 'd' means the pagecache is dirty)
>                                         (where 'x' means the pagecache contianes user data)
                                                 ^^^
                                                  w
> 
> B:
>               |<-    extszie   ->|
>               zddddddddddddddddddd
>               ZUUUUUUUUUUUUUUUUUUU
>               ^                  ^
>            (new EOF)           old EOF  (where 'z' means the pagecache is zero)
> 
> C:
>               |<-    extszie   ->|
>               zwwwwwwwwwwwwwwwwwww
>               ZWWWWWWWWWWWWWWWWWWW
>               ^                  ^
>            (new EOF)           old EOF
> 
> D:
>               |<-    extszie   ->|
>               zzzzzzzzz
>               ZWWWWWWWWWWWWWWWWWWW
>               ^       ^          ^
>             new EOF   |       (old EOF)
>                       folio boundary
> 
> 
> Thanks,
> Yi.