Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Thu, 23 May 2024 10:00:02 +0800

On 2024/5/23 9:11, Dave Chinner wrote:
> On Wed, May 22, 2024 at 09:57:13AM +0800, Zhang Yi wrote:
>> On 2024/5/21 10:38, Dave Chinner wrote:
>>> We can do all this with a single writeback operation if we are a
>>> little bit smarter about the order of operations we perform and we
>>> are a little bit smarter in iomap about zeroing dirty pages in the
>>> page cache:
>>>
>>> 	1. change iomap_zero_range() to do the right thing with
>>> 	dirty unwritten and cow extents (the patch I've been working
>>> 	on).
>>>
>>> 	2. pass the range to be zeroed into iomap_truncate_page()
>>> 	(the fundamental change being made here).
>>>
>>> 	3. zero the required range *through the page cache*
>>> 	(iomap_zero_range() already does this).
>>>
>>> 	4. write back the XFS inode from ip->i_disk_size to the end
>>> 	of the range zeroed by iomap_truncate_page()
>>> 	(xfs_setattr_size() already does this).
>>>
>>> 	5. i_size_write(newsize);
>>>
>>> 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
>>> 	the page cache beyond the new EOF without doing any zeroing
>>> 	as we've already done all the zeroing needed to the page
>>> 	cache through iomap_truncate_page().
>>>
>>>
>>> The patch I'm working on for step 1 is below. It still needs to be
>>> extended to handle the cow case, but I'm unclear on how to exercise
>>> that case so I haven't written the code to do it. The rest of it is
>>> just rearranging the code that we already use just to get the order
>>> of operations right. The only notable change in behaviour is using
>>> invalidate_inode_pages2_range() instead of truncate_pagecache(),
>>> because we don't want the EOF page to be dirtied again once we've
>>> already written zeroes to disk....
>>>
>>
>> Indeed, this sounds like the best solution. Since Darrick recommended
>> that we could fix the stale data exposure on realtime inode issue by
>> convert the tail extent to unwritten, I suppose we could do this after
>> fixing the problem.
> 
> We also need to fix the truncate issue for the upcoming forced
> alignment feature (for atomic writes), and in that case we are
> required to write zeroes to the entire tail extent. i.e. forced
> alignment does not allow partial unwritten extent conversion of
> the EOF extent.
> 

Yes, right. I noticed that feature also needs to fix.

> Hence I think we want to fix the problem by zeroing the entire EOF
> extent first, then optimise the large rtextsize case to use
> unwritten extents if that tail zeroing proves to be a performance
> issue.
> 
> I say "if" because the large rtextsize case will still need to write
> zeroes for the fsb that spans EOF. Adding conversion of the rest of
> the extent to unwritten may well be more expensive (in terms of both
> CPU and IO requirements for the transactional metadata updates) than
> just submitting a slightly larger IO containing real zeroes and
> leaving it as a written extent....
> 

Yeah, if the rtextsize if not large (in most cases), I'm pretty sure
that writing zeros would better. If the rtextsize is large enough, I
think it deserves a performance test.

Thanks,
Yi.