Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 23 May 2024 11:11:04 +1000

On Wed, May 22, 2024 at 09:57:13AM +0800, Zhang Yi wrote:
> On 2024/5/21 10:38, Dave Chinner wrote:
> > We can do all this with a single writeback operation if we are a
> > little bit smarter about the order of operations we perform and we
> > are a little bit smarter in iomap about zeroing dirty pages in the
> > page cache:
> > 
> > 	1. change iomap_zero_range() to do the right thing with
> > 	dirty unwritten and cow extents (the patch I've been working
> > 	on).
> > 
> > 	2. pass the range to be zeroed into iomap_truncate_page()
> > 	(the fundamental change being made here).
> > 
> > 	3. zero the required range *through the page cache*
> > 	(iomap_zero_range() already does this).
> > 
> > 	4. write back the XFS inode from ip->i_disk_size to the end
> > 	of the range zeroed by iomap_truncate_page()
> > 	(xfs_setattr_size() already does this).
> > 
> > 	5. i_size_write(newsize);
> > 
> > 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
> > 	the page cache beyond the new EOF without doing any zeroing
> > 	as we've already done all the zeroing needed to the page
> > 	cache through iomap_truncate_page().
> > 
> > 
> > The patch I'm working on for step 1 is below. It still needs to be
> > extended to handle the cow case, but I'm unclear on how to exercise
> > that case so I haven't written the code to do it. The rest of it is
> > just rearranging the code that we already use just to get the order
> > of operations right. The only notable change in behaviour is using
> > invalidate_inode_pages2_range() instead of truncate_pagecache(),
> > because we don't want the EOF page to be dirtied again once we've
> > already written zeroes to disk....
> > 
> 
> Indeed, this sounds like the best solution. Since Darrick recommended
> that we could fix the stale data exposure on realtime inode issue by
> convert the tail extent to unwritten, I suppose we could do this after
> fixing the problem.

We also need to fix the truncate issue for the upcoming forced
alignment feature (for atomic writes), and in that case we are
required to write zeroes to the entire tail extent. i.e. forced
alignment does not allow partial unwritten extent conversion of
the EOF extent.

Hence I think we want to fix the problem by zeroing the entire EOF
extent first, then optimise the large rtextsize case to use
unwritten extents if that tail zeroing proves to be a performance
issue.

I say "if" because the large rtextsize case will still need to write
zeroes for the fsb that spans EOF. Adding conversion of the rest of
the extent to unwritten may well be more expensive (in terms of both
CPU and IO requirements for the transactional metadata updates) than
just submitting a slightly larger IO containing real zeroes and
leaving it as a written extent....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx