On 2024/5/23 9:11, Dave Chinner wrote: > On Wed, May 22, 2024 at 09:57:13AM +0800, Zhang Yi wrote: >> On 2024/5/21 10:38, Dave Chinner wrote: >>> We can do all this with a single writeback operation if we are a >>> little bit smarter about the order of operations we perform and we >>> are a little bit smarter in iomap about zeroing dirty pages in the >>> page cache: >>> >>> 1. change iomap_zero_range() to do the right thing with >>> dirty unwritten and cow extents (the patch I've been working >>> on). >>> >>> 2. pass the range to be zeroed into iomap_truncate_page() >>> (the fundamental change being made here). >>> >>> 3. zero the required range *through the page cache* >>> (iomap_zero_range() already does this). >>> >>> 4. write back the XFS inode from ip->i_disk_size to the end >>> of the range zeroed by iomap_truncate_page() >>> (xfs_setattr_size() already does this). >>> >>> 5. i_size_write(newsize); >>> >>> 6. invalidate_inode_pages2_range(newsize, -1) to trash all >>> the page cache beyond the new EOF without doing any zeroing >>> as we've already done all the zeroing needed to the page >>> cache through iomap_truncate_page(). >>> >>> >>> The patch I'm working on for step 1 is below. It still needs to be >>> extended to handle the cow case, but I'm unclear on how to exercise >>> that case so I haven't written the code to do it. The rest of it is >>> just rearranging the code that we already use just to get the order >>> of operations right. The only notable change in behaviour is using >>> invalidate_inode_pages2_range() instead of truncate_pagecache(), >>> because we don't want the EOF page to be dirtied again once we've >>> already written zeroes to disk.... >>> >> >> Indeed, this sounds like the best solution. Since Darrick recommended >> that we could fix the stale data exposure on realtime inode issue by >> convert the tail extent to unwritten, I suppose we could do this after >> fixing the problem. > > We also need to fix the truncate issue for the upcoming forced > alignment feature (for atomic writes), and in that case we are > required to write zeroes to the entire tail extent. i.e. forced > alignment does not allow partial unwritten extent conversion of > the EOF extent. > Yes, right. I noticed that feature also needs to fix. > Hence I think we want to fix the problem by zeroing the entire EOF > extent first, then optimise the large rtextsize case to use > unwritten extents if that tail zeroing proves to be a performance > issue. > > I say "if" because the large rtextsize case will still need to write > zeroes for the fsb that spans EOF. Adding conversion of the rest of > the extent to unwritten may well be more expensive (in terms of both > CPU and IO requirements for the transactional metadata updates) than > just submitting a slightly larger IO containing real zeroes and > leaving it as a written extent.... > Yeah, if the rtextsize if not large (in most cases), I'm pretty sure that writing zeros would better. If the rtextsize is large enough, I think it deserves a performance test. Thanks, Yi.