Re: [RFC PATCH v4 24/34] ext4: implement buffered write iomap path

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Tue, 7 May 2024 13:10:53 +0800

On 2024/5/7 7:19, Dave Chinner wrote:
> On Mon, May 06, 2024 at 07:44:44PM +0800, Zhang Yi wrote:
>> On 2024/5/1 16:33, Dave Chinner wrote:
>>> On Wed, May 01, 2024 at 06:11:13PM +1000, Dave Chinner wrote:
>>>> On Wed, Apr 10, 2024 at 10:29:38PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>>
>>>>> Implement buffered write iomap path, use ext4_da_map_blocks() to map
>>>>> delalloc extents and add ext4_iomap_get_blocks() to allocate blocks if
>>>>> delalloc is disabled or free space is about to run out.
>>>>>
>>>>> Note that we always allocate unwritten extents for new blocks in the
>>>>> iomap write path, this means that the allocation type is no longer
>>>>> controlled by the dioread_nolock mount option. After that, we could
>>>>> postpone the i_disksize updating to the writeback path, and drop journal
>>>>> handle in the buffered dealloc write path completely.
>>> .....
>>>>> +/*
>>>>> + * Drop the staled delayed allocation range from the write failure,
>>>>> + * including both start and end blocks. If not, we could leave a range
>>>>> + * of delayed extents covered by a clean folio, it could lead to
>>>>> + * inaccurate space reservation.
>>>>> + */
>>>>> +static int ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
>>>>> +				     loff_t length)
>>>>> +{
>>>>> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
>>>>> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
>>>>>  	return 0;
>>>>>  }
>>>>>  
>>>>> +static int ext4_iomap_buffered_write_end(struct inode *inode, loff_t offset,
>>>>> +					 loff_t length, ssize_t written,
>>>>> +					 unsigned int flags,
>>>>> +					 struct iomap *iomap)
>>>>> +{
>>>>> +	handle_t *handle;
>>>>> +	loff_t end;
>>>>> +	int ret = 0, ret2;
>>>>> +
>>>>> +	/* delalloc */
>>>>> +	if (iomap->flags & IOMAP_F_EXT4_DELALLOC) {
>>>>> +		ret = iomap_file_buffered_write_punch_delalloc(inode, iomap,
>>>>> +			offset, length, written, ext4_iomap_punch_delalloc);
>>>>> +		if (ret)
>>>>> +			ext4_warning(inode->i_sb,
>>>>> +			     "Failed to clean up delalloc for inode %lu, %d",
>>>>> +			     inode->i_ino, ret);
>>>>> +		return ret;
>>>>> +	}
>>>>
>>>> Why are you creating a delalloc extent for the write operation and
>>>> then immediately deleting it from the extent tree once the write
>>>> operation is done?
>>>
>>> Ignore this, I mixed up the ext4_iomap_punch_delalloc() code
>>> directly above with iomap_file_buffered_write_punch_delalloc().
>>>
>>> In hindsight, iomap_file_buffered_write_punch_delalloc() is poorly
>>> named, as it is handling a short write situation which requires
>>> newly allocated delalloc blocks to be punched.
>>> iomap_file_buffered_write_finish() would probably be a better name
>>> for it....
>>>
>>>> Also, why do you need IOMAP_F_EXT4_DELALLOC? Isn't a delalloc iomap
>>>> set up with iomap->type = IOMAP_DELALLOC? Why can't that be used?
>>>
>>> But this still stands - the first thing
>>> iomap_file_buffered_write_punch_delalloc() is:
>>>
>>> 	if (iomap->type != IOMAP_DELALLOC)
>>>                 return 0;
>>>
>>
>> Thanks for the suggestion, the delalloc and non-delalloc write paths
>> share the same ->iomap_end() now (i.e. ext4_iomap_buffered_write_end()),
>> I use the IOMAP_F_EXT4_DELALLOC to identify the write path.
> 
> Again, you don't need that. iomap tracks newly allocated
> IOMAP_DELALLOC extents via the IOMAP_F_NEW flag that should be
> getting set in the ->iomap_begin() call when it creates a new
> delalloc extent.
> 
> Please look at the second check in
> iomap_file_buffered_write_punch_delalloc():
> 
> 	if (iomap->type != IOMAP_DELALLOC)
>                 return 0;
> 
>         /* If we didn't reserve the blocks, we're not allowed to punch them. */
>         if (!(iomap->flags & IOMAP_F_NEW))
>                 return 0;
> 
>> For
>> non-delalloc path, If we have allocated more blocks and copied less, we
>> should truncate extra blocks that newly allocated by ->iomap_begin().
> 
> Why? If they were allocated as unwritten, then you can just leave
> them there as unwritten extents, same as XFS. Keep in mind that if
> we get a short write, it is extremely likely the application is
> going to rewrite the remaining data immediately, so if we allocated
> blocks they are likely to still be needed, anyway....
> 

Make sense, we don't need to free the extra blocks beyond EOF since they
are unwritten, we can drop this handle for non-delalloc path on ext4 now.

>> If we use IOMAP_DELALLOC, we can't tell if the blocks are
>> pre-existing or newly allocated, we can't truncate the
>> pre-existing blocks, so I have to introduce IOMAP_F_EXT4_DELALLOC.
>> But if we split the delalloc and non-delalloc handler, we could
>> drop IOMAP_F_EXT4_DELALLOC.
> 
> As per above: IOMAP_F_NEW tells us -exactly- this.
> 
> IOMAP_F_NEW should be set on any newly allocated block - delalloc or
> real - because that's the flag that tells the iomap infrastructure
> whether zero-around is needed for partial block writes. If ext4 is
> not setting this flag on delalloc regions allocated by
> ->iomap_begin(), then that's a serious bug.
> 
>> I also checked xfs, IIUC, xfs doesn't free the extra blocks beyond EOF
>> in xfs_buffered_write_iomap_end() for non-delalloc case since they will
>> be freed by xfs_free_eofblocks in some other inactive paths, like
>> xfs_release()/xfs_inactive()/..., is that right?
> 
> XFS doesn't care about real blocks beyond EOF existing -
> xfs_free_eofblocks() is an optimistic operation that does not
> guarantee that it will remove blocks beyond EOF. Similarly, we don't
> care about real blocks within EOF because we alway allocate data
> extents as unwritten, so we don't have any stale data exposure
> issues to worry about on short writes leaving allocated blocks
> behind.
> 
> OTOH, delalloc extents without dirty page cache pages over them
> cannot be allowed to exist. Without dirty pages, there is no trigger
> to convert those to real extents (i.e. nothing to write back). Hence
> the only sane thing that can be done with them on a write error or
> short write is remove them in the context where they were created.
> 
> This is the only reason that the
> iomap_file_buffered_write_punch_delalloc() exists - it abstracts
> this nasty corner case away from filesystems that support delalloc
> so they don't have to worry about getting this right. That's whole
> point of having delalloc aware infrastructure - individual
> filesysetms don't need to handle all these weird corner cases
> themselves because the infrastructure takes care of them...
> 

Yeah, thanks for the explanation. The iomap_file_buffered_write_punch_delalloc()
is very useful, it find pages that have dirty data still pending in the page
cache, punch out all the delalloc blocks beside those blocks. I realized that
it is used to fix a race condition between either writeback or mmap page
faults that xfs encountered [1].

We will meet the same problem for ext3 and ext2 which are not extent based.
Their new allocated blocks were written, we need to free them if we get a short
write, but we can't simply do it through ext2_write_failed() and
ext4_truncate_failed_write(), we still need to use
iomap_file_buffered_write_punch_delalloc().

[1] https://lore.kernel.org/all/20221123055812.747923-6-david@xxxxxxxxxxxxx/

Thanks,
Yi.