Re: [PATCH v4 03/10] ext4: don't write back data before punch hole in nojournal mode

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Wed, 18 Dec 2024 15:10:36 +0800

On 2024/12/17 22:31, Ojaswin Mujoo wrote:
> On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> There is no need to write back all data before punching a hole in
>> non-journaled mode since it will be dropped soon after removing space.
>> Therefore, the call to filemap_write_and_wait_range() can be eliminated.
> 
> Hi, sorry I'm a bit late to this however following the discussion here
> [1], I believe the initial concern was that we don't in PATCH v1 01/10 
> was that after truncating the pagecache, the ext4_alloc_file_blocks()
> call might fail with errors like EIO, ENOMEM etc leading to inconsistent
> data. 
> 
> Is my understanding correct that  we realised that these are very rare
> cases and are not worth the performance penalty of writeback? In which
> case, is it really okay to just let the scope for corruption exist even
> though its rare. There might be some other error cases we might be
> missing which might be more easier to hit. For eg I think we can also
> fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to
> unwritten extent conversion causing an extent split leading to  extent
> tree node allocation. (Maybe can be avoided by using PRE_IO with
> EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call)
> 
> So does it make sense to retain the writeback behavior or am I just
> being paranoid :) 
> 

Hi, Ojaswin!

Yeah, from my point of view, ENOSPC could happen, and it may be more
likely to happen if we intentionally create conditions for it. However,
all the efforts we can make at this point are merely best efforts and
reduce the probability. We cannot 100% guarantee it will not happen,
even if we write back the whole range before manipulating extents and
blocks. This is because we do not accurately reserve space for extents
split. Additionally, In ext4_punch_hole(), we have used 'nofail' flag
while freeing blocks to reduce the possibility of ENOSPC. So I suppose
it's fine by now, but we may need to implement additional measures if
we truly want to resolve the issue completely.

Thanks,
Yi.

> 
>> Besides, similar to ext4_zero_range(), we must address the case of
>> partially punched folios when block size < page size. It is essential to
>> remove writable userspace mappings to ensure that the folio can be
>> faulted again during subsequent mmap write access.
>>
>> In journaled mode, we need to write dirty pages out before discarding
>> page cache in case of crash before committing the freeing data
>> transaction, which could expose old, stale data, even if synchronization
>> has been performed.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> ---
>>  fs/ext4/inode.c | 18 +++++-------------
>>  1 file changed, 5 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index bf735d06b621..a5ba2b71d508 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>  
>>  	trace_ext4_punch_hole(inode, offset, length, 0);
>>  
>> -	/*
>> -	 * Write out all dirty pages to avoid race conditions
>> -	 * Then release them.
>> -	 */
>> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>> -		ret = filemap_write_and_wait_range(mapping, offset,
>> -						   offset + length - 1);
>> -		if (ret)
>> -			return ret;
>> -	}
>> -
>>  	inode_lock(inode);
>>  
>>  	/* No need to punch hole beyond i_size */
>> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
>>  		if (ret)
>>  			goto out_dio;
>> -		truncate_pagecache_range(inode, first_block_offset,
>> -					 last_block_offset);
>> +
>> +		ret = ext4_truncate_page_cache_block_range(inode,
>> +				first_block_offset, last_block_offset + 1);
>> +		if (ret)
>> +			goto out_dio;
>>  	}
>>  
>>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> -- 
>> 2.46.1
>>