Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Thu, 12 Dec 2024 10:32:01 +0800

On 2024/12/12 0:00, Jan Kara wrote:
> On Wed 11-12-24 15:59:51, Zhang Yi wrote:
>> On 2024/12/10 20:57, Jan Kara wrote:
>>> On Mon 09-12-24 16:32:41, Zhang Yi wrote:
>>> b) evict all page cache in the affected range -> should stop writeback -
>>>    *but* currently there's one case which could be problematic. Assume we
>>>    do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
>>>    the above and starts removing blocks, needs to restart transaction so it
>>>    drops i_data_sem. Writeback starts for page N+1, needs to load extent
>>>    block into memory, ext4_cache_extents() now loads back some extents
>>>    covering range 0..N into extent status tree. 
>>
>> This completely confuses me. Do you mention the case below,
>>
>> There are many extent entries in the page range 0..N+1, for example,
>>
>>    0                                  N N+1
>>    |                                  |/
>>   [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
>>                 |      |
>>                 N-a    N-b
>>
>> Punch hole is removing each extent entries from N..0
>> (ext4_ext_remove_space() removes blocks from end to start), and could
>> drop i_data_sem just after manipulating(removing) the extent entry
>> [N-a,N-b], At the same time, a concurrent writeback start write back
>> page N+1 since the writeback only hold page lock, doesn't hold i_rwsem
>> and invalidate_lock. It may load back the extents 0..N-a into the
>> extent status tree again while finding extent that contains N+1?
> 
> Yes, because when we load extents from extent tree, we insert all extents
> from the leaf of the extent tree into extent status tree. That's what
> ext4_cache_extents() call does.
> 
>> Finally it may left some stale extent status entries after punch hole
>> is done?
> 
> Yes, there may be stale extents in the extent status tree when
> ext4_ext_remove_space() returns. But punch hole in particular then does:
> 
> ext4_es_insert_extent(inode, first_block, hole_len, ~0,
>                                       EXTENT_STATUS_HOLE, 0);
> 
> which overwrites these stale extents with appropriate information.
> 

Yes, that's correct! I missed this insert yesterday. It looks fine now,
as it holds the i_rwsem and invalidate_lock, and has evicted the page
cache in this case. Thanks a lot for your detail explanation. I will
add these document in my next iteration.

Thanks!
Yi.

>> If my understanding is correct, isn't that a problem that exists now?
>> I mean without this patch series.
> 
> Yes, the situation isn't really related to your patches. But with your
> patches we are starting to rely even more on extent status tree vs extent
> tree consistecy. So I wanted to spell out this situation to verify new
> problem isn't introduced and so that we create rules that handle this
> situation well.
> 
>>>    So the only protection
>>>    against using freed blocks is that nobody should be mapping anything in
>>>    the range 0..N because we hold those locks & have evicted page cache.
>>>
>>> So I think we need to also document, that anybody mapping blocks needs to
>>> hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
>>> ext4_map_blocks() to catch cases we missed. Asserting for page lock will
>>> not be really doable but luckily only page writeback needs that so that can
>>> get some extemption from the assert.
>>
>> In the case above, it seems that merely holding a page lock is
>> insufficient?
> 
> Well, holding page lock(s) for the range you are operating on is enough to
> make sure there cannot be parallel operations on that range like truncate,
> punch hole or similar, because they always remove the page cache before
> starting their work and because they hold invalidate_lock, new pages cannot
> be created while they are working.
> 
> 								Honza