Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Wed, 11 Dec 2024 15:59:51 +0800

On 2024/12/10 20:57, Jan Kara wrote:
> On Mon 09-12-24 16:32:41, Zhang Yi wrote:
>> On 2024/12/7 0:21, Jan Kara wrote:
>>>>> I think you'll need to use atomic_t and appropriate functions here.
>>>>>
>>>>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>>>>>>  	BUG_ON(end < lblk);
>>>>>>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>>>>>  
>>>>>> +	ext4_es_inc_seq(inode);
>>>>>
>>>>> I'm somewhat wondering: Are extent status tree modifications the right
>>>>> place to advance the sequence counter? The counter needs to advance
>>>>> whenever the mapping information changes. This means that we'd be
>>>>> needlessly advancing the counter (and thus possibly forcing retries) when
>>>>> we are just adding new information from ordinary extent tree into cache.
>>>>> Also someone can be doing extent tree manipulations without touching extent
>>>>> status tree (if the information was already pruned from there). 
>>>>
>>>> Sorry, I don't quite understand here. IIUC, we can't modify the extent
>>>> tree without also touching extent status tree; otherwise, the extent
>>>> status tree will become stale, potentially leading to undesirable and
>>>> unexpected outcomes later on, as the extent lookup paths rely on and
>>>> always trust the status tree. If this situation happens, would it be
>>>> considered a bug? Additionally, I have checked the code but didn't find
>>>> any concrete cases where this could happen. Was I overlooked something?
>>>
>>> What I'm worried about is that this seems a bit fragile because e.g. in
>>> ext4_collapse_range() we do:
>>>
>>> ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
>>> <now go and manipulate the extent tree>
>>>
>>> So if somebody managed to sneak in between ext4_es_remove_extent() and
>>> the extent tree manipulation, he could get a block mapping which is shortly
>>> after invalidated by the extent tree changes. And as I'm checking now,
>>> writeback code *can* sneak in there because during extent tree
>>> manipulations we call ext4_datasem_ensure_credits() which can drop
>>> i_data_sem to restart a transaction.
>>>
>>> Now we do writeout & invalidate page cache before we start to do these
>>> extent tree dances so I don't see how this could lead to *actual* use
>>> after free issues but it makes me somewhat nervous. So that's why I'd like
>>> to have some clear rules from which it is obvious that the counter makes
>>> sure we do not use stale mappings.
>>
>> Yes, I see. I think the rule should be as follows:
>>
>> First, when the iomap infrastructure is creating or querying file
>> mapping information, we must ensure that the mapping information
>> always passes through the extent status tree, which means
>> ext4_map_blocks(), ext4_map_query_blocks(), and
>> ext4_map_create_blocks() should cache the extent status entries that
>> we intend to use.
> 
> OK, this currently holds. There's just one snag that during fastcommit
> replay ext4_es_insert_extent() doesn't do anything. I don't think there's
> any race possible during that stage but it's another case to think about.

OK.

> 
>> Second, when updating the extent tree, we must hold the i_data_sem in
>> write mode and update the extent status tree atomically.
> 
> Fine.
> 
>> Additionally,
>> if we cannot update the extent tree while holding a single i_data_sem,
>> we should first remove all related extent status entries within the
>> specified range, then manipulate the extent tree, ensuring that the
>> extent status entries are always up-to-date if they exist (as
>> ext4_collapse_range() does).
> 
> In this case, I think we need to provide more details. In particular I
> would require that in all such cases we must:
> a) hold i_rwsem exclusively and hold invalidate_lock exclusively ->
>    provides exclusion against page faults, reads, writes

Yes.

> b) evict all page cache in the affected range -> should stop writeback -
>    *but* currently there's one case which could be problematic. Assume we
>    do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
>    the above and starts removing blocks, needs to restart transaction so it
>    drops i_data_sem. Writeback starts for page N+1, needs to load extent
>    block into memory, ext4_cache_extents() now loads back some extents
>    covering range 0..N into extent status tree. 

This completely confuses me. Do you mention the case below,

There are many extent entries in the page range 0..N+1, for example,

   0                                  N N+1
   |                                  |/
  [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
                |      |
                N-a    N-b

Punch hole is removing each extent entries from N..0
(ext4_ext_remove_space() removes blocks from end to start), and could
drop i_data_sem just after manipulating(removing) the extent entry
[N-a,N-b], At the same time, a concurrent writeback start write back
page N+1 since the writeback only hold page lock, doesn't hold i_rwsem
and invalidate_lock. It may load back the extents 0..N-a into the
extent status tree again while finding extent that contains N+1?
Finally it may left some stale extent status entries after punch hole
is done?

If my understanding is correct, isn't that a problem that exists now?
I mean without this patch series.

>    So the only protection
>    against using freed blocks is that nobody should be mapping anything in
>    the range 0..N because we hold those locks & have evicted page cache.
> 
> So I think we need to also document, that anybody mapping blocks needs to
> hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
> ext4_map_blocks() to catch cases we missed. Asserting for page lock will
> not be really doable but luckily only page writeback needs that so that can
> get some extemption from the assert.

In the case above, it seems that merely holding a page lock is
insufficient?

> 
>> Finally, if we want to manipulate the extent tree without caching, we
>> should also remove the extent status entries first.
> 
> Based on the above, I don't think this is really needed. We only must make
> sure that after all extent tree updates are done and before we release
> invalidate_lock, all extents from extent status tree in the modified range
> must be evicted / replaced to match reality.
> 
Yeah, I agree with you.

Thanks,
Yi.