Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry

Jan Kara <jack@xxxxxxx> · Tue, 10 Dec 2024 13:57:26 +0100

On Mon 09-12-24 16:32:41, Zhang Yi wrote:
> On 2024/12/7 0:21, Jan Kara wrote:
> >>> I think you'll need to use atomic_t and appropriate functions here.
> >>>
> >>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> >>>>  	BUG_ON(end < lblk);
> >>>>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
> >>>>  
> >>>> +	ext4_es_inc_seq(inode);
> >>>
> >>> I'm somewhat wondering: Are extent status tree modifications the right
> >>> place to advance the sequence counter? The counter needs to advance
> >>> whenever the mapping information changes. This means that we'd be
> >>> needlessly advancing the counter (and thus possibly forcing retries) when
> >>> we are just adding new information from ordinary extent tree into cache.
> >>> Also someone can be doing extent tree manipulations without touching extent
> >>> status tree (if the information was already pruned from there). 
> >>
> >> Sorry, I don't quite understand here. IIUC, we can't modify the extent
> >> tree without also touching extent status tree; otherwise, the extent
> >> status tree will become stale, potentially leading to undesirable and
> >> unexpected outcomes later on, as the extent lookup paths rely on and
> >> always trust the status tree. If this situation happens, would it be
> >> considered a bug? Additionally, I have checked the code but didn't find
> >> any concrete cases where this could happen. Was I overlooked something?
> > 
> > What I'm worried about is that this seems a bit fragile because e.g. in
> > ext4_collapse_range() we do:
> > 
> > ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
> > <now go and manipulate the extent tree>
> > 
> > So if somebody managed to sneak in between ext4_es_remove_extent() and
> > the extent tree manipulation, he could get a block mapping which is shortly
> > after invalidated by the extent tree changes. And as I'm checking now,
> > writeback code *can* sneak in there because during extent tree
> > manipulations we call ext4_datasem_ensure_credits() which can drop
> > i_data_sem to restart a transaction.
> > 
> > Now we do writeout & invalidate page cache before we start to do these
> > extent tree dances so I don't see how this could lead to *actual* use
> > after free issues but it makes me somewhat nervous. So that's why I'd like
> > to have some clear rules from which it is obvious that the counter makes
> > sure we do not use stale mappings.
> 
> Yes, I see. I think the rule should be as follows:
> 
> First, when the iomap infrastructure is creating or querying file
> mapping information, we must ensure that the mapping information
> always passes through the extent status tree, which means
> ext4_map_blocks(), ext4_map_query_blocks(), and
> ext4_map_create_blocks() should cache the extent status entries that
> we intend to use.

OK, this currently holds. There's just one snag that during fastcommit
replay ext4_es_insert_extent() doesn't do anything. I don't think there's
any race possible during that stage but it's another case to think about.

> Second, when updating the extent tree, we must hold the i_data_sem in
> write mode and update the extent status tree atomically.

Fine.

> Additionally,
> if we cannot update the extent tree while holding a single i_data_sem,
> we should first remove all related extent status entries within the
> specified range, then manipulate the extent tree, ensuring that the
> extent status entries are always up-to-date if they exist (as
> ext4_collapse_range() does).

In this case, I think we need to provide more details. In particular I
would require that in all such cases we must:
a) hold i_rwsem exclusively and hold invalidate_lock exclusively ->
   provides exclusion against page faults, reads, writes
b) evict all page cache in the affected range -> should stop writeback -
   *but* currently there's one case which could be problematic. Assume we
   do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
   the above and starts removing blocks, needs to restart transaction so it
   drops i_data_sem. Writeback starts for page N+1, needs to load extent
   block into memory, ext4_cache_extents() now loads back some extents
   covering range 0..N into extent status tree. So the only protection
   against using freed blocks is that nobody should be mapping anything in
   the range 0..N because we hold those locks & have evicted page cache.

So I think we need to also document, that anybody mapping blocks needs to
hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
ext4_map_blocks() to catch cases we missed. Asserting for page lock will
not be really doable but luckily only page writeback needs that so that can
get some extemption from the assert.

> Finally, if we want to manipulate the extent tree without caching, we
> should also remove the extent status entries first.

Based on the above, I don't think this is really needed. We only must make
sure that after all extent tree updates are done and before we release
invalidate_lock, all extents from extent status tree in the modified range
must be evicted / replaced to match reality.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR