On 2024/12/10 20:57, Jan Kara wrote: > On Mon 09-12-24 16:32:41, Zhang Yi wrote: >> On 2024/12/7 0:21, Jan Kara wrote: >>>>> I think you'll need to use atomic_t and appropriate functions here. >>>>> >>>>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, >>>>>> BUG_ON(end < lblk); >>>>>> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED); >>>>>> >>>>>> + ext4_es_inc_seq(inode); >>>>> >>>>> I'm somewhat wondering: Are extent status tree modifications the right >>>>> place to advance the sequence counter? The counter needs to advance >>>>> whenever the mapping information changes. This means that we'd be >>>>> needlessly advancing the counter (and thus possibly forcing retries) when >>>>> we are just adding new information from ordinary extent tree into cache. >>>>> Also someone can be doing extent tree manipulations without touching extent >>>>> status tree (if the information was already pruned from there). >>>> >>>> Sorry, I don't quite understand here. IIUC, we can't modify the extent >>>> tree without also touching extent status tree; otherwise, the extent >>>> status tree will become stale, potentially leading to undesirable and >>>> unexpected outcomes later on, as the extent lookup paths rely on and >>>> always trust the status tree. If this situation happens, would it be >>>> considered a bug? Additionally, I have checked the code but didn't find >>>> any concrete cases where this could happen. Was I overlooked something? >>> >>> What I'm worried about is that this seems a bit fragile because e.g. in >>> ext4_collapse_range() we do: >>> >>> ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start) >>> <now go and manipulate the extent tree> >>> >>> So if somebody managed to sneak in between ext4_es_remove_extent() and >>> the extent tree manipulation, he could get a block mapping which is shortly >>> after invalidated by the extent tree changes. And as I'm checking now, >>> writeback code *can* sneak in there because during extent tree >>> manipulations we call ext4_datasem_ensure_credits() which can drop >>> i_data_sem to restart a transaction. >>> >>> Now we do writeout & invalidate page cache before we start to do these >>> extent tree dances so I don't see how this could lead to *actual* use >>> after free issues but it makes me somewhat nervous. So that's why I'd like >>> to have some clear rules from which it is obvious that the counter makes >>> sure we do not use stale mappings. >> >> Yes, I see. I think the rule should be as follows: >> >> First, when the iomap infrastructure is creating or querying file >> mapping information, we must ensure that the mapping information >> always passes through the extent status tree, which means >> ext4_map_blocks(), ext4_map_query_blocks(), and >> ext4_map_create_blocks() should cache the extent status entries that >> we intend to use. > > OK, this currently holds. There's just one snag that during fastcommit > replay ext4_es_insert_extent() doesn't do anything. I don't think there's > any race possible during that stage but it's another case to think about. OK. > >> Second, when updating the extent tree, we must hold the i_data_sem in >> write mode and update the extent status tree atomically. > > Fine. > >> Additionally, >> if we cannot update the extent tree while holding a single i_data_sem, >> we should first remove all related extent status entries within the >> specified range, then manipulate the extent tree, ensuring that the >> extent status entries are always up-to-date if they exist (as >> ext4_collapse_range() does). > > In this case, I think we need to provide more details. In particular I > would require that in all such cases we must: > a) hold i_rwsem exclusively and hold invalidate_lock exclusively -> > provides exclusion against page faults, reads, writes Yes. > b) evict all page cache in the affected range -> should stop writeback - > *but* currently there's one case which could be problematic. Assume we > do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of > the above and starts removing blocks, needs to restart transaction so it > drops i_data_sem. Writeback starts for page N+1, needs to load extent > block into memory, ext4_cache_extents() now loads back some extents > covering range 0..N into extent status tree. This completely confuses me. Do you mention the case below, There are many extent entries in the page range 0..N+1, for example, 0 N N+1 | |/ [www][wwwwww][wwwwwwww]...[wwwww][wwww]... | | N-a N-b Punch hole is removing each extent entries from N..0 (ext4_ext_remove_space() removes blocks from end to start), and could drop i_data_sem just after manipulating(removing) the extent entry [N-a,N-b], At the same time, a concurrent writeback start write back page N+1 since the writeback only hold page lock, doesn't hold i_rwsem and invalidate_lock. It may load back the extents 0..N-a into the extent status tree again while finding extent that contains N+1? Finally it may left some stale extent status entries after punch hole is done? If my understanding is correct, isn't that a problem that exists now? I mean without this patch series. > So the only protection > against using freed blocks is that nobody should be mapping anything in > the range 0..N because we hold those locks & have evicted page cache. > > So I think we need to also document, that anybody mapping blocks needs to > hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in > ext4_map_blocks() to catch cases we missed. Asserting for page lock will > not be really doable but luckily only page writeback needs that so that can > get some extemption from the assert. In the case above, it seems that merely holding a page lock is insufficient? > >> Finally, if we want to manipulate the extent tree without caching, we >> should also remove the extent status entries first. > > Based on the above, I don't think this is really needed. We only must make > sure that after all extent tree updates are done and before we release > invalidate_lock, all extents from extent status tree in the modified range > must be evicted / replaced to match reality. > Yeah, I agree with you. Thanks, Yi.