Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry

Jan Kara <jack@xxxxxxx> · Wed, 11 Dec 2024 17:00:47 +0100

On Wed 11-12-24 15:59:51, Zhang Yi wrote:
> On 2024/12/10 20:57, Jan Kara wrote:
> > On Mon 09-12-24 16:32:41, Zhang Yi wrote:
> > b) evict all page cache in the affected range -> should stop writeback -
> >    *but* currently there's one case which could be problematic. Assume we
> >    do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
> >    the above and starts removing blocks, needs to restart transaction so it
> >    drops i_data_sem. Writeback starts for page N+1, needs to load extent
> >    block into memory, ext4_cache_extents() now loads back some extents
> >    covering range 0..N into extent status tree. 
> 
> This completely confuses me. Do you mention the case below,
> 
> There are many extent entries in the page range 0..N+1, for example,
> 
>    0                                  N N+1
>    |                                  |/
>   [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
>                 |      |
>                 N-a    N-b
> 
> Punch hole is removing each extent entries from N..0
> (ext4_ext_remove_space() removes blocks from end to start), and could
> drop i_data_sem just after manipulating(removing) the extent entry
> [N-a,N-b], At the same time, a concurrent writeback start write back
> page N+1 since the writeback only hold page lock, doesn't hold i_rwsem
> and invalidate_lock. It may load back the extents 0..N-a into the
> extent status tree again while finding extent that contains N+1?

Yes, because when we load extents from extent tree, we insert all extents
from the leaf of the extent tree into extent status tree. That's what
ext4_cache_extents() call does.

> Finally it may left some stale extent status entries after punch hole
> is done?

Yes, there may be stale extents in the extent status tree when
ext4_ext_remove_space() returns. But punch hole in particular then does:

ext4_es_insert_extent(inode, first_block, hole_len, ~0,
                                      EXTENT_STATUS_HOLE, 0);

which overwrites these stale extents with appropriate information.

> If my understanding is correct, isn't that a problem that exists now?
> I mean without this patch series.

Yes, the situation isn't really related to your patches. But with your
patches we are starting to rely even more on extent status tree vs extent
tree consistecy. So I wanted to spell out this situation to verify new
problem isn't introduced and so that we create rules that handle this
situation well.

> >    So the only protection
> >    against using freed blocks is that nobody should be mapping anything in
> >    the range 0..N because we hold those locks & have evicted page cache.
> > 
> > So I think we need to also document, that anybody mapping blocks needs to
> > hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
> > ext4_map_blocks() to catch cases we missed. Asserting for page lock will
> > not be really doable but luckily only page writeback needs that so that can
> > get some extemption from the assert.
> 
> In the case above, it seems that merely holding a page lock is
> insufficient?

Well, holding page lock(s) for the range you are operating on is enough to
make sure there cannot be parallel operations on that range like truncate,
punch hole or similar, because they always remove the page cache before
starting their work and because they hold invalidate_lock, new pages cannot
be created while they are working.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR