Re: [PATCH v2 06/10] ext4: update delalloc data reserve spcae in ext4_es_insert_extent()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri 09-08-24 11:35:49, Zhang Yi wrote:
> On 2024/8/9 2:36, Jan Kara wrote:
> > On Thu 08-08-24 19:18:30, Zhang Yi wrote:
> >> On 2024/8/8 1:41, Jan Kara wrote:
> >>> On Fri 02-08-24 19:51:16, Zhang Yi wrote:
> >>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>>>
> >>>> Now that we update data reserved space for delalloc after allocating
> >>>> new blocks in ext4_{ind|ext}_map_blocks(), and if bigalloc feature is
> >>>> enabled, we also need to query the extents_status tree to calculate the
> >>>> exact reserved clusters. This is complicated now and it appears that
> >>>> it's better to do this job in ext4_es_insert_extent(), because
> >>>> __es_remove_extent() have already count delalloc blocks when removing
> >>>> delalloc extents and __revise_pending() return new adding pending count,
> >>>> we could update the reserved blocks easily in ext4_es_insert_extent().
> >>>>
> >>>> Thers is one special case needs to concern is the quota claiming, when
> >>>> bigalloc is enabled, if the delayed cluster allocation has been raced
> >>>> by another no-delayed allocation(e.g. from fallocate) which doesn't
> >>>> cover the delayed blocks:
> >>>>
> >>>>   |<       one cluster       >|
> >>>>   hhhhhhhhhhhhhhhhhhhdddddddddd
> >>>>   ^            ^
> >>>>   |<          >| < fallocate this range, don't claim quota again
> >>>>
> >>>> We can't claim quota as usual because the fallocate has already claimed
> >>>> it in ext4_mb_new_blocks(), we could notice this case through the
> >>>> removed delalloc blocks count.
> >>>>
> >>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>> ...
> >>>> @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> >>>>  			__free_pending(pr);
> >>>>  			pr = NULL;
> >>>>  		}
> >>>> +		pending = err3;
> >>>>  	}
> >>>>  error:
> >>>>  	write_unlock(&EXT4_I(inode)->i_es_lock);
> >>>> +	/*
> >>>> +	 * Reduce the reserved cluster count to reflect successful deferred
> >>>> +	 * allocation of delayed allocated clusters or direct allocation of
> >>>> +	 * clusters discovered to be delayed allocated.  Once allocated, a
> >>>> +	 * cluster is not included in the reserved count.
> >>>> +	 *
> >>>> +	 * When bigalloc is enabled, allocating non-delayed allocated blocks
> >>>> +	 * which belong to delayed allocated clusters (from fallocate, filemap,
> >>>> +	 * DIO, or clusters allocated when delalloc has been disabled by
> >>>> +	 * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(),
> >>>> +	 * so release the quota reservations made for any previously delayed
> >>>> +	 * allocated clusters.
> >>>> +	 */
> >>>> +	resv_used = rinfo.delonly_cluster + pending;
> >>>> +	if (resv_used)
> >>>> +		ext4_da_update_reserve_space(inode, resv_used,
> >>>> +					     rinfo.delonly_block);
> >>>
> >>> I'm not sure I understand here. We are inserting extent into extent status
> >>> tree. We are replacing resv_used clusters worth of space with delayed
> >>> allocation reservation with normally allocated clusters so we need to
> >>> release the reservation (mballoc already reduced freeclusters counter).
> >>> That I understand. In normal case we should also claim quota because we are
> >>> converting from reserved into allocated state. Now if we allocated blocks
> >>> under this range (e.g. from fallocate()) without
> >>> EXT4_GET_BLOCKS_DELALLOC_RESERVE, we need to release quota reservation here
> >>> instead of claiming it. But I fail to see how rinfo.delonly_block > 0 is
> >>> related to whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating
> >>> blocks for this extent or not.
> >>
> >> Oh, this is really complicated due to the bigalloc feature, please let me
> >> explain it more clearly by listing all related situations.
> >>
> >> There are 2 types of paths of allocating delayed/reserved cluster:
> >> 1. Normal case, normally allocate delayed clusters from the write back path.
> >> 2. Special case, allocate blocks under this delayed range, e.g. from
> >>    fallocate().
> >>
> >> There are 4 situations below:
> >>
> >> A. bigalloc is disabled. This case is simple, after path 2, we don't need
> >>    to distinguish path 1 and 2, when calling ext4_es_insert_extent(), we
> >>    set EXT4_GET_BLOCKS_DELALLOC_RESERVE after EXT4_MAP_DELAYED bit is
> >>    detected. If the flag is set, we must be replacing a delayed extent and
> >>    rinfo.delonly_block must be > 0. So rinfo.delonly_block > 0 is equal
> >>    to set EXT4_GET_BLOCKS_DELALLOC_RESERVE.
> > 
> > Right. So fallocate() will call ext4_map_blocks() and
> > ext4_es_lookup_extent() will find delayed extent and set EXT4_MAP_DELAYED
> > which you (due to patch 2 of this series) transform into
> > EXT4_GET_BLOCKS_DELALLOC_RESERVE. We used to update the delalloc
> > accounting through in ext4_ext_map_blocks() but this patch moved the update
> > to ext4_es_insert_extent(). But there is one cornercase even here AFAICT:
> > 
> > Suppose fallocate is called for range 0..16k, we have delalloc extent at
> > 8k..16k. In this case ext4_map_blocks() at block 0 will not find the
> > delalloc extent but ext4_ext_map_blocks() will allocate 16k from mballoc
> > without using delalloc reservation but then ext4_es_insert_extent() will
> > still have rinfo.delonly_block > 0 so we claim the quota reservation
> > instead of releasing it?
> > 
> 
> After commit 6430dea07e85 ("ext4: correct the hole length returned by
> ext4_map_blocks()"), the fallocate range 0-16K would be divided into two
> rounds. When we first calling ext4_map_blocks() with 0-16K, the map range
> will be corrected to 0-8k by ext4_ext_determine_insert_hole() and the
> allocating range should not cover any delayed range.

Eww, subtle, subtle, subtle... And isn't this also racy? We drop i_data_sem
in ext4_map_blocks() after we do the initial lookup. So there can be some
changes to both the extent tree and extent status tree before we grab
i_data_sem again for the allocation. We hold inode_lock so there can be
only writeback and page faults racing with us but e.g. ext4_page_mkwrite()
-> block_page_mkwrite -> ext4_da_get_block_prep() -> ext4_da_map_blocks()
can add delayed extent into extent status tree in that window causing
breakage, can't it?

> Then
> ext4_alloc_file_blocks() will call ext4_map_blocks() again to allocate
> 8K-16K in the second round, in this round, we are allocating a real
> delayed range. Please below graph for details,
> 
> ext4_alloc_file_blocks() //0-16K
>  ext4_map_blocks()  //0-16K
>   ext4_es_lookup_extent() //find nothing
>    ext4_ext_map_blocks(0)
>     ext4_ext_determine_insert_hole() //change map range to 0-8K
>    ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE) //allocate blocks under hole
>  ext4_map_blocks()  //8-16K
>   ext4_es_lookup_extent() //find delayed extent
>   ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE)
>     //allocate blocks under a whole delayed range,
>     //use rinfo.delonly_block > 0 is okay
> 
> Hence the allocating range can't mixed with delayed and non-delayed extent
> at a time, and the rinfo.delonly_block > 0 should work.

Besides the race above I agree. So either we need to trim mapping extent in
ext4_map_blocks() after re-acquiring i_data_sem or we need to deal with
unwritten extents that are partially delalloc. I'm more and more leaning
towards just passing the information whether delalloc was used or not to
extent status tree insertion. Because that can deal with partial extents
just fine...

Thanks for your patience with me :).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux