Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> writes: > When the length of best extent found is less than the length of goal extent > we need to make sure that the best extent atleast covers the start of the > original request. This is done by adjusting the ac_b_ex.fe_logical (logical > start) of the extent. > > While doing so, the current logic sometimes results in the best extent's logical > range overflowing the goal extent. Since this best extent is later added to the > inode preallocation list, we have a possibility of introducing overlapping > preallocations. This is discussed in detail here [1]. So if the best extent range overflows the goal extent range, then it causes the overlapping ranges to be added to the list is because at the time of normalization of the request during allocation, we decide goal start and end based on the existing PAs in the list. And if we end up adding the best found extent range which overflows the goal range, then that will add a PA whose logical start/end might overlap with an existing PA range. > > To fix this, replace the existing logic with the below logic for adjusting best > extent as it keeps fragmentation in check while ensuring logical range of best > extent doesn't overflow out of goal extent: > > 1. Check if best extent can be kept at end of goal range and still cover > original start. > 2. Else, check if best extent can be kept at start of goal range and still cover > original start. > 3. Else, keep the best extent at start of original request. > > Also, add a few extra BUG_ONs that might help catch errors faster. Thanks for the detailed explaination. I think it makes sense. > > [1] https://lore.kernel.org/r/Y+OGkVvzPN0RMv0O@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Yes, the example in the above link was helpful. > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> > --- > fs/ext4/mballoc.c | 49 ++++++++++++++++++++++++++++++----------------- > 1 file changed, 31 insertions(+), 18 deletions(-) Also this looks like, it was not a problem till now because while finding a right PA with existing list implementation, we adjust our allocation window based on all the elements in the inode PA list (full scan of inode PAs) But in case of rbtree, we try to do a log(n) search and if we have an overlapping range, then we might end up missing a subtree and calculate a wrong allocation request range. So the only problem in the current code would be that we might have some overlapping ranges in the inode PAs which means we have some extra reservations done for the same logical file ranges. But other than that I guess we don't have any other issue. Is that also the reason why we are not adding any fixes tag and cc'ing it to stable? I have reviewed the patch along with my understanding above. Looks good to me. So please feel free to add - Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index fdb9d0a8f35d..ba9d26e2f2aa 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4330,6 +4330,7 @@ static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac, > BUG_ON(start < pa->pa_pstart); > BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len)); > BUG_ON(pa->pa_free < len); > + BUG_ON(ac->ac_b_ex.fe_len <= 0); > pa->pa_free -= len; > > mb_debug(ac->ac_sb, "use %llu/%d from inode pa %p\n", start, len, pa); > @@ -4668,10 +4669,8 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) > pa = ac->ac_pa; > > if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) { > - int winl; > - int wins; > - int win; > - int offs; > + int new_bex_start; > + int new_bex_end; > > /* we can't allocate as much as normalizer wants. > * so, found space must get proper lstart > @@ -4679,26 +4678,40 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac) > BUG_ON(ac->ac_g_ex.fe_logical > ac->ac_o_ex.fe_logical); > BUG_ON(ac->ac_g_ex.fe_len < ac->ac_o_ex.fe_len); > > - /* we're limited by original request in that > - * logical block must be covered any way > - * winl is window we can move our chunk within */ > - winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical; > + /* > + * Use the below logic for adjusting best extent as it keeps > + * fragmentation in check while ensuring logical range of best > + * extent doesn't overflow out of goal extent: > + * > + * 1. Check if best ex can be kept at end of goal and still > + * cover original start > + * 2. Else, check if best ex can be kept at start of goal and > + * still cover original start > + * 3. Else, keep the best ex at start of original request. > + */ > + new_bex_end = ac->ac_g_ex.fe_logical + > + EXT4_C2B(sbi, ac->ac_g_ex.fe_len); > + new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len); > + if (ac->ac_o_ex.fe_logical >= new_bex_start) > + goto adjust_bex; > > - /* also, we should cover whole original request */ > - wins = EXT4_C2B(sbi, ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len); > + new_bex_start = ac->ac_g_ex.fe_logical; > + new_bex_end = > + new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len); > + if (ac->ac_o_ex.fe_logical < new_bex_end) > + goto adjust_bex; > > - /* the smallest one defines real window */ > - win = min(winl, wins); > + new_bex_start = ac->ac_o_ex.fe_logical; > + new_bex_end = > + new_bex_start + EXT4_C2B(sbi, ac->ac_b_ex.fe_len); > > - offs = ac->ac_o_ex.fe_logical % > - EXT4_C2B(sbi, ac->ac_b_ex.fe_len); > - if (offs && offs < win) > - win = offs; > +adjust_bex: > + ac->ac_b_ex.fe_logical = new_bex_start; > > - ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - > - EXT4_NUM_B2C(sbi, win); > BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical); > BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len); > + BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical + > + EXT4_C2B(sbi, ac->ac_g_ex.fe_len))); > } > > /* preallocation can change ac_b_ex, thus we store actually > -- > 2.31.1