On Thu, Mar 09, 2023 at 04:06:49PM +0100, Jan Kara wrote:
> On Fri 27-01-23 18:07:38, Ojaswin Mujoo wrote:
> > CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
> > fact that we couldn't find a group in CR1 suggests that it would be
> > difficult to find a continuous extent to completely satisfy our
> > allocations. So before falling to the slower CR2, in CR1.5 we
> > proactively trim the preallocations so we can find a group with
> > (free / fragments) big enough. This speeds up our allocation at the
> > cost of slightly reduced preallocation.
> >
> > The patch also adds a new sysfs tunable:
> >
> > * /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order
> >
> > This controls how much CR1.5 can trim a request before falling to CR2.
> > For example, for a request of order 7 and max trim order 2, CR1.5 can
> > trim this up to order 5.
> >
> > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
>
> The idea looks good. Couple of questions below...
>
> > +/*
> > + * We couldn't find a group in CR1 so try to find the highest free fragment
> > + * order we have and proactively trim the goal request length to that order to
> > + * find a suitable group faster.
> > + *
> > + * This optimizes allocation speed at the cost of slightly reduced
> > + * preallocations. However, we make sure that we don't trim the request too
> > + * much and fall to CR2 in that case.
> > + */
> > +static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
> > +		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> > +{
> > +	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > +	struct ext4_group_info *grp = NULL;
> > +	int i, order, min_order;
> > +
> > +	if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
> > +		if (sbi->s_mb_stats)
> > +			atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
> > +	}
> > +
> > +	/*
> > +	 * mb_avg_fragment_size_order() returns order in a way that makes
> > +	 * retrieving back the length using (1 << order) inaccurate. Hence, use
> > +	 * fls() instead since we need to know the actual length while modifying
> > +	 * goal length.
> > +	 */
> > +	order = fls(ac->ac_g_ex.fe_len);
> > +	min_order = order - sbi->s_mb_cr1_5_max_trim_order;
>
> Given we still require the allocation contains at least originally
> requested blocks, is it ever the case that goal size would be 8 times
> larger than original alloc size? Otherwise the
> sbi->s_mb_cr1_5_max_trim_order logic seems a bit pointless...

Yes, that is possible. In ext4_mb_normalize_request(), for an original
request length < 8MB we actually determine the goal length based on the
length of the file (i_size) rather than the length of the original
request. For example:

	if (size <= 16 * 1024) {
		size = 16 * 1024;
	} else if (size <= 32 * 1024) {
		size = 32 * 1024;
	} else if (size <= 64 * 1024) {
		size = 64 * 1024;

and this goes all the way up to size = 8MB. So for a case where the file
is >8MB, even if the original len is 1 block (4KB), the goal len would
be 2048 blocks (8MB). That's why we decided to add a tunable depending
on the user's preference.

>
> > +	if (min_order < 0)
> > +		min_order = 0;
>
> Perhaps add:
>
> 	if (1 << min_order < ac->ac_o_ex.fe_len)
> 		min_order = fls(ac->ac_o_ex.fe_len) + 1;
>
> and then you can drop the condition from the loop below...

That looks better, will do it. Thanks!
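
To make the lower trim bound concrete, here is a minimal userspace sketch
of how it could be computed. fls_sketch() is only a stand-in for the
kernel's fls(), and cr1_5_min_order() is a hypothetical helper that is not
part of the patch. The clamp is written as a loop, which is an assumption
on my side and not the exact expression suggested above.

```c
/*
 * Userspace stand-in for the kernel's fls(): 1-based position of the
 * highest set bit, 0 when x == 0. Not the kernel implementation.
 */
static int fls_sketch(unsigned int x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/*
 * Hypothetical helper: lowest order CR1.5 would be allowed to trim to.
 * goal_len and orig_len are lengths in blocks, max_trim_order mirrors
 * the mb_cr1_5_max_trim_order tunable.
 */
static int cr1_5_min_order(unsigned int goal_len, unsigned int orig_len,
			   int max_trim_order)
{
	int min_order = fls_sketch(goal_len) - max_trim_order;

	if (min_order < 0)
		min_order = 0;

	/* Never trim below the originally requested length. */
	while (min_order < 31 && (1U << min_order) < orig_len)
		min_order++;

	return min_order;
}
```

With a goal of 2048 blocks and a max trim order of 3,
cr1_5_min_order(2048, 1, 3) gives 9, while cr1_5_min_order(2048, 600, 3)
gives 10 because 512 blocks would no longer cover the original request.
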
>
> > +
> > +	for (i = order; i >= min_order; i--) {
> > +		if (ac->ac_o_ex.fe_len <= (1 << i)) {
> > +			/*
> > +			 * Scale down goal len to make sure we find something
> > +			 * in the free fragments list. Basically, reduce
> > +			 * preallocations.
> > +			 */
> > +			ac->ac_g_ex.fe_len = 1 << i;
>
> When scaling down the size with sbi->s_stripe > 1, it would be better to
> choose multiple of sbi->s_stripe and not power of two. But our stripe
> support is fairly weak anyway (e.g. initial goal size does not reflect it
> at all AFAICT) so probably we don't care here either.

Oh right, I missed that. I'll make the change as it doesn't hurt to have
it here.

Thanks for the review!

regards,
ojaswin

>
> > +		} else {
> > +			break;
> > +		}
> > +
> > +		grp = ext4_mb_find_good_group_avg_frag_lists(ac,
> > +					mb_avg_fragment_size_order(ac->ac_sb,
> > +					ac->ac_g_ex.fe_len));
> > +		if (grp)
> > +			break;
> > +	}
> > +
> > +	if (grp) {
> > +		*group = grp->bb_group;
> > +		ac->ac_flags |= EXT4_MB_CR1_5_OPTIMIZED;
> > +	} else {
> > +		/* Reset goal length to original goal length before falling into CR2 */
> > +		ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
> > +		*new_cr = CR2;
> > +	}
> > +}
>
> 								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR