On Feb 9, 2021, at 1:28 PM, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:
>
> Instead of traversing through groups linearly, scan groups in specific
> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> largest free order >= the order of the request. So, with this patch,
> we maintain lists for each possible order and insert each group into a
> list based on the largest free order in its buddy bitmap. During cr 0
> allocation, we traverse these lists in the increasing order of largest
> free orders. This allows us to find a group with the best available cr
> 0 match in constant time. If nothing can be found, we fallback to cr 1
> immediately.
>
> At CR1, the story is slightly different. We want to traverse in the
> order of increasing average fragment size. For CR1, we maintain a rb
> tree of groupinfos which is sorted by average fragment size. Instead
> of traversing linearly, at CR1, we traverse in the order of increasing
> average fragment size, starting at the most optimal group. This brings
> down cr 1 search complexity to log(num groups).
>
> For cr >= 2, we just perform the linear search as before. Also, in
> case of lock contention, we intermittently fallback to linear search
> even in CR 0 and CR 1 cases. This allows us to proceed during the
> allocation path even in case of high contention.
>
> There is an opportunity to do optimization at CR2 too. That's because
> at CR2 we only consider groups where bb_free counter (number of free
> blocks) is greater than the request extent size. That's left as future
> work.
>
> All the changes introduced in this patch are protected under a new
> mount option "mb_optimize_scan".
>
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx>
> ---
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index b7f25120547d..63562f5f42f1 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
>
> +/*
> + * Choose next group by traversing largest_free_order lists. Return 0 if next
> + * group was selected optimally. Return 1 if next group was not selected
> + * optimally. Updates *new_cr if cr level needs an update.
> + */
> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> +			int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> +	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> +		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> +			continue;
> +		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> +		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> +			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> +			continue;
> +		}
> +		grp = NULL;
> +		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> +				    bb_largest_free_order_node) {
> +			/*
> +			 * Perform this check without a lock, once we lock
> +			 * the group, we'll perform this check again.
> +			 */

This comment is no longer correct.

> +			if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> +				grp = iter;
> +				break;
> +			}
> +		}
> +		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> +		if (grp)
> +			break;
> +	}
> +}
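(As an aside, the cr 1 helper is not quoted in this reply.  Going by the
commit message it is an rb-tree lookup for the group with the smallest
average fragment size that can still satisfy the request, conceptually
something like the sketch below.  This is only an illustration -- the helper
name and the rb-tree fields here are invented, not taken from the patch:

static struct ext4_group_info *
mb_find_group_by_avg_frag_size(struct ext4_sb_info *sbi, ext4_grpblk_t needed)
{
	/* invented field: rb-tree of groups keyed by average fragment size */
	struct rb_node *node = sbi->s_mb_avg_fragment_size_root.rb_node;
	struct ext4_group_info *best = NULL;

	while (node) {
		/* invented field: this group's node in that rb-tree */
		struct ext4_group_info *gi = rb_entry(node,
				struct ext4_group_info, bb_avg_fragment_size_rb);
		ext4_grpblk_t avg = gi->bb_fragments ?
				gi->bb_free / gi->bb_fragments : 0;

		if (avg >= needed) {
			/* big enough; remember it, look left for a tighter fit */
			best = gi;
			node = node->rb_left;
		} else {
			node = node->rb_right;
		}
	}
	return best;
}

That keeps the cr 1 selection at O(log groups) as the commit message says;
the real code of course also has to lock the tree and fall back to the
linear scan under contention.)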
> +/*
> + * ext4_mb_choose_next_group: choose next group for allocation.
> + *
> + * @ac	Allocation Context
> + * @new_cr	This is an output parameter. If the there is no good group available
> + *		at current CR level, this field is updated to indicate the new cr
> + *		level that should be used.
> + * @group	This is an input / output parameter. As an input it indicates the last
> + *		group used for allocation. As output, this field indicates the
> + *		next group that should be used.
> + * @ngroups	Total number of groups
> + */
> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> +		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> +	int ret;
> +
> +	*new_cr = ac->ac_criteria;
> +
> +	if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> +	    *new_cr >= 2 ||
> +	    !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> +		goto inc_and_return;

I still think it would be beneficial to check if the next group is good
before going to the list/tree.  That will reduce lock contention, and will
also avoid needless seeking between groups if possible.
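For example, something along these lines right before the cr 0 / cr 1
dispatch (untested, just to show the shape of the check; the wrap-around
mirrors inc_and_return):

	/* untested sketch: prefer the linearly-next group if it already looks good */
	ext4_group_t next = *group + 1 >= ngroups ? 0 : *group + 1;

	if (ext4_mb_good_group(ac, next, *new_cr)) {
		*group = next;
		return;
	}

A sequential writer that keeps filling adjacent groups would then never need
to touch the shared lists or the rb-tree at all.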
> +	if (*new_cr == 0) {
> +		ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> +		if (ret)
> +			goto inc_and_return;
> +	}
> +	if (*new_cr == 1) {
> +		ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> +		if (ret)
> +			goto inc_and_return;
> +	}
> +	return;
> +
> +inc_and_return:
> +	/*
> +	 * Artificially restricted ngroups for non-extent
> +	 * files makes group > ngroups possible on first loop.
> +	 */
> +	*group = *group + 1;
> +	if (*group >= ngroups)
> +		*group = 0;
> +}
> +
>  /*
>   * Cache the order of the largest free extent we have available in this block
>   * group.
> @@ -751,18 +1001,32 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>  static void
>  mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>  {
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  	int i;
> -	int bits;
>
> +	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> +		write_lock(&sbi->s_mb_largest_free_orders_locks[
> +					      grp->bb_largest_free_order]);
> +		list_del_init(&grp->bb_largest_free_order_node);
> +		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> +					      grp->bb_largest_free_order]);
> +	}
>  	grp->bb_largest_free_order = -1; /* uninit */
>
> -	bits = MB_NUM_ORDERS(sb) - 1;
> -	for (i = bits; i >= 0; i--) {
> +	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
>  		if (grp->bb_counters[i] > 0) {
>  			grp->bb_largest_free_order = i;
>  			break;
>  		}
>  	}
> +	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> +		write_lock(&sbi->s_mb_largest_free_orders_locks[
> +					      grp->bb_largest_free_order]);
> +		list_add_tail(&grp->bb_largest_free_order_node,
> +			      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> +		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> +					      grp->bb_largest_free_order]);
> +	}
>  }

This function would be more efficient to do the list move under a single
write lock if the order doesn't change.  The order loop would just save the
largest free order, then grab the write lock, do the list_del(), set
bb_largest_free_order, and list_add_tail(), with a little care for the cases
where the old or new order is unset:

static void
mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	int i, new_order = -1;

	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
		if (grp->bb_counters[i] > 0) {
			new_order = i;
			break;
		}
	}

	if (!test_opt2(sb, MB_OPTIMIZE_SCAN)) {
		grp->bb_largest_free_order = new_order;
		return;
	}

	if (grp->bb_largest_free_order >= 0) {
		write_lock(&sbi->s_mb_largest_free_orders_locks[
				grp->bb_largest_free_order]);
		list_del_init(&grp->bb_largest_free_order_node);
		if (new_order != grp->bb_largest_free_order) {
			write_unlock(&sbi->s_mb_largest_free_orders_locks[
					grp->bb_largest_free_order]);
			grp->bb_largest_free_order = new_order;
			/* no free extents left, keep the group off the lists */
			if (new_order < 0)
				return;
			write_lock(&sbi->s_mb_largest_free_orders_locks[
					new_order]);
		}
		list_add_tail(&grp->bb_largest_free_order_node,
			      &sbi->s_mb_largest_free_orders[
					grp->bb_largest_free_order]);
		write_unlock(&sbi->s_mb_largest_free_orders_locks[
				grp->bb_largest_free_order]);
	} else if (new_order >= 0) {
		/* previously uninitialized or empty group: add it to its list */
		grp->bb_largest_free_order = new_order;
		write_lock(&sbi->s_mb_largest_free_orders_locks[new_order]);
		list_add_tail(&grp->bb_largest_free_order_node,
			      &sbi->s_mb_largest_free_orders[new_order]);
		write_unlock(&sbi->s_mb_largest_free_orders_locks[new_order]);
	}
}

Cheers,
Andreas