Re: [PATCH v2 4/5] ext4: improve cr 0 / cr 1 group scanning

Andreas Dilger <adilger@xxxxxxxxx> · Thu, 11 Feb 2021 03:30:31 -0700

On Feb 9, 2021, at 1:28 PM, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:
> 
> Instead of traversing through groups linearly, scan groups in specific
> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> largest free order >= the order of the request. So, with this patch,
> we maintain lists for each possible order and insert each group into a
> list based on the largest free order in its buddy bitmap. During cr 0
> allocation, we traverse these lists in the increasing order of largest
> free orders. This allows us to find a group with the best available cr
> 0 match in constant time. If nothing can be found, we fallback to cr 1
> immediately.
> 
> At CR1, the story is slightly different. We want to traverse in the
> order of increasing average fragment size. For CR1, we maintain a rb
> tree of groupinfos which is sorted by average fragment size. Instead
> of traversing linearly, at CR1, we traverse in the order of increasing
> average fragment size, starting at the most optimal group. This brings
> down cr 1 search complexity to log(num groups).
> 
> For cr >= 2, we just perform the linear search as before. Also, in
> case of lock contention, we intermittently fallback to linear search
> even in CR 0 and CR 1 cases. This allows us to proceed during the
> allocation path even in case of high contention.
> 
> There is an opportunity to do optimization at CR2 too. That's because
> at CR2 we only consider groups where bb_free counter (number of free
> blocks) is greater than the request extent size. That's left as future
> work.
> 
> All the changes introduced in this patch are protected under a new
> mount option "mb_optimize_scan".
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx>
> ---
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index b7f25120547d..63562f5f42f1 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> 
> +/*
> + * Choose next group by traversing largest_free_order lists. Return 0 if next
> + * group was selected optimally. Return 1 if next group was not selected
> + * optimally. Updates *new_cr if cr level needs an update.
> + */
> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> +		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> +	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> +		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> +			continue;
> +		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> +		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> +			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> +			continue;
> +		}
> +		grp = NULL;
> +		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> +				    bb_largest_free_order_node) {
> +			/*
> +			 * Perform this check without a lock, once we lock
> +			 * the group, we'll perform this check again.
> +			 */

This comment is no longer correct.

> +			if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> +				grp = iter;
> +				break;
> +			}
> +		}
> +		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> +		if (grp)
> +			break;
> +	}
> +}

> +/*
> + * ext4_mb_choose_next_group: choose next group for allocation.
> + *
> + * @ac        Allocation Context
> + * @new_cr    This is an output parameter. If the there is no good group available
> + *            at current CR level, this field is updated to indicate the new cr
> + *            level that should be used.
> + * @group     This is an input / output parameter. As an input it indicates the last
> + *            group used for allocation. As output, this field indicates the
> + *            next group that should be used.
> + * @ngroups   Total number of groups
> + */
> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> +		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> +	int ret;
> +
> +	*new_cr = ac->ac_criteria;
> +
> +	if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> +	    *new_cr >= 2 ||
> +	    !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> +		goto inc_and_return;

I still think it would be beneficial to check if the next group is good
before going to the list/tree.  That will reduce lock contention, and
will also avoid needless seeking between groups if possible.

> +	if (*new_cr == 0) {
> +		ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> +		if (ret)
> +			goto inc_and_return;
> +	}
> +	if (*new_cr == 1) {
> +		ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> +		if (ret)
> +			goto inc_and_return;
> +	}
> +	return;
> +
> +inc_and_return:
> +	/*
> +	 * Artificially restricted ngroups for non-extent
> +	 * files makes group > ngroups possible on first loop.
> +	 */
> +	*group = *group + 1;
> +	if (*group >= ngroups)
> +		*group = 0;
> +}
> +
> /*
>  * Cache the order of the largest free extent we have available in this block
>  * group.
> @@ -751,18 +1001,32 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> static void
> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> {
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> 	int i;
> -	int bits;
> 
> +	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> +		write_lock(&sbi->s_mb_largest_free_orders_locks[
> +					      grp->bb_largest_free_order]);
> +		list_del_init(&grp->bb_largest_free_order_node);
> +		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> +					      grp->bb_largest_free_order]);
> +	}
> 	grp->bb_largest_free_order = -1; /* uninit */
> 
> -	bits = MB_NUM_ORDERS(sb) - 1;
> -	for (i = bits; i >= 0; i--) {
> +	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> 		if (grp->bb_counters[i] > 0) {
> 			grp->bb_largest_free_order = i;
> 			break;
> 		}
> 	}
> +	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> +		write_lock(&sbi->s_mb_largest_free_orders_locks[
> +					      grp->bb_largest_free_order]);
> +		list_add_tail(&grp->bb_largest_free_order_node,
> +		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> +		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> +					      grp->bb_largest_free_order]);
> +	}
> }

This function would be more efficient to do the list move under a single
write lock if the order doesn't change.  The order loop would just
save the largest free order, then grab the write lock, do the list_del(),
set bb_largest_free_order, and list_add_tail():

mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	int i, new_order = -1;

	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
		if (grp->bb_counters[i] > 0) {
			new_order = i;
			break;
		}
	}
	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
		write_lock(&sbi->s_mb_largest_free_orders_locks[
					      grp->bb_largest_free_order]);
		list_del_init(&grp->bb_largest_free_order_node);

		if (new_order != grp->bb_largest_free_order) {
			write_unlock(&sbi->s_mb_largest_free_orders_locks[
					      grp->bb_largest_free_order]);
			grp->bb_largest_free_order = new_order;
			write_lock(&sbi->s_mb_largest_free_orders_locks[
					      grp->bb_largest_free_order]);
		}
		list_add_tail(&grp->bb_largest_free_order_node,
		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
		write_unlock(&sbi->s_mb_largest_free_orders_locks[
					      grp->bb_largest_free_order]);
	}
}

Cheers, Andreas

Attachment:
signature.asc

Description: Message signed with OpenPGP