On Feb 26, 2021, at 12:36 PM, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:

> Instead of traversing through groups linearly, scan groups in specific
> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> largest free order >= the order of the request. So, with this patch,
> we maintain lists for each possible order and insert each group into a
> list based on the largest free order in its buddy bitmap. During cr 0
> allocation, we traverse these lists in the increasing order of largest
> free orders. This allows us to find a group with the best available cr
> 0 match in constant time. If nothing can be found, we fallback to cr 1
> immediately.

Thanks for the updated patch, I think it looks pretty good, with a few
suggestions.

> At CR1, the story is slightly different. We want to traverse in the
> order of increasing average fragment size. For CR1, we maintain a rb
> tree of groupinfos which is sorted by average fragment size. Instead
> of traversing linearly, at CR1, we traverse in the order of increasing
> average fragment size, starting at the most optimal group. This brings
> down cr 1 search complexity to log(num groups).

One thing that came to mind here is that *average* fragment size is not
necessarily a good heuristic for allocation. Consider a group with a mean
fragment size of 2MB, made up of equal parts 4MB and 4KB free blocks. If
looking for many 2MB allocations, this would quickly cause the 4MB chunks
to be split (pushing the average below 2MB) even when there are *exactly*
2MB free chunks available in that group.

Another alternative (short of having a full "free extents" tree like XFS
does) would be to keep the rbtree sorted by *maximum* fragment size. That
would not be any more expensive (one entry per group) and would guarantee
that at least one free extent closely matches the size of the extent
being allocated.

I _suspect_ that the cr1 allocations are mostly for smaller files/tails
that are not power-of-two sizes (which would normally be handled by cr0,
except in pathological test cases), so finding an exact match is the
right thing to do. In that case, the cr0 pass would handle most of the
file's data, and cr1 would handle smaller files or tails that are not 2^n
extents. Filling (nearly) exact fragments would be good for space
efficiency, and would avoid fragmenting larger extents when that is not
needed.
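To make that concrete, here is an untested sketch of the cr1 lookup I
have in mind. The bb_max_frag and bb_max_frag_node fields are
hypothetical names (the patch as posted keys the tree on average
fragment size, not maximum); the point is just that a best-fit search
on the largest free extent has the same O(log groups) cost:

/*
 * Assumes bb_max_frag caches the largest free extent in the group
 * (maintained from the buddy data) and bb_max_frag_node links the
 * group into an rbtree sorted by that value.
 */
static struct ext4_group_info *
ext4_mb_find_by_max_frag(struct rb_root *root, ext4_grpblk_t len)
{
	struct rb_node *node = root->rb_node;
	struct ext4_group_info *best = NULL, *grp;

	while (node) {
		grp = rb_entry(node, struct ext4_group_info,
			       bb_max_frag_node);
		if (grp->bb_max_frag >= len) {
			/* large enough, look left for a tighter fit */
			best = grp;
			node = node->rb_left;
		} else {
			node = node->rb_right;
		}
	}
	return best;
}

Keeping the tree sorted by the largest free extent is the same
one-node-per-group cost as sorting by the average, but the group that
comes back is guaranteed to contain a single extent that covers the
request.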
That said, I'm not confident enough about this to suggest that it has to
be changed before landing the patch, just an observation that came to
mind. A good workload simulator (or an actual application benchmark)
would be needed to assess the difference here, since a synthetic workload
(e.g. filling alternate 2MB chunks) is not reflective of how the
filesystem would be used in real life.

A few minor comments inline...

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 161412070fef..bcfd849bc61e 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
>
> +/*
> + * ext4_mb_choose_next_group: choose next group for allocation.
> + *
> + * @ac		Allocation Context
> + * @new_cr	This is an output parameter. If the there is no good group available
> + *		at current CR level, this field is updated to indicate the new cr
> + *		level that should be used.
> + * @group	This is an input / output parameter. As an input it indicates the last
> + *		group used for allocation. As output, this field indicates the
> + *		next group that should be used.

No reason why these comments can't be wrapped at 80 columns.

> @@ -2938,6 +3273,20 @@ int ext4_mb_init(struct super_block *sb)
>  		spin_lock_init(&lg->lg_prealloc_lock);
>  	}
>
> +	if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
> +		sbi->s_mb_linear_limit = 0;
> +	else
> +		sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
> +#ifndef CONFIG_EXT4_DEBUG
> +	/*
> +	 * Disable mb_optimize scan if we don't have enough groups. If
> +	 * CONFIG_EXT4_DEBUG is set, we don't disable this MB_OPTIMIZE_SCAN even
> +	 * for small file systems. This allows us to test correctness on small
> +	 * file systems.
> +	 */
> +	if (ext4_get_groups_count(sb) < MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
> +		clear_opt2(sb, MB_OPTIMIZE_SCAN);
> +#endif

Making this a compile-time option makes it much harder to test. Having
this managed by the /proc mb_linear_scan tunable or a mount option would
be useful.
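Something along these lines (a sketch only, reusing the existing
EXT4_RW_ATTR_SBI_UI plumbing in fs/ext4/sysfs.c; the s_mb_scan_threshold
field and the ext4_mb_scan_optimized() helper are made-up names) would
let xfstests lower the threshold at runtime instead of requiring a
CONFIG_EXT4_DEBUG kernel:

/* fs/ext4/sysfs.c: expose the group-count threshold as a knob;
 * also add ATTR_LIST(mb_scan_threshold) to the ext4_attrs[] table */
EXT4_RW_ATTR_SBI_UI(mb_scan_threshold, s_mb_scan_threshold);

/* fs/ext4/mballoc.c: test the threshold at allocation time rather
 * than clearing MB_OPTIMIZE_SCAN once at mount time */
static inline int ext4_mb_scan_optimized(struct super_block *sb)
{
	return test_opt2(sb, MB_OPTIMIZE_SCAN) &&
	       ext4_get_groups_count(sb) >=
			EXT4_SB(sb)->s_mb_scan_threshold;
}

Then writing 0 to /sys/fs/ext4/<dev>/mb_scan_threshold enables the
optimized scan even on a tiny test filesystem, and the #ifdef goes
away entirely.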
Cheers,
Andreas