On Mar 30, 2021, at 09:20, harshad shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:
>
> I ran some experiments this morning and here are the results:
>
> 1) Config 1:
>    - add change to skip CR0 / CR1 optimizations and fall back to
>      linear search when all groups are not loaded
>    - prefetch block bitmaps mount option turned off
>    - first_write_latency: 6m 56s
>    - number_of_groups_loaded_at_the_end_of_first_write: 30726 out of
>      491520 total
>
> 2) Config 2:
>    - add change to skip CR0 / CR1 optimizations and fall back to
>      linear search when all groups are not loaded
>    - prefetch block bitmaps mount option turned on
>    - first_write_latency: 4m 42s
>    - number_of_groups_loaded_at_the_end_of_first_write: 490497 out
>      of 491520 total
>
> 3) Config 3:
>    - don't fall back to linear search
>    - prefetch block bitmaps turned off
>    - first_write_latency: 0.13s
>    - number_of_groups_loaded_at_the_end_of_first_write: 7 out of 491520 total
>
> 4) Config 4:
>    - don't fall back to linear search
>    - prefetch block bitmaps turned on
>    - first_write_latency: 0.15s
>    - number_of_groups_loaded_at_the_end_of_first_write: 48384 out of
>      491520 total
>
> Based on the above numbers, I think we should go for config 4, which
> means we should have prefetch_block_bitmaps turned on when
> mb_optimize_scan is on. What do you think about turning the
> prefetch_block_bitmaps mount option on by default?

I think that makes sense.

>
> - Harshad
>
>> On Mon, Mar 29, 2021 at 9:10 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>>
>>> On Mar 29, 2021, at 17:03, harshad shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:
>>>
>>> Another problem that I found in the patch was that when all groups are
>>> not loaded, the optimization concludes that CR0 and CR1 cannot be used
>>> for allocation (since these structures aren't populated). The problem
>>> goes away if we also mount with the prefetch_block_bitmaps mount option.
>>> But I think if all the groups are not loaded, we should at least fall
>>> back to linear search at CR1. Maybe skipping CR0 is okay.
>>
>> Alex's patch to skip cr0/1 if no groups are loaded is OK, as long as
>> some group prefetch is triggered in the background, IMHO, and is there
>> for later allocations.
>>
>> That avoids lots of scanning at mount time if the filesystem is mostly full,
>> and the first allocation is not critical as long as this triggers group reads.
>>
>> Do you have any idea what the performance impact of this is (e.g.
>> latency to first write)? Ideally there would be some group prefetch done
>> right at mount time (itable zero thread?) so that if writes are not
>> done *right* after mount there would not be any noticeable latency.
>>
>> Cheers, Andreas
>>
>>>> On Mon, Mar 29, 2021 at 12:32 PM harshad shirwadkar
>>>> <harshadshirwadkar@xxxxxxxxx> wrote:
>>>>
>>>> Okay, sounds good. I'll do that!
>>>>
>>>>> On Mon, Mar 29, 2021 at 12:14 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>>>>>
>>>>> On Mar 24, 2021, at 5:19 PM, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:
>>>>>>
>>>>>> Instead of traversing through groups linearly, scan groups in specific
>>>>>> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
>>>>>> largest free order >= the order of the request. So, with this patch,
>>>>>> we maintain lists for each possible order and insert each group into a
>>>>>> list based on the largest free order in its buddy bitmap. During cr 0
>>>>>> allocation, we traverse these lists in the increasing order of largest
>>>>>> free orders.
>>>>>> This allows us to find a group with the best available cr 0 match in
>>>>>> constant time. If nothing can be found, we fall back to cr 1
>>>>>> immediately.
>>>>>>
>>>>>> At CR1, the story is slightly different. We want to traverse in the
>>>>>> order of increasing average fragment size. For CR1, we maintain a rb
>>>>>> tree of groupinfos which is sorted by average fragment size. Instead
>>>>>> of traversing linearly, at CR1, we traverse in the order of increasing
>>>>>> average fragment size, starting at the most optimal group. This brings
>>>>>> down cr 1 search complexity to log(num groups).
>>>>>>
>>>>>> For cr >= 2, we just perform the linear search as before. Also, in
>>>>>> case of lock contention, we intermittently fall back to linear search
>>>>>> even in CR 0 and CR 1 cases. This allows us to proceed during the
>>>>>> allocation path even in case of high contention.
>>>>>
>>>>> If you are refreshing this patch anyway, could you rename "mb_linear_limit"
>>>>> to "mb_linear_groups" or "mb_max_linear_groups" or similar? It otherwise
>>>>> isn't clear what units the "limit" is in (could be MB/GB or something else).
>>>>>
>>>>> Cheers, Andreas
>>>>>
>>>>>> There is an opportunity to do optimization at CR2 too. That's because
>>>>>> at CR2 we only consider groups where the bb_free counter (number of free
>>>>>> blocks) is greater than the request extent size. That's left as future
>>>>>> work.
>>>>>>
>>>>>> All the changes introduced in this patch are protected under a new
>>>>>> mount option "mb_optimize_scan".
>>>>>>
>>>>>> With this patchset, the following experiment was performed:
>>>>>>
>>>>>> Created a highly fragmented disk of size 65TB. The disk had no
>>>>>> contiguous 2M regions. The following command was run consecutively 3
>>>>>> times:
>>>>>>
>>>>>> time dd if=/dev/urandom of=file bs=2M count=10
>>>>>>
>>>>>> Here are the results with and without the cr 0/1 optimizations introduced
>>>>>> in this patch:
>>>>>>
>>>>>> |---------+------------------------------+---------------------------|
>>>>>> |         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
>>>>>> |---------+------------------------------+---------------------------|
>>>>>> | 1st run | 5m1.871s                     | 2m47.642s                 |
>>>>>> | 2nd run | 2m28.390s                    | 0m0.611s                  |
>>>>>> | 3rd run | 2m26.530s                    | 0m1.255s                  |
>>>>>> |---------+------------------------------+---------------------------|
>>>>>>
>>>>>> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx>
>>>>>> Reported-by: kernel test robot <lkp@xxxxxxxxx>
>>>>>> Reported-by: Dan Carpenter <dan.carpenter@xxxxxxxxxx>
>>>>>> Reviewed-by: Andreas Dilger <adilger@xxxxxxxxx>
>>>>>> ---
>>>>>>  fs/ext4/ext4.h    |  19 ++-
>>>>>>  fs/ext4/mballoc.c | 381 ++++++++++++++++++++++++++++++++++++++++++++--
>>>>>>  fs/ext4/mballoc.h |  17 ++-
>>>>>>  fs/ext4/super.c   |  28 +++-
>>>>>>  fs/ext4/sysfs.c   |   2 +
>>>>>>  5 files changed, 432 insertions(+), 15 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>> index 85eeeba3bca3..5930c8cb5159 100644
>>>>>> --- a/fs/ext4/ext4.h
>>>>>> +++ b/fs/ext4/ext4.h
>>>>>> @@ -162,7 +162,10 @@ enum SHIFT_DIRECTION {
>>>>>>  #define EXT4_MB_USE_RESERVED 0x2000
>>>>>>  /* Do strict check for free blocks while retrying block allocation */
>>>>>>  #define EXT4_MB_STRICT_CHECK 0x4000
>>>>>> -
>>>>>> +/* Large fragment size list lookup succeeded at least once for cr = 0 */
>>>>>> +#define EXT4_MB_CR0_OPTIMIZED 0x8000
>>>>>> +/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
>>>>>> +#define EXT4_MB_CR1_OPTIMIZED 0x00010000
>>>>>>
>>>>>>  struct ext4_allocation_request {
>>>>>>  /* target inode for block we're allocating */
>>>>>>  struct inode *inode;
>>>>>> @@ -1247,7 +1250,9 @@ struct ext4_inode_info {
>>>>>>  #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
>>>>>>  #define EXT4_MOUNT2_DAX_NEVER 0x00000020 /* Do not allow Direct Access */
>>>>>>  #define EXT4_MOUNT2_DAX_INODE 0x00000040 /* For printing options only */
>>>>>> -
>>>>>> +#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN 0x00000080 /* Optimize group
>>>>>> + * scanning in mballoc
>>>>>> + */
>>>>>>
>>>>>>  #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
>>>>>>  ~EXT4_MOUNT_##opt
>>>>>> @@ -1527,9 +1532,14 @@ struct ext4_sb_info {
>>>>>>  unsigned int s_mb_free_pending;
>>>>>>  struct list_head s_freed_data_list; /* List of blocks to be freed
>>>>>>  after commit completed */
>>>>>> + struct rb_root s_mb_avg_fragment_size_root;
>>>>>> + rwlock_t s_mb_rb_lock;
>>>>>> + struct list_head *s_mb_largest_free_orders;
>>>>>> + rwlock_t *s_mb_largest_free_orders_locks;
>>>>>>
>>>>>>  /* tunables */
>>>>>>  unsigned long s_stripe;
>>>>>> + unsigned int s_mb_linear_limit;
>>>>>>  unsigned int s_mb_stream_request;
>>>>>>  unsigned int s_mb_max_to_scan;
>>>>>>  unsigned int s_mb_min_to_scan;
>>>>>> @@ -1553,6 +1563,8 @@ struct ext4_sb_info {
>>>>>>  atomic_t s_bal_goals; /* goal hits */
>>>>>>  atomic_t s_bal_breaks; /* too long searches */
>>>>>>  atomic_t s_bal_2orders; /* 2^order hits */
>>>>>> + atomic_t s_bal_cr0_bad_suggestions;
>>>>>> + atomic_t s_bal_cr1_bad_suggestions;
>>>>>>  atomic64_t s_bal_cX_groups_considered[4];
>>>>>>  atomic64_t s_bal_cX_hits[4];
>>>>>>  atomic64_t s_bal_cX_failed[4]; /* cX loop didn't find blocks */
>>>>>> @@ -3309,11 +3321,14 @@ struct ext4_group_info {
>>>>>>  ext4_grpblk_t bb_free; /* total free blocks */
>>>>>>  ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
>>>>>>  ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */
>>>>>> + ext4_group_t bb_group; /* Group number */
>>>>>>  struct list_head bb_prealloc_list;
>>>>>>  #ifdef DOUBLE_CHECK
>>>>>>  void *bb_bitmap;
>>>>>>  #endif
>>>>>>  struct rw_semaphore alloc_sem;
>>>>>> + struct rb_node bb_avg_fragment_size_rb;
>>>>>> + struct list_head bb_largest_free_order_node;
>>>>>>  ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
>>>>>>  * regions, index is order.
>>>>>>  * bb_counters[3] = 5 means
>>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>>> index 15127d815461..cbf9a89c0ef5 100644
>>>>>> --- a/fs/ext4/mballoc.c
>>>>>> +++ b/fs/ext4/mballoc.c
>>>>>> @@ -127,11 +127,50 @@
>>>>>>  * smallest multiple of the stripe value (sbi->s_stripe) which is
>>>>>>  * greater than the default mb_group_prealloc.
>>>>>>  *
>>>>>> + * If "mb_optimize_scan" mount option is set, we maintain in memory group info
>>>>>> + * structures in two data structures:
>>>>>> + *
>>>>>> + * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
>>>>>> + *
>>>>>> + * Locking: sbi->s_mb_largest_free_orders_locks (array of rw locks)
>>>>>> + *
>>>>>> + * This is an array of lists where the index in the array represents the
>>>>>> + * largest free order in the buddy bitmap of the participating group infos of
>>>>>> + * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
>>>>>> + * number of buddy bitmap orders possible) number of lists. Group-infos are
>>>>>> + * placed in appropriate lists.
>>>>>> + *
>>>>>> + * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
>>>>>> + *
>>>>>> + * Locking: sbi->s_mb_rb_lock (rwlock)
>>>>>> + *
>>>>>> + * This is a red black tree consisting of group infos and the tree is sorted
>>>>>> + * by average fragment sizes (which is calculated as ext4_group_info->bb_free
>>>>>> + * / ext4_group_info->bb_fragments).
>>>>>> + *
>>>>>> + * When "mb_optimize_scan" mount option is set, mballoc consults the above data
>>>>>> + * structures to decide the order in which groups are to be traversed for
>>>>>> + * fulfilling an allocation request.
>>>>>> + *
>>>>>> + * At CR = 0, we look for groups which have the largest_free_order >= the order
>>>>>> + * of the request. We directly look at the largest free order list in the data
>>>>>> + * structure (1) above where largest_free_order = order of the request. If that
>>>>>> + * list is empty, we look at the remaining lists in the increasing order of
>>>>>> + * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
>>>>>> + *
>>>>>> + * At CR = 1, we only consider groups where average fragment size > request
>>>>>> + * size. So, we lookup a group which has average fragment size just above or
>>>>>> + * equal to request size using our rb tree (data structure 2) in O(log N) time.
>>>>>> + *
>>>>>> + * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
>>>>>> + * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
>>>>>> + *
>>>>>>  * The regular allocator (using the buddy cache) supports a few tunables.
>>>>>>  *
>>>>>>  * /sys/fs/ext4/<partition>/mb_min_to_scan
>>>>>>  * /sys/fs/ext4/<partition>/mb_max_to_scan
>>>>>>  * /sys/fs/ext4/<partition>/mb_order2_req
>>>>>> + * /sys/fs/ext4/<partition>/mb_linear_limit
>>>>>>  *
>>>>>>  * The regular allocator uses buddy scan only if the request len is power of
>>>>>>  * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
>>>>>> @@ -149,6 +188,16 @@
>>>>>>  * can be used for allocation. ext4_mb_good_group explains how the groups are
>>>>>>  * checked.
>>>>>>  *
>>>>>> + * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
>>>>>> + * get traversed linearly. That may result in subsequent allocations being not
>>>>>> + * close to each other. And so, the underlying device may get filled up in a
>>>>>> + * non-linear fashion. While that may not matter on non-rotational devices, for
>>>>>> + * rotational devices that may result in higher seek times. "mb_linear_limit"
>>>>>> + * tells mballoc how many groups mballoc should search linearly before
>>>>>> + * consulting the above data structures for more efficient lookups. For
>>>>>> + * non-rotational devices, this value defaults to 0 and for rotational devices
>>>>>> + * this is set to MB_DEFAULT_LINEAR_LIMIT.
>>>>>> + *
>>>>>>  * Both the prealloc space are getting populated as above. So for the first
>>>>>>  * request we will hit the buddy cache which will result in this prealloc
>>>>>>  * space getting filled. The prealloc space is then later used for the
>>>>>> @@ -299,6 +348,8 @@
>>>>>>  * - bitlock on a group (group)
>>>>>>  * - object (inode/locality) (object)
>>>>>>  * - per-pa lock (pa)
>>>>>> + * - cr0 lists lock (cr0)
>>>>>> + * - cr1 tree lock (cr1)
>>>>>>  *
>>>>>>  * Paths:
>>>>>>  * - new pa
>>>>>> @@ -328,6 +379,9 @@
>>>>>>  * group
>>>>>>  * object
>>>>>>  *
>>>>>> + * - allocation path (ext4_mb_regular_allocator)
>>>>>> + * group
>>>>>> + * cr0/cr1
>>>>>>  */
>>>>>>  static struct kmem_cache *ext4_pspace_cachep;
>>>>>>  static struct kmem_cache *ext4_ac_cachep;
>>>>>> @@ -351,6 +405,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
>>>>>>  ext4_group_t group);
>>>>>>  static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
>>>>>>
>>>>>> +static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
>>>>>> + ext4_group_t group, int cr);
>>>>>> +
>>>>>>  /*
>>>>>>  * The algorithm using this percpu seq counter goes below:
>>>>>>  * 1. We sample the percpu discard_pa_seq counter before trying for block
>>>>>> @@ -744,6 +801,251 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>>>>>>  }
>>>>>>  }
>>>>>>
>>>>>> +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
>>>>>> + int (*cmp)(struct rb_node *, struct rb_node *))
>>>>>> +{
>>>>>> + struct rb_node **iter = &root->rb_node, *parent = NULL;
>>>>>> +
>>>>>> + while (*iter) {
>>>>>> + parent = *iter;
>>>>>> + if (cmp(new, *iter) > 0)
>>>>>> + iter = &((*iter)->rb_left);
>>>>>> + else
>>>>>> + iter = &((*iter)->rb_right);
>>>>>> + }
>>>>>> +
>>>>>> + rb_link_node(new, parent, iter);
>>>>>> + rb_insert_color(new, root);
>>>>>> +}
>>>>>> +
>>>>>> +static int
>>>>>> +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
>>>>>> +{
>>>>>> + struct ext4_group_info *grp1 = rb_entry(rb1,
>>>>>> + struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + struct ext4_group_info *grp2 = rb_entry(rb2,
>>>>>> + struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + int num_frags_1, num_frags_2;
>>>>>> +
>>>>>> + num_frags_1 = grp1->bb_fragments ?
>>>>>> + grp1->bb_free / grp1->bb_fragments : 0;
>>>>>> + num_frags_2 = grp2->bb_fragments ?
>>>>>> + grp2->bb_free / grp2->bb_fragments : 0;
>>>>>> +
>>>>>> + return (num_frags_2 - num_frags_1);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Reinsert grpinfo into the avg_fragment_size tree with new average
>>>>>> + * fragment size.
>>>>>> + */
>>>>>> +static void
>>>>>> +mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
>>>>>> +{
>>>>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>>>> +
>>>>>> + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
>>>>>> + return;
>>>>>> +
>>>>>> + write_lock(&sbi->s_mb_rb_lock);
>>>>>> + if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
>>>>>> + rb_erase(&grp->bb_avg_fragment_size_rb,
>>>>>> + &sbi->s_mb_avg_fragment_size_root);
>>>>>> + RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
>>>>>> + }
>>>>>> +
>>>>>> + ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
>>>>>> + &grp->bb_avg_fragment_size_rb,
>>>>>> + ext4_mb_avg_fragment_size_cmp);
>>>>>> + write_unlock(&sbi->s_mb_rb_lock);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Choose next group by traversing largest_free_order lists. Return 0 if next
>>>>>> + * group was selected optimally. Return 1 if next group was not selected
>>>>>> + * optimally. Updates *new_cr if cr level needs an update.
>>>>>> + */
>>>>>> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
>>>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>>>> +{
>>>>>> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>>>>>> + struct ext4_group_info *iter, *grp;
>>>>>> + int i;
>>>>>> +
>>>>>> + if (ac->ac_status == AC_STATUS_FOUND)
>>>>>> + return 1;
>>>>>> +
>>>>>> + if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
>>>>>> + atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
>>>>>> +
>>>>>> + grp = NULL;
>>>>>> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
>>>>>> + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
>>>>>> + continue;
>>>>>> + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>>>> + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
>>>>>> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>>>> + continue;
>>>>>> + }
>>>>>> + grp = NULL;
>>>>>> + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
>>>>>> + bb_largest_free_order_node) {
>>>>>> + if (sbi->s_mb_stats)
>>>>>> + atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
>>>>>> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
>>>>>> + grp = iter;
>>>>>> + break;
>>>>>> + }
>>>>>> + }
>>>>>> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
>>>>>> + if (grp)
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + if (!grp) {
>>>>>> + /* Increment cr and search again */
>>>>>> + *new_cr = 1;
>>>>>> + } else {
>>>>>> + *group = grp->bb_group;
>>>>>> + ac->ac_last_optimal_group = *group;
>>>>>> + ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
>>>>>> + }
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Choose next group by traversing average fragment size tree. Return 0 if next
>>>>>> + * group was selected optimally. Return 1 if next group could not be selected
>>>>>> + * optimally (due to lock contention). Updates *new_cr if cr level needs an
>>>>>> + * update.
>>>>>> + */
>>>>>> +static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
>>>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>>>> +{
>>>>>> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>>>>>> + int avg_fragment_size, best_so_far;
>>>>>> + struct rb_node *node, *found;
>>>>>> + struct ext4_group_info *grp;
>>>>>> +
>>>>>> + /*
>>>>>> + * If there is contention on the lock, instead of waiting for the lock
>>>>>> + * to become available, just continue searching linearly. We'll resume
>>>>>> + * our rb tree search later starting at ac->ac_last_optimal_group.
>>>>>> + */
>>>>>> + if (!read_trylock(&sbi->s_mb_rb_lock))
>>>>>> + return 1;
>>>>>> +
>>>>>> + if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
>>>>>> + if (sbi->s_mb_stats)
>>>>>> + atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
>>>>>> + /* We have found something at CR 1 in the past */
>>>>>> + grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
>>>>>> + for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
>>>>>> + found = rb_next(found)) {
>>>>>> + grp = rb_entry(found, struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + if (sbi->s_mb_stats)
>>>>>> + atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
>>>>>> + if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + goto done;
>>>>>> + }
>>>>>> +
>>>>>> + node = sbi->s_mb_avg_fragment_size_root.rb_node;
>>>>>> + best_so_far = 0;
>>>>>> + found = NULL;
>>>>>> +
>>>>>> + while (node) {
>>>>>> + grp = rb_entry(node, struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + avg_fragment_size = 0;
>>>>>> + /*
>>>>>> + * Perform this check without locking, we'll lock later to confirm.
>>>>>> + */
>>>>>> + if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
>>>>>> + avg_fragment_size = grp->bb_fragments ?
>>>>>> + grp->bb_free / grp->bb_fragments : 0;
>>>>>> + if (!best_so_far || avg_fragment_size < best_so_far) {
>>>>>> + best_so_far = avg_fragment_size;
>>>>>> + found = node;
>>>>>> + }
>>>>>> + }
>>>>>> + if (avg_fragment_size > ac->ac_g_ex.fe_len)
>>>>>> + node = node->rb_right;
>>>>>> + else
>>>>>> + node = node->rb_left;
>>>>>> + }
>>>>>> +
>>>>>> +done:
>>>>>> + if (found) {
>>>>>> + grp = rb_entry(found, struct ext4_group_info,
>>>>>> + bb_avg_fragment_size_rb);
>>>>>> + *group = grp->bb_group;
>>>>>> + ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
>>>>>> + } else {
>>>>>> + *new_cr = 2;
>>>>>> + }
>>>>>> +
>>>>>> + read_unlock(&sbi->s_mb_rb_lock);
>>>>>> + ac->ac_last_optimal_group = *group;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * ext4_mb_choose_next_group: choose next group for allocation.
>>>>>> + *
>>>>>> + * @ac Allocation Context
>>>>>> + * @new_cr This is an output parameter. If there is no good group
>>>>>> + * available at current CR level, this field is updated to indicate
>>>>>> + * the new cr level that should be used.
>>>>>> + * @group This is an input / output parameter. As an input it indicates the
>>>>>> + * last group used for allocation. As output, this field indicates
>>>>>> + * the next group that should be used.
>>>>>> + * @ngroups Total number of groups
>>>>>> + */
>>>>>> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
>>>>>> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
>>>>>> +{
>>>>>> + int ret;
>>>>>> +
>>>>>> + *new_cr = ac->ac_criteria;
>>>>>> +
>>>>>> + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
>>>>>> + *new_cr >= 2 ||
>>>>>> + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
>>>>>> + goto inc_and_return;
>>>>>> +
>>>>>> + if (ac->ac_groups_linear_remaining) {
>>>>>> + ac->ac_groups_linear_remaining--;
>>>>>> + goto inc_and_return;
>>>>>> + }
>>>>>> +
>>>>>> + if (*new_cr == 0) {
>>>>>> + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
>>>>>> + if (ret)
>>>>>> + goto inc_and_return;
>>>>>> + }
>>>>>> + if (*new_cr == 1) {
>>>>>> + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
>>>>>> + if (ret)
>>>>>> + goto inc_and_return;
>>>>>> + }
>>>>>> + return;
>>>>>> +
>>>>>> +inc_and_return:
>>>>>> + /*
>>>>>> + * Artificially restricted ngroups for non-extent
>>>>>> + * files makes group > ngroups possible on first loop.
>>>>>> + */
>>>>>> + *group = *group + 1;
>>>>>> + if (*group >= ngroups)
>>>>>> + *group = 0;
>>>>>> +}
>>>>>> +
>>>>>>  /*
>>>>>>  * Cache the order of the largest free extent we have available in this block
>>>>>>  * group.
>>>>>> @@ -751,18 +1053,33 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>>>>>>  static void
>>>>>>  mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>>>>>>  {
>>>>>> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>>>>>>  int i;
>>>>>> - int bits;
>>>>>>
>>>>>> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
>>>>>> + write_lock(&sbi->s_mb_largest_free_orders_locks[
>>>>>> + grp->bb_largest_free_order]);
>>>>>> + list_del_init(&grp->bb_largest_free_order_node);
>>>>>> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
>>>>>> + grp->bb_largest_free_order]);
>>>>>> + }
>>>>>>  grp->bb_largest_free_order = -1; /* uninit */
>>>>>>
>>>>>> - bits = MB_NUM_ORDERS(sb) - 1;
>>>>>> - for (i = bits; i >= 0; i--) {
>>>>>> + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
>>>>>>  if (grp->bb_counters[i] > 0) {
>>>>>>  grp->bb_largest_free_order = i;
>>>>>>  break;
>>>>>>  }
>>>>>>  }
>>>>>> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
>>>>>> + grp->bb_largest_free_order >= 0 && grp->bb_free) {
>>>>>> + write_lock(&sbi->s_mb_largest_free_orders_locks[
>>>>>> + grp->bb_largest_free_order]);
>>>>>> + list_add_tail(&grp->bb_largest_free_order_node,
>>>>>> + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
>>>>>> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
>>>>>> + grp->bb_largest_free_order]);
>>>>>> + }
>>>>>>  }
>>>>>>
>>>>>>  static noinline_for_stack
>>>>>> @@ -818,6 +1135,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>>>>>>  period = get_cycles() - period;
>>>>>>  atomic_inc(&sbi->s_mb_buddies_generated);
>>>>>>  atomic64_add(period, &sbi->s_mb_generation_time);
>>>>>> + mb_update_avg_fragment_size(sb, grp);
>>>>>>  }
>>>>>>
>>>>>>  /* The buddy information is attached the buddy cache inode
>>>>>> @@ -1517,6 +1835,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
>>>>>>
>>>>>>  done:
>>>>>>  mb_set_largest_free_order(sb, e4b->bd_info);
>>>>>> + mb_update_avg_fragment_size(sb, e4b->bd_info);
>>>>>>  mb_check_buddy(e4b);
>>>>>>  }
>>>>>>
>>>>>> @@ -1653,6 +1972,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>>>>>>  }
>>>>>>  mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
>>>>>>
>>>>>> + mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
>>>>>>  ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
>>>>>>  mb_check_buddy(e4b);
>>>>>>
>>>>>> @@ -2347,17 +2667,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>>>  * from the goal value specified
>>>>>>  */
>>>>>>  group = ac->ac_g_ex.fe_group;
>>>>>> + ac->ac_last_optimal_group = group;
>>>>>> + ac->ac_groups_linear_remaining = sbi->s_mb_linear_limit;
>>>>>>  prefetch_grp = group;
>>>>>>
>>>>>> - for (i = 0; i < ngroups; group++, i++) {
>>>>>> - int ret = 0;
>>>>>> + for (i = 0; i < ngroups; i++) {
>>>>>> + int ret = 0, new_cr;
>>>>>> +
>>>>>>  cond_resched();
>>>>>> - /*
>>>>>> - * Artificially restricted ngroups for non-extent
>>>>>> - * files makes group > ngroups possible on first loop.
>>>>>> - */
>>>>>> - if (group >= ngroups)
>>>>>> - group = 0;
>>>>>> +
>>>>>> + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
>>>>>> +
>>>>>> + if (new_cr != cr) {
>>>>>> + cr = new_cr;
>>>>>> + goto repeat;
>>>>>> + }
>>>>>>
>>>>>>  /*
>>>>>>  * Batch reads of the block allocation bitmaps
>>>>>> @@ -2578,6 +2902,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
>>>>>>  atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
>>>>>>  seq_printf(seq, "\t\tuseless_loops: %llu\n",
>>>>>>  atomic64_read(&sbi->s_bal_cX_failed[0]));
>>>>>> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
>>>>>> + atomic_read(&sbi->s_bal_cr0_bad_suggestions));
>>>>>>
>>>>>>  seq_puts(seq, "\tcr1_stats:\n");
>>>>>>  seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
>>>>>> @@ -2585,6 +2911,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
>>>>>>  atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
>>>>>>  seq_printf(seq, "\t\tuseless_loops: %llu\n",
>>>>>>  atomic64_read(&sbi->s_bal_cX_failed[1]));
>>>>>> + seq_printf(seq, "\t\tbad_suggestions: %u\n",
>>>>>> + atomic_read(&sbi->s_bal_cr1_bad_suggestions));
>>>>>>
>>>>>>  seq_puts(seq, "\tcr2_stats:\n");
>>>>>>  seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
>>>>>> @@ -2719,7 +3047,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
>>>>>>  INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
>>>>>>  init_rwsem(&meta_group_info[i]->alloc_sem);
>>>>>>  meta_group_info[i]->bb_free_root = RB_ROOT;
>>>>>> + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
>>>>>> + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
>>>>>>  meta_group_info[i]->bb_largest_free_order = -1; /* uninit */
>>>>>> + meta_group_info[i]->bb_group = group;
>>>>>>
>>>>>>  mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
>>>>>>  return 0;
>>>>>> @@ -2909,6 +3240,26 @@ int ext4_mb_init(struct super_block *sb)
>>>>>>  i++;
>>>>>>  } while (i < MB_NUM_ORDERS(sb));
>>>>>>
>>>>>> + sbi->s_mb_avg_fragment_size_root = RB_ROOT;
>>>>>> + sbi->s_mb_largest_free_orders =
>>>>>> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
>>>>>> + GFP_KERNEL);
>>>>>> + if (!sbi->s_mb_largest_free_orders) {
>>>>>> + ret = -ENOMEM;
>>>>>> + goto out;
>>>>>> + }
>>>>>> + sbi->s_mb_largest_free_orders_locks =
>>>>>> + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
>>>>>> + GFP_KERNEL);
>>>>>> + if (!sbi->s_mb_largest_free_orders_locks) {
>>>>>> + ret = -ENOMEM;
>>>>>> + goto out;
>>>>>> + }
>>>>>> + for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
>>>>>> + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
>>>>>> + rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
>>>>>> + }
>>>>>> + rwlock_init(&sbi->s_mb_rb_lock);
>>>>>>
>>>>>>  spin_lock_init(&sbi->s_md_lock);
>>>>>>  sbi->s_mb_free_pending = 0;
>>>>>> @@ -2961,6 +3312,10 @@ int ext4_mb_init(struct super_block *sb)
>>>>>>  spin_lock_init(&lg->lg_prealloc_lock);
>>>>>>  }
>>>>>>
>>>>>> + if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
>>>>>> + sbi->s_mb_linear_limit = 0;
>>>>>> + else
>>>>>> + sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
>>>>>>  /* init file for buddy data */
>>>>>>  ret = ext4_mb_init_backend(sb);
>>>>>>  if (ret != 0)
>>>>>> @@ -2972,6 +3327,8 @@ int ext4_mb_init(struct super_block *sb)
>>>>>>  free_percpu(sbi->s_locality_groups);
>>>>>>  sbi->s_locality_groups = NULL;
>>>>>>  out:
>>>>>> + kfree(sbi->s_mb_largest_free_orders);
>>>>>> + kfree(sbi->s_mb_largest_free_orders_locks);
>>>>>>  kfree(sbi->s_mb_offsets);
>>>>>>  sbi->s_mb_offsets = NULL;
>>>>>>  kfree(sbi->s_mb_maxs);
>>>>>> @@ -3028,6 +3385,8 @@ int ext4_mb_release(struct super_block *sb)
>>>>>>  kvfree(group_info);
>>>>>>  rcu_read_unlock();
>>>>>>  }
>>>>>> + kfree(sbi->s_mb_largest_free_orders);
>>>>>> + kfree(sbi->s_mb_largest_free_orders_locks);
>>>>>>  kfree(sbi->s_mb_offsets);
>>>>>>  kfree(sbi->s_mb_maxs);
>>>>>>  iput(sbi->s_buddy_cache);
>>>>>> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
>>>>>> index 68111a10cfee..02585e3cbcad 100644
>>>>>> --- a/fs/ext4/mballoc.h
>>>>>> +++ b/fs/ext4/mballoc.h
>>>>>> @@ -78,6 +78,18 @@
>>>>>>  */
>>>>>>  #define MB_DEFAULT_MAX_INODE_PREALLOC 512
>>>>>>
>>>>>> +/*
>>>>>> + * Number of groups to search linearly before performing group scanning
>>>>>> + * optimization.
>>>>>> + */
>>>>>> +#define MB_DEFAULT_LINEAR_LIMIT 4
>>>>>> +
>>>>>> +/*
>>>>>> + * Minimum number of groups that should be present in the file system to perform
>>>>>> + * group scanning optimizations.
>>>>>> + */
>>>>>> +#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD 16
>>>>>> +
>>>>>>  /*
>>>>>>  * Number of valid buddy orders
>>>>>>  */
>>>>>> @@ -166,11 +178,14 @@ struct ext4_allocation_context {
>>>>>>  /* copy of the best found extent taken before preallocation efforts */
>>>>>>  struct ext4_free_extent ac_f_ex;
>>>>>>
>>>>>> + ext4_group_t ac_last_optimal_group;
>>>>>> + __u32 ac_groups_considered;
>>>>>> + __u32 ac_flags; /* allocation hints */
>>>>>>  __u16 ac_groups_scanned;
>>>>>> + __u16 ac_groups_linear_remaining;
>>>>>>  __u16 ac_found;
>>>>>>  __u16 ac_tail;
>>>>>>  __u16 ac_buddy;
>>>>>> - __u16 ac_flags; /* allocation hints */
>>>>>>  __u8 ac_status;
>>>>>>  __u8 ac_criteria;
>>>>>>  __u8 ac_2order; /* if request is to allocate 2^N blocks and
>>>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>>>> index 3a2cd9fb7e73..6116640081c0 100644
>>>>>> --- a/fs/ext4/super.c
>>>>>> +++ b/fs/ext4/super.c
>>>>>> @@ -1687,7 +1687,7 @@ enum {
>>>>>>  Opt_dioread_nolock, Opt_dioread_lock,
>>>>>>  Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
>>>>>>  Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
>>>>>> - Opt_prefetch_block_bitmaps,
>>>>>> + Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
>>>>>>  #ifdef CONFIG_EXT4_DEBUG
>>>>>>  Opt_fc_debug_max_replay, Opt_fc_debug_force
>>>>>>  #endif
>>>>>> @@ -1788,6 +1788,7 @@ static const match_table_t tokens = {
>>>>>>  {Opt_nombcache, "nombcache"},
>>>>>>  {Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
>>>>>>  {Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
>>>>>> + {Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
>>>>>>  {Opt_removed, "check=none"}, /* mount option from ext2/3 */
>>>>>>  {Opt_removed, "nocheck"}, /* mount option from ext2/3 */
>>>>>>  {Opt_removed, "reservation"}, /* mount option from ext2/3 */
>>>>>> @@ -1820,6 +1821,8 @@ static ext4_fsblk_t get_sb_block(void **data)
>>>>>>  }
>>>>>>
>>>>>>  #define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
>>>>>> +#define DEFAULT_MB_OPTIMIZE_SCAN (-1)
>>>>>> +
>>>>>>  static const char deprecated_msg[] =
>>>>>>  "Mount option \"%s\" will be removed by %s\n"
>>>>>>  "Contact linux-ext4@xxxxxxxxxxxxxxx if you think we should keep it.\n";
>>>>>> @@ -2008,6 +2011,7 @@ static const struct mount_opts {
>>>>>>  {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
>>>>>>  {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
>>>>>>  MOPT_SET},
>>>>>> + {Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
>>>>>>  #ifdef CONFIG_EXT4_DEBUG
>>>>>>  {Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
>>>>>>  MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
>>>>>> @@ -2092,6 +2096,7 @@ static int ext4_set_test_dummy_encryption(struct super_block *sb,
>>>>>>  struct ext4_parsed_options {
>>>>>>  unsigned long journal_devnum;
>>>>>>  unsigned int journal_ioprio;
>>>>>> + int mb_optimize_scan;
>>>>>>  };
>>>>>>
>>>>>>  static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>>>>>> @@ -2388,6 +2393,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>>>>>>  sbi->s_mount_opt |= m->mount_opt;
>>>>>>  } else if (token == Opt_data_err_ignore) {
>>>>>>  sbi->s_mount_opt &= ~m->mount_opt;
>>>>>> + } else if (token == Opt_mb_optimize_scan) {
>>>>>> + if (arg != 0 && arg != 1) {
>>>>>> + ext4_msg(sb, KERN_WARNING,
>>>>>> + "mb_optimize_scan should be set to 0 or 1.");
>>>>>> + return -1;
>>>>>> + }
>>>>>> + parsed_opts->mb_optimize_scan = arg;
>>>>>>  } else {
>>>>>>  if (!args->from)
>>>>>>  arg = 1;
>>>>>> @@ -4034,6 +4046,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>>>>  /* Set defaults for the variables that will be set during parsing */
>>>>>>  parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
>>>>>>  parsed_opts.journal_devnum = 0;
>>>>>> + parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
>>>>>>
>>>>>>  if ((data && !orig_data) || !sbi)
>>>>>>  goto out_free_base;
>>>>>> @@ -4984,6 +4997,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>>>>  ext4_fc_replay_cleanup(sb);
>>>>>>
>>>>>>  ext4_ext_init(sb);
>>>>>> +
>>>>>> + /*
>>>>>> + * Enable optimize_scan if number of groups is > threshold. This can be
>>>>>> + * turned off by passing "mb_optimize_scan=0". This can also be
>>>>>> + * turned on forcefully by passing "mb_optimize_scan=1".
>>>>>> + */
>>>>>> + if (parsed_opts.mb_optimize_scan == 1)
>>>>>> + set_opt2(sb, MB_OPTIMIZE_SCAN);
>>>>>> + else if (parsed_opts.mb_optimize_scan == 0)
>>>>>> + clear_opt2(sb, MB_OPTIMIZE_SCAN);
>>>>>> + else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
>>>>>> + set_opt2(sb, MB_OPTIMIZE_SCAN);
>>>>>> +
>>>>>>  err = ext4_mb_init(sb);
>>>>>>  if (err) {
>>>>>>  ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
>>>>>> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
>>>>>> index 59ca9d73b42f..16b8a838f631 100644
>>>>>> --- a/fs/ext4/sysfs.c
>>>>>> +++ b/fs/ext4/sysfs.c
>>>>>> @@ -213,6 +213,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
>>>>>>  EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
>>>>>>  EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
>>>>>>  EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
>>>>>> +EXT4_RW_ATTR_SBI_UI(mb_linear_limit, s_mb_linear_limit);
>>>>>>  EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
>>>>>>  EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
>>>>>>  EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
>>>>>> @@ -260,6 +261,7 @@ static struct attribute *ext4_attrs[] = {
>>>>>>  ATTR_LIST(mb_stream_req),
>>>>>>  ATTR_LIST(mb_group_prealloc),
>>>>>>  ATTR_LIST(mb_max_inode_prealloc),
>>>>>> + ATTR_LIST(mb_linear_limit),
>>>>>>  ATTR_LIST(max_writeback_mb_bump),
>>>>>>  ATTR_LIST(extent_max_zeroout_kb),
>>>>>>  ATTR_LIST(trigger_fs_error),
>>>>>> --
>>>>>> 2.31.0.291.g576ba9dcdaf-goog
>>>>>>
>>>>>
>>>>> Cheers, Andreas
>>>>>
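For readers following the thread without the full patch in front of them, the two lookup structures the commit message describes (per-order lists for cr 0, average-fragment-size ordering for cr 1) can be illustrated with a small user-space sketch. This is only an illustration under simplified assumptions, not the kernel implementation: all names and constants below are local to the sketch, cr 1 is shown as a linear scan instead of the patch's rb tree, and no locking is modeled.

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_ORDERS 14          /* stand-in for MB_NUM_ORDERS(sb) */
#define NUM_GROUPS 8

struct group {
	int group_no;
	int bb_free;               /* total free blocks */
	int bb_fragments;          /* number of free extents */
	int largest_free_order;    /* order of the largest free extent */
	struct group *next;        /* linkage within a per-order list */
};

static struct group *order_lists[NUM_ORDERS];

/* Insert a group at the head of the list for its largest free order. */
static void add_to_order_list(struct group *g)
{
	g->next = order_lists[g->largest_free_order];
	order_lists[g->largest_free_order] = g;
}

/*
 * "cr 0": walk the per-order lists starting at the order of the request;
 * the first non-empty list immediately yields a usable group.
 */
static struct group *pick_cr0(int request_order)
{
	for (int o = request_order; o < NUM_ORDERS; o++)
		if (order_lists[o])
			return order_lists[o];
	return NULL;
}

/*
 * "cr 1": among groups whose average fragment size (bb_free / bb_fragments)
 * is at least the request length, pick the smallest such average. The patch
 * keeps an rb tree for this; a linear scan is enough for the illustration.
 */
static struct group *pick_cr1(struct group *groups, int n, int request_len)
{
	struct group *best = NULL;

	for (int i = 0; i < n; i++) {
		int avg = groups[i].bb_fragments ?
			  groups[i].bb_free / groups[i].bb_fragments : 0;
		if (avg < request_len)
			continue;
		if (!best || avg < best->bb_free / best->bb_fragments)
			best = &groups[i];
	}
	return best;
}

int main(void)
{
	struct group groups[NUM_GROUPS];

	srand(42);
	for (int i = 0; i < NUM_GROUPS; i++) {
		groups[i].group_no = i;
		groups[i].bb_fragments = 1 + rand() % 16;
		groups[i].bb_free = groups[i].bb_fragments * (1 + rand() % 64);
		groups[i].largest_free_order = rand() % NUM_ORDERS;
		add_to_order_list(&groups[i]);
	}

	struct group *g0 = pick_cr0(5);                    /* request of order 5 */
	struct group *g1 = pick_cr1(groups, NUM_GROUPS, 32);

	printf("cr0 pick: group %d\n", g0 ? g0->group_no : -1);
	printf("cr1 pick: group %d\n", g1 ? g1->group_no : -1);
	return 0;
}
```

The point of the sketch is the cost model discussed above: the cr 0 lookup touches at most one list head per order, and the cr 1 search only has to consider groups whose average fragment size can satisfy the request, which is what lets the patch avoid the O(N) linear walk of all block groups.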