On Feb 16, 2021, at 12:39 PM, Благодаренко Артём <artem.blagodarenko@xxxxxxxxx> wrote: > Thanks for this useful optimisation. > > Some comments bellow. > >> On 9 Feb 2021, at 23:28, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote: >> >> Instead of traversing through groups linearly, scan groups in specific >> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the >> largest free order >= the order of the request. So, with this patch, >> we maintain lists for each possible order and insert each group into a >> list based on the largest free order in its buddy bitmap. During cr 0 >> allocation, we traverse these lists in the increasing order of largest >> free orders. This allows us to find a group with the best available cr >> 0 match in constant time. If nothing can be found, we fallback to cr 1 >> immediately. >> >> At CR1, the story is slightly different. We want to traverse in the >> order of increasing average fragment size. For CR1, we maintain a rb >> tree of groupinfos which is sorted by average fragment size. Instead >> of traversing linearly, at CR1, we traverse in the order of increasing >> average fragment size, starting at the most optimal group. This brings >> down cr 1 search complexity to log(num groups). >> >> For cr >= 2, we just perform the linear search as before. Also, in >> case of lock contention, we intermittently fallback to linear search >> even in CR 0 and CR 1 cases. This allows us to proceed during the >> allocation path even in case of high contention. >> >> There is an opportunity to do optimization at CR2 too. That's because >> at CR2 we only consider groups where bb_free counter (number of free >> blocks) is greater than the request extent size. That's left as future >> work. >> >> +static int >> +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2) >> +{ >> + struct ext4_group_info *grp1 = rb_entry(rb1, >> + struct ext4_group_info, >> + bb_avg_fragment_size_rb); >> + struct ext4_group_info *grp2 = rb_entry(rb2, >> + struct ext4_group_info, >> + bb_avg_fragment_size_rb); >> + int num_frags_1, num_frags_2; >> + >> + num_frags_1 = grp1->bb_fragments ? >> + grp1->bb_free / grp1->bb_fragments : 0; >> + num_frags_2 = grp2->bb_fragments ? >> + grp2->bb_free / grp2->bb_fragments : 0; >> + >> + return (num_frags_1 < num_frags_2); >> +} >> + >> +/* >> + * Reinsert grpinfo into the avg_fragment_size tree with new average >> + * fragment size. >> + */ > > Walk along the ngroups linked elements in worst case for every mb_free_blocks and mb_mark_used which are quite frequently executed actions. > If double-linked list is used for avg_fragments this function will make this change without iterating through the list: > 1. Check with previous element. If smaller, then commute > 2. Check with next element. If greater, then commute. I was wondering about the cost of the list/tree maintenance as well, especially since there was a post from "kernel test robot" that this patch introduced a performance regression. The tree insertion/removal overhead I think Artem's proposal above would improve, since it may be that a group will not move in the tree much? It would also make sense for totally full groups to be kept out of the rb tree entirely, since they do not provide any value in that case (the full groups will never be selected for allocations), and they just add to the tree depth and potentially cause an imbalance if there are many of them. That also has the benefit of the rbtree efficiency *improving* as the filesystem gets more full, which is right when it is most needed. It might also make sense to keep totally empty groups out of the rbtree, since they should always be found in cr0 already if the allocation is large enough to fill the whole group? Having a smaller rbtree makes every insertion/removal that much more efficient. Those groups will naturally be re-added into the rbtree when they have blocks freed or allocated, so not much added complexity. Does it make sense to disable "mb_optimize_scan" if filesystems are smaller than a certain threshold? Clearly, if there are only 1-2 groups, maintaining a list and rbtree has no real value, and with only a handful of groups (< 16?) linear searching is probably as fast or faster than maintaining the two data structures. That is similar to e.g. bubble sort vs. quicksort, where it is more efficient to sort a list of ~5-8 entries with a dumb/fast algorithm instead of a complex algorithm that is more efficient at larger scales. That would also (likely) quiet the kernel test robot, if we think that its testing is not representative of real-world usage. > On Feb 11, 2021, at 3:30 AM, Andreas Dilger <adilger@xxxxxxxxx> wrote: >> This function would be more efficient to do the list move under a single >> write lock if the order doesn't change. The order loop would just >> save the largest free order, then grab the write lock, do the list_del(), >> set bb_largest_free_order, and list_add_tail(): >> >> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp) >> { >> struct ext4_sb_info *sbi = EXT4_SB(sb); >> int i, new_order = -1; >> >> for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) { >> if (grp->bb_counters[i] > 0) { >> new_order = i; >> break; >> } >> } >> if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) { >> write_lock(&sbi->s_mb_largest_free_orders_locks[ >> grp->bb_largest_free_order]); >> list_del_init(&grp->bb_largest_free_order_node); >> >> if (new_order != grp->bb_largest_free_order) { >> write_unlock(&sbi->s_mb_largest_free_orders_locks[ >> grp->bb_largest_free_order]); >> grp->bb_largest_free_order = new_order; >> write_lock(&sbi->s_mb_largest_free_orders_locks[ >> grp->bb_largest_free_order]); >> } >> list_add_tail(&grp->bb_largest_free_order_node, >> &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]); >> write_unlock(&sbi->s_mb_largest_free_orders_locks[ >> grp->bb_largest_free_order]); >> } >> } In looking at my previous comment, I wonder if we could further reduce the list locking here by not moving an entry to the end of the *same* list if it is not currently at the head? Since it was (presumably) just moved to the end of the list by a recent allocation, it is very likely that some other group will be chosen from the list head, so moving within the list to maintain strict LRU is probably just extra locking overhead that can be avoided... Also, it isn't clear if *freeing* blocks from a group should move it to the end of the same list, or just leave it as-is? If there are more frees from the list it is likely to be added to a new list soon, and if there are no more frees, then it could stay in the same order. Cheers, Andreas
Attachment:
signature.asc
Description: Message signed with OpenPGP