On Feb 26, 2021, at 12:36 PM, Harshad Shirwadkar <harshadshirwadkar@xxxxxxxxx> wrote:

> Instead of traversing through groups linearly, scan groups in specific
> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> largest free order >= the order of the request. So, with this patch,
> we maintain lists for each possible order and insert each group into a
> list based on the largest free order in its buddy bitmap. During cr 0
> allocation, we traverse these lists in the increasing order of largest
> free orders. This allows us to find a group with the best available cr
> 0 match in constant time. If nothing can be found, we fallback to cr 1
> immediately.

Thanks for the updated patch, I think it looks pretty good, with a few
suggestions.

> At CR1, the story is slightly different. We want to traverse in the
> order of increasing average fragment size. For CR1, we maintain a rb
> tree of groupinfos which is sorted by average fragment size. Instead
> of traversing linearly, at CR1, we traverse in the order of increasing
> average fragment size, starting at the most optimal group. This brings
> down cr 1 search complexity to log(num groups).

One thing that came to mind here is that *average* fragment size is not
necessarily a good heuristic for allocation. Consider a group with a mean
fragment size of 2MB, made up of equal parts 4MB and 4KB free blocks. If
looking for many 2MB allocations, this would quickly cause the 4MB chunks
to be split (pushing the average below 2MB) even when there are *exactly*
2MB free chunks available in that group.

Another alternative (short of having a full "free extents" tree like XFS
does) would be to keep the rbtree sorted by *maximum* fragment size. That
would not be any more expensive (one entry per group) and would guarantee
that at least one free extent closely matches the size of the extent
being allocated.

I _suspect_ that the cr1 allocations are mostly for smaller files/tails
that are not power-of-two sizes (which would normally be handled by cr0,
except in pathological test cases), so finding an exact match is the
right thing to do. In that case, the cr0 pass would handle most of the
file's data, and cr1 would handle smaller files or tails that are not 2^n
extents. Filling (nearly) exact fragments would be good for space
efficiency, and would avoid fragmenting larger extents when that is not
needed.
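To make that concrete, here is an untested sketch of the cr1 lookup I
have in mind. The bb_max_frag and bb_max_frag_node fields are
hypothetical names (the patch as posted keys the tree on average
fragment size, not maximum); the point is just that a best-fit search
on the largest free extent has the same O(log groups) cost:

/*
 * Assumes bb_max_frag caches the largest free extent in the group
 * (maintained from the buddy data) and bb_max_frag_node links the
 * group into an rbtree sorted by that value.
 */
static struct ext4_group_info *
ext4_mb_find_by_max_frag(struct rb_root *root, ext4_grpblk_t len)
{
	struct rb_node *node = root->rb_node;
	struct ext4_group_info *best = NULL, *grp;

	while (node) {
		grp = rb_entry(node, struct ext4_group_info,
			       bb_max_frag_node);
		if (grp->bb_max_frag >= len) {
			/* large enough, look left for a tighter fit */
			best = grp;
			node = node->rb_left;
		} else {
			node = node->rb_right;
		}
	}
	return best;
}

Keeping the tree sorted by the largest free extent is the same
one-node-per-group cost as sorting by the average, but the group that
comes back is guaranteed to contain a single extent that covers the
request.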
That said, I'm not confident enough about this to suggest that it has to
be changed before landing the patch, just an observation that came to
mind. A good workload simulator (or an actual application benchmark)
would be needed to assess the difference here, since a synthetic workload
(e.g. filling alternate 2MB chunks) is not reflective of how the
filesystem would be used in real life.

A few minor comments inline...

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 161412070fef..bcfd849bc61e 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
>
> +/*
> + * ext4_mb_choose_next_group: choose next group for allocation.
> + *
> + * @ac		Allocation Context
> + * @new_cr	This is an output parameter. If the there is no good group available
> + *		at current CR level, this field is updated to indicate the new cr
> + *		level that should be used.
> + * @group	This is an input / output parameter. As an input it indicates the last
> + *		group used for allocation. As output, this field indicates the
> + *		next group that should be used.

No reason why these comments can't be wrapped at 80 columns.

> @@ -2938,6 +3273,20 @@ int ext4_mb_init(struct super_block *sb)
>  		spin_lock_init(&lg->lg_prealloc_lock);
>  	}
>
> +	if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
> +		sbi->s_mb_linear_limit = 0;
> +	else
> +		sbi->s_mb_linear_limit = MB_DEFAULT_LINEAR_LIMIT;
> +#ifndef CONFIG_EXT4_DEBUG
> +	/*
> +	 * Disable mb_optimize scan if we don't have enough groups. If
> +	 * CONFIG_EXT4_DEBUG is set, we don't disable this MB_OPTIMIZE_SCAN even
> +	 * for small file systems. This allows us to test correctness on small
> +	 * file systems.
> +	 */
> +	if (ext4_get_groups_count(sb) < MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
> +		clear_opt2(sb, MB_OPTIMIZE_SCAN);
> +#endif

Making this a compile-time option makes it much harder to test. Having
this managed by the /proc mb_linear_scan tunable or a mount option would
be useful.
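Something along these lines (a sketch only, reusing the existing
EXT4_RW_ATTR_SBI_UI plumbing in fs/ext4/sysfs.c; the s_mb_scan_threshold
field and the ext4_mb_scan_optimized() helper are made-up names) would
let xfstests lower the threshold at runtime instead of requiring a
CONFIG_EXT4_DEBUG kernel:

/* fs/ext4/sysfs.c: expose the group-count threshold as a knob;
 * also add ATTR_LIST(mb_scan_threshold) to the ext4_attrs[] table */
EXT4_RW_ATTR_SBI_UI(mb_scan_threshold, s_mb_scan_threshold);

/* fs/ext4/mballoc.c: test the threshold at allocation time rather
 * than clearing MB_OPTIMIZE_SCAN once at mount time */
static inline int ext4_mb_scan_optimized(struct super_block *sb)
{
	return test_opt2(sb, MB_OPTIMIZE_SCAN) &&
	       ext4_get_groups_count(sb) >=
			EXT4_SB(sb)->s_mb_scan_threshold;
}

Then writing 0 to /sys/fs/ext4/<dev>/mb_scan_threshold enables the
optimized scan even on a tiny test filesystem, and the #ifdef goes
away entirely.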
Cheers,
Andreas