On 21/03/24 04:19PM, Harshad Shirwadkar wrote:
> This patch series improves cr 0 and cr 1 passes of the allocator
> significantly. Currently, at cr 0 and 1, we perform linear lookups to
> find the matching groups. That's very inefficient for large file
> systems where there are millions of block groups. At cr 0, we only
> care about the groups that have the largest free order >= the
> request's order and at cr 1 we only care about groups where average
> fragment size > the request size. So, this patchset introduces new
> data structures that allow us to perform cr 0 lookup in constant time
> and cr 1 lookup in log (number of groups) time instead of linear.
>
> For cr 0, we add a list for each order and all the groups are enqueued
> to the appropriate list based on the largest free order in its buddy
> bitmap. This allows us to lookup a match at cr 0 in constant time.
>
> For cr 1, we add a new rb tree of groups sorted by largest fragment
> size. This allows us to lookup a match for cr 1 in log (num groups)
> time.
>
> These optimizations can be enabled by passing "mb_optimize_scan" mount
> option.
>
> These changes may result in allocations to be spread across the block
> device. While that would not matter for some block devices (such as
> flash) it may be a cause of concern for other block devices that
> benefit from storing related content together such as disk. However,
> it can be argued that in high fragmentation scenario, especially for
> large disks, it's still worth optimizing the scanning since in such
> cases, we get cpu bound on group scanning instead of getting IO bound.
> Perhaps, in future, we could dynamically turn this new optimization on
> based on fragmentation levels for such devices.
>
> Verified that there are no regressions in smoke tests (-g quick -c 4k).
>
> Also, to demonstrate the effectiveness for the patch series, following
> experiment was performed:
>
> Created a highly fragmented disk of size 65TB. The disk had no
> contiguous 2M regions.
> Following command was run consecutively for 3
> times:
>
>   time dd if=/dev/urandom of=file bs=2M count=10
>
> Here are the results with and without cr 0/1 optimizations:
>
> |---------+------------------------------+---------------------------|
> |         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
> |---------+------------------------------+---------------------------|
> | 1st run | 5m1.871s                     | 2m47.642s                 |
> | 2nd run | 2m28.390s                    | 0m0.611s                  |
> | 3rd run | 2m26.530s                    | 0m1.255s                  |
> |---------+------------------------------+---------------------------|
>
> The patch [3/6] "ext4: add mballoc stats proc file" is a modified
> version of the patch originally written by Artem Blagodarenko
> (artem.blagodarenko@xxxxxxxxx). With that patch, I ran following
> command with and without optimizations.
>
>   dd if=/dev/zero of=/mnt/file bs=2M count=2 conv=fsync
>
> Without optimizations:
>
>   useless_c0_loops: 3
>   useless_c1_loops: 39
>   useless_c2_loops: 0
>   useless_c3_loops: 0
>
> With optimizations:
>
>   useless_c0_loops: 0
>   useless_c1_loops: 0
>   useless_c2_loops: 0
>   useless_c3_loops: 0
>
> This shows that CR0 and CR1 optimizations get rid of useless CR0 and
> CR1 loops altogether thereby significantly reducing the number of
> groups that get considered.
>
> Changes from V4:
> ----------------
> - Only minor fixes, no significant changes
>
> Harshad Shirwadkar (6):
>   ext4: drop s_mb_bal_lock and convert protected fields to atomic
>   ext4: add ability to return parsed options from parse_options
>   ext4: add mballoc stats proc file
>   ext4: add MB_NUM_ORDERS macro
>   ext4: improve cr 0 / cr 1 group scanning
>   ext4: add proc files to monitor new structures
>
>  fs/ext4/ext4.h    |  30 ++-
>  fs/ext4/mballoc.c | 572 +++++++++++++++++++++++++++++++++++++++++++---
>  fs/ext4/mballoc.h |  22 +-
>  fs/ext4/super.c   |  79 +++++--
>  fs/ext4/sysfs.c   |   6 +
>  5 files changed, 652 insertions(+), 57 deletions(-)
>

Completed my review of this patch series.
Apart from the issue I mentioned in patch 5 of this v5 series, the rest
of the patches look fine to me.

Please feel free to add:
Reviewed-by: Ritesh Harjani <ritesh.list@xxxxxxxxx>
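
For anyone skimming the thread, here is a rough userspace sketch of the
two lookups the cover letter describes. It is not the code from the
series. All names (struct group, NUM_ORDERS, cr0_lookup, cr1_lookup) are
made up for illustration, and the cr 1 structure is a sorted array
searched with binary search standing in for the rb tree, assuming the
groups are keyed by average fragment size as in the first paragraph of
the cover letter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bound on buddy orders; stands in for MB_NUM_ORDERS. */
#define NUM_ORDERS 14

/* Toy model of a block group; field names are illustrative only. */
struct group {
	int id;
	int largest_free_order;      /* largest free order in the buddy bitmap */
	unsigned int avg_frag_size;  /* average free fragment size, in blocks */
	struct group *next;          /* link in the per-order list */
};

/* cr 0 structure: one singly linked list of groups per order. */
struct order_lists {
	struct group *head[NUM_ORDERS];
};

/* Enqueue a group on the list matching its largest free order. */
void cr0_enqueue(struct order_lists *ol, struct group *g)
{
	assert(g->largest_free_order >= 0 &&
	       g->largest_free_order < NUM_ORDERS);
	g->next = ol->head[g->largest_free_order];
	ol->head[g->largest_free_order] = g;
}

/*
 * cr 0 lookup: return a group whose largest free order is >= the
 * request's order. At most NUM_ORDERS list heads are inspected, so the
 * cost does not depend on the number of groups.
 */
struct group *cr0_lookup(const struct order_lists *ol, int order)
{
	for (int o = order; o >= 0 && o < NUM_ORDERS; o++)
		if (ol->head[o])
			return ol->head[o];
	return NULL;
}

/*
 * cr 1 lookup: 'sorted' holds groups in ascending avg_frag_size order.
 * Binary search finds the first group whose average fragment size is
 * strictly greater than the request length, in O(log n) steps.
 */
struct group *cr1_lookup(struct group **sorted, size_t n,
			 unsigned int request_len)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (sorted[mid]->avg_frag_size > request_len)
			hi = mid;
		else
			lo = mid + 1;
	}
	return lo < n ? sorted[lo] : NULL;
}
```

A caller would enqueue every group once with cr0_enqueue(), keep the
array sorted as free space changes, and fall back to the next criterion
when either lookup returns NULL. The real series has to deal with
locking and with re-sorting groups on every allocation and free, which
this sketch ignores.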