Hello Andreas,

Thank you for the feedback! I really wanted to send a new version of this
patch (with test results, but without the kernel decision-maker) this
evening, but you were faster.

> On 30 May 2019, at 19:56, Andreas Dilger <adilger@xxxxxxxxx> wrote:
>
> Artem, we discussed this patch on the Ext4 concall today.  A couple
> of items came up during discussion:
> - the patch submission should include performance results to
>   show that the patch is providing an improvement
> - it would be preferable if the thresholds for the stages were found
>   dynamically in the kernel based on how many groups have been skipped
>   and the free chunk size in each group
> - there would need to be some way to dynamically reset the scanning
>   level when lots of blocks have been freed
>
> Cheers, Andreas

My suggestion is to split this plan into two phases.

Phase 1 - the loop-skipping code and a user-space interface that gives the
administrator the ability to configure it.

Phase 2 - an in-kernel decision-maker based on group information (and some
other data). A purely illustrative sketch of what such a heuristic could
look like is at the very end of this mail.

Here are the test results I wanted to add to the new patch version; adding
them here for discussion. A rough recipe for reproducing the measurements
follows after the quoted patch.

During the test the file system was fragmented with the pattern "50 free
blocks - 50 occupied blocks". Write performance degraded from 1.2 GB/s to
10 MB/s:

68719476736 bytes (69 GB) copied, 6619.02 s, 10.4 MB/s

Let's exclude the c1 loops:

echo "60" > /sys/fs/ext4/md0/mb_c1_threshold

Excluding the c1 loops does not change performance, still 10 MB/s. The
statistics show that 981753 c1 loops were skipped, but 1664192 finished
without success:

mballoc: (7829, 1664192, 0) useless c(0,1,2) loops
mballoc: (981753, 0, 0) skipped c(0,1,2) loops

Then both the c1 and c2 loops were disabled:

echo "60" > /sys/fs/ext4/md0/mb_c1_threshold
echo "60" > /sys/fs/ext4/md0/mb_c2_threshold

mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (1425456, 1393743, 0) skipped c(0,1,2) loops

A lot of c1 and c2 loops were skipped. For the given fragmentation, write
performance returned to ~500 MB/s:

68719476736 bytes (69 GB) copied, 133.066 s, 516 MB/s

This is an example of how to improve performance for one specific
fragmentation pattern. The patch adds interfaces that allow the block
allocator to be tuned for any such situation.

Best regards,
Artem Blagodarenko.

>> On Mar 11, 2019, at 03:08, Artem Blagodarenko <artem.blagodarenko@xxxxxxxxx> wrote:
>>
>> Block allocator tries to find:
>> 1) group with the same range as required
>> 2) group with the same average range as required
>> 3) group with required amount of space
>> 4) any group
>>
>> For quite full disk step 1 is failed with higth
>> probability, but takes a lot of time.
>>
>> Skip 1st step if disk full > 75%
>> Skip 2d step if disk full > 85%
>> Skip 3d step if disk full > 95%
>>
>> This three tresholds can be adjusted through added interface.
>>
>> Signed-off-by: Artem Blagodarenko <c17828@xxxxxxxx>
>> ---
>>  fs/ext4/ext4.h    |  3 +++
>>  fs/ext4/mballoc.c | 32 ++++++++++++++++++++++++++++++++
>>  fs/ext4/mballoc.h |  3 +++
>>  fs/ext4/sysfs.c   |  6 ++++++
>>  4 files changed, 44 insertions(+)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 185a05d3257e..fbccb459a296 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1431,6 +1431,9 @@ struct ext4_sb_info {
>>  	unsigned int s_mb_min_to_scan;
>>  	unsigned int s_mb_stats;
>>  	unsigned int s_mb_order2_reqs;
>> +	unsigned int s_mb_c1_treshold;
>> +	unsigned int s_mb_c2_treshold;
>> +	unsigned int s_mb_c3_treshold;
>>  	unsigned int s_mb_group_prealloc;
>>  	unsigned int s_max_dir_size_kb;
>>  	/* where last allocation was done - for stream allocation */
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 4e6c36ff1d55..85f364aa96c9 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2096,6 +2096,20 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>>  	return 0;
>>  }
>>
>> +static u64 available_blocks_count(struct ext4_sb_info *sbi)
>> +{
>> +	ext4_fsblk_t resv_blocks;
>> +	u64 bfree;
>> +	struct ext4_super_block *es = sbi->s_es;
>> +
>> +	resv_blocks = EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters));
>> +	bfree = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) -
>> +		percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter);
>> +
>> +	bfree = EXT4_C2B(sbi, max_t(s64, bfree, 0));
>> +	return bfree - (ext4_r_blocks_count(es) + resv_blocks);
>> +}
>> +
>>  static noinline_for_stack int
>>  ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>  {
>> @@ -2104,10 +2118,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>  	int err = 0, first_err = 0;
>>  	struct ext4_sb_info *sbi;
>>  	struct super_block *sb;
>> +	struct ext4_super_block *es;
>>  	struct ext4_buddy e4b;
>> +	unsigned int free_rate;
>>
>>  	sb = ac->ac_sb;
>>  	sbi = EXT4_SB(sb);
>> +	es = sbi->s_es;
>>  	ngroups = ext4_get_groups_count(sb);
>>  	/* non-extent files are limited to low blocks/groups */
>>  	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
>> @@ -2157,6 +2174,18 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>
>>  	/* Let's just scan groups to find more-less suitable blocks */
>>  	cr = ac->ac_2order ? 0 : 1;
>> +
>> +	/* Choose what loop to pass based on disk fullness */
>> +	free_rate = available_blocks_count(sbi) * 100 / ext4_blocks_count(es);
>> +
>> +	if (free_rate < sbi->s_mb_c3_treshold) {
>> +		cr = 3;
>> +	} else if(free_rate < sbi->s_mb_c2_treshold) {
>> +		cr = 2;
>> +	} else if(free_rate < sbi->s_mb_c1_treshold) {
>> +		cr = 1;
>> +	}
>> +
>>  	/*
>>  	 * cr == 0 try to get exact allocation,
>>  	 * cr == 3 try to get anything
>> @@ -2618,6 +2647,9 @@ int ext4_mb_init(struct super_block *sb)
>>  	sbi->s_mb_stats = MB_DEFAULT_STATS;
>>  	sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
>>  	sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
>> +	sbi->s_mb_c1_treshold = MB_DEFAULT_C1_TRESHOLD;
>> +	sbi->s_mb_c2_treshold = MB_DEFAULT_C2_TRESHOLD;
>> +	sbi->s_mb_c3_treshold = MB_DEFAULT_C3_TRESHOLD;
>>  	/*
>>  	 * The default group preallocation is 512, which for 4k block
>>  	 * sizes translates to 2 megabytes. However for bigalloc file
>> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
>> index 88c98f17e3d9..d880923e55a5 100644
>> --- a/fs/ext4/mballoc.h
>> +++ b/fs/ext4/mballoc.h
>> @@ -71,6 +71,9 @@ do { \
>>   * for which requests use 2^N search using buddies
>>   */
>>  #define MB_DEFAULT_ORDER2_REQS		2
>> +#define MB_DEFAULT_C1_TRESHOLD		25
>> +#define MB_DEFAULT_C2_TRESHOLD		15
>> +#define MB_DEFAULT_C3_TRESHOLD		5
>>
>>  /*
>>   * default group prealloc size 512 blocks
>> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
>> index 9212a026a1f1..e4f1d98195c2 100644
>> --- a/fs/ext4/sysfs.c
>> +++ b/fs/ext4/sysfs.c
>> @@ -175,6 +175,9 @@ EXT4_RW_ATTR_SBI_UI(mb_stats, s_mb_stats);
>>  EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
>>  EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
>>  EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
>> +EXT4_RW_ATTR_SBI_UI(mb_c1_treshold, s_mb_c1_treshold);
>> +EXT4_RW_ATTR_SBI_UI(mb_c2_treshold, s_mb_c2_treshold);
>> +EXT4_RW_ATTR_SBI_UI(mb_c3_treshold, s_mb_c3_treshold);
>>  EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
>>  EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
>>  EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
>> @@ -203,6 +206,9 @@ static struct attribute *ext4_attrs[] = {
>>  	ATTR_LIST(mb_max_to_scan),
>>  	ATTR_LIST(mb_min_to_scan),
>>  	ATTR_LIST(mb_order2_req),
>> +	ATTR_LIST(mb_c1_treshold),
>> +	ATTR_LIST(mb_c2_treshold),
>> +	ATTR_LIST(mb_c3_treshold),
>>  	ATTR_LIST(mb_stream_req),
>>  	ATTR_LIST(mb_group_prealloc),
>>  	ATTR_LIST(max_writeback_mb_bump),
>> --
>> 2.14.3
>>
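
P.S. A rough recipe for reproducing the throughput numbers above. The sysfs
writes are exactly the ones quoted in the mail; the mount point and the exact
dd invocation are only my approximation (any sequential writer producing the
same 68719476736 bytes will do):

# skip the cr=0 and cr=1 passes whenever less than 60% of the blocks are
# free (the second configuration used above)
echo "60" > /sys/fs/ext4/md0/mb_c1_threshold
echo "60" > /sys/fs/ext4/md0/mb_c2_threshold

# sequential 64 GiB write to a hypothetical mount point; dd reports this
# as "68719476736 bytes (69 GB) copied"
dd if=/dev/zero of=/mnt/ext4/dd.out bs=1M count=65536 conv=fsync

For scale, with the defaults from the patch (25/15/5) a file system that is
90% full has free_rate = 10, so 10 < 15 means the scan starts directly at
cr = 2; only below 5% free does it fall through to cr = 3, and above 25% free
the behaviour is unchanged.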
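
P.P.S. To make the phase 2 discussion a bit more concrete, here is a purely
illustrative, untested sketch (not part of this patch) of the kind of
in-kernel decision-maker Andreas described: relax the starting criteria based
on how many groups could not possibly serve the request, instead of relying
only on the static fullness thresholds. The helper name ext4_mb_dynamic_cr
and the 90% cut-off are made up, and scanning every group on each allocation
is obviously too expensive in practice; this only shows the shape of the idea.

/*
 * Illustrative only, not part of this patch.  Count how many groups cannot
 * serve a 2^order request (based on the buddy information already cached in
 * ext4_group_info) and relax the criteria when that is almost all of them.
 */
static int ext4_mb_dynamic_cr(struct super_block *sb, int order, int cr)
{
	ext4_group_t group, unusable = 0;
	ext4_group_t ngroups = ext4_get_groups_count(sb);

	for (group = 0; group < ngroups; group++) {
		struct ext4_group_info *grp = ext4_get_group_info(sb, group);

		/*
		 * bb_largest_free_order is only meaningful once the buddy
		 * has been initialized; a real implementation would have
		 * to account for that.
		 */
		if (grp->bb_largest_free_order < order)
			unusable++;
	}

	/*
	 * Made-up rule: if more than 90% of the groups cannot serve a
	 * 2^order chunk, skip the cr=0/cr=1 passes entirely.
	 */
	if (cr < 2 && (u64)unusable * 10 > (u64)ngroups * 9)
		cr = 2;
	return cr;
}

Something along these lines could also be re-evaluated (and the level reset)
when a large number of blocks is freed, which was the third point raised on
the concall.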