On Wed, Mar 20, 2024 at 09:30:57AM +0800, Baokun Li wrote: > On 2024/3/20 2:25, Ojaswin Mujoo wrote: > > On Tue, Mar 19, 2024 at 06:05:53PM +0800, Baokun Li wrote: > > > On 2024/3/18 20:39, Ojaswin Mujoo wrote: > > > > On Thu, Mar 14, 2024 at 10:09:01PM +0800, Baokun Li wrote: > > > > > --- a/fs/ext4/mballoc.c > > > > > +++ b/fs/ext4/mballoc.c > > > > > @@ -831,6 +831,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len) > > > > > return 0; > > > > > if (order == MB_NUM_ORDERS(sb)) > > > > > order--; > > > > > + if (WARN_ON_ONCE(order > MB_NUM_ORDERS(sb))) > > > > > + order = MB_NUM_ORDERS(sb) - 1; > > > > Hey Baokun, > > > Hi Ojaswin, > > > > Thanks for fixing this. This patch looks good to me, feel free to add: > > > > > > > > Reviewed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> > > > Thanks for the review! > > > > my comments after this are less about the patch and more about some > > > > thoughts on the working of average fragment lists. > > > > > > > > So going through the v2 and this patch got me thinking about what really > > > > is going to happen when a user tries to allocate 32768 blocks which is also > > > > the maximum value we could have in say ac->ac_g_ex.fe_len. > > > > > > > > When this happens, ext4_mb_regular_allocator() will directly set the > > > > criteria as CR_GOAL_LEN_FAST. Now, we'll follow: > > > > > > > > ext4_mb_choose_next_group_goal_fast() > > > > for (i=mb_avg_fragment_size_order(); i < MB_NUM_ORDERS; i++) { .. } > > > > > > > > Here, mb_avg_fragment_siz_order() will do something like: > > > > > > > > order = fls(32768) - 2 = 14 > > > > ... > > > > if (order == MB_NUM_ORDERS(sb)) > > > > order--; > > > > > > > > return order; > > > > > > > > And we'll look in the fragment list[13] and since none of the groups > > > > there would have 32768 blocks free (since we dont track it here) we'll > > > > unnecessarily traverse the full list before falling to CR_BEST_AVAIL_LEN > > > > (this will become a noop due to the way order and min_order > > > > are calculated) and eventually to CR_GOAL_LEN_SLOW where we might get > > > > something or end up splitting. > > > That's not quite right, in ext4_mb_choose_next_group_goal_fast() even > > > though we're looking for the group with order 13, the group with 32768 > > > free blocks is also in there. So after passing ext4_mb_good_group() in > > > ext4_mb_find_good_group_avg_frag_lists(), we get a group with 32768 > > > free blocks. And in ext4_mb_choose_next_group_best_avail() we were > > Hey Baokun, > > > > So IIUC, a BG with 32768 blocks free will have bb_fragments = 0 and in > > mb_update_avg_fragment_size() we exit early if a BG has bb_fragments = 0 > > hence it won't show up in the order 13 list. > Hello Ojaswin, > > This sounded strange, so I added the following debugging information: > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index c65fac9b8c72..c6ec423e2971 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -1212,6 +1212,7 @@ void ext4_mb_generate_buddy(struct super_block *sb, > i = mb_find_next_zero_bit(bitmap, max, i); > } > grp->bb_fragments = fragments; > + pr_err(">>> greoup: %u, bb_free: %d, bb_fragments: %d\n", > grp->bb_group, grp->bb_free, grp->bb_fragments); > > if (free != grp->bb_free) { > ext4_grp_locked_error(sb, group, 0, 0, > > > Then mount an ext4 image , wait for a moment , and got the > following printout: > > >>> greoup: 6, bb_free: 32768, bb_fragments: 1 > >>> greoup: 5, bb_free: 31741, bb_fragments: 1 > >>> greoup: 4, bb_free: 32768, bb_fragments: 1 > >>> greoup: 3, bb_free: 31741, bb_fragments: 1 > >>> greoup: 2, bb_free: 32768, bb_fragments: 1 > >>> greoup: 1, bb_free: 31741, bb_fragments: 1 > >>> greoup: 0, bb_free: 23511, bb_fragments: 1 Ahh right, the fragments would be 1 and not zero, my bad! Thanks for testing this. > > > supposed to allocate blocks quickly by trim order, so it's necessary > > > here too. So there are no unnecessary loops here. > > > > > > But this will trigger the freshly added WARN_ON_ONCE, so in the > > > new iteration I need to change it to: > > > > > > if (WARN_ON_ONCE(order > MB_NUM_ORDERS(ac->ac_sb) + 1)) > > > order = MB_NUM_ORDERS(ac->ac_sb) - 1; > > > > > > In addition, when the block size is 4k, there are these limitations: > > > > > > 1) Limit the maximum size of the data allocation estimate to 8M in > > > ext4_mb_normalize_request(). > > > 2) #define MAX_WRITEPAGES_EXTENT_LEN 2048 > > > 3) #define DIO_MAX_BLOCKS 4096 > > > 4) Metadata is generally not allocated in many blocks at a time > > > > > > So it seems that only group_prealloc will allocate more than 2048 > > > blocks at a time. > > > > > > And I've tried removing those 8M/2048/4096 limits before, but the > > > performance of DIO write barely changed, and it doesn't look like > > > the performance bottleneck is here in the number of blocks allocated > > > at a time at the moment. > > Ohh that's interesting, on paper I think it does seem like it should > > improve the performance. I think if CR_GOAL_LEN_FAST can start including > > blocks which are completely empty, and lift those restrictions then we > > might see better performance. I'll try to play around a bit with this as > > well. > > > > Regards, > > ojaswin > > > OK, waiting for your good news. :) > > -- > With Best Regards, > Baokun Li > .