On Tue, Mar 19, 2024 at 06:05:53PM +0800, Baokun Li wrote: > On 2024/3/18 20:39, Ojaswin Mujoo wrote: > > On Thu, Mar 14, 2024 at 10:09:01PM +0800, Baokun Li wrote: > > > --- a/fs/ext4/mballoc.c > > > +++ b/fs/ext4/mballoc.c > > > @@ -831,6 +831,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len) > > > return 0; > > > if (order == MB_NUM_ORDERS(sb)) > > > order--; > > > + if (WARN_ON_ONCE(order > MB_NUM_ORDERS(sb))) > > > + order = MB_NUM_ORDERS(sb) - 1; > > Hey Baokun, > Hi Ojaswin, > > > > Thanks for fixing this. This patch looks good to me, feel free to add: > > > > Reviewed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> > Thanks for the review! > > my comments after this are less about the patch and more about some > > thoughts on the working of average fragment lists. > > > > So going through the v2 and this patch got me thinking about what really > > is going to happen when a user tries to allocate 32768 blocks which is also > > the maximum value we could have in say ac->ac_g_ex.fe_len. > > > > When this happens, ext4_mb_regular_allocator() will directly set the > > criteria as CR_GOAL_LEN_FAST. Now, we'll follow: > > > > ext4_mb_choose_next_group_goal_fast() > > for (i=mb_avg_fragment_size_order(); i < MB_NUM_ORDERS; i++) { .. } > > > > Here, mb_avg_fragment_siz_order() will do something like: > > > > order = fls(32768) - 2 = 14 > > ... > > if (order == MB_NUM_ORDERS(sb)) > > order--; > > > > return order; > > > > And we'll look in the fragment list[13] and since none of the groups > > there would have 32768 blocks free (since we dont track it here) we'll > > unnecessarily traverse the full list before falling to CR_BEST_AVAIL_LEN > > (this will become a noop due to the way order and min_order > > are calculated) and eventually to CR_GOAL_LEN_SLOW where we might get > > something or end up splitting. > That's not quite right, in ext4_mb_choose_next_group_goal_fast() even > though we're looking for the group with order 13, the group with 32768 > free blocks is also in there. So after passing ext4_mb_good_group() in > ext4_mb_find_good_group_avg_frag_lists(), we get a group with 32768 > free blocks. And in ext4_mb_choose_next_group_best_avail() we were Hey Baokun, So IIUC, a BG with 32768 blocks free will have bb_fragments = 0 and in mb_update_avg_fragment_size() we exit early if a BG has bb_fragments = 0 hence it won't show up in the order 13 list. > supposed to allocate blocks quickly by trim order, so it's necessary > here too. So there are no unnecessary loops here. > > But this will trigger the freshly added WARN_ON_ONCE, so in the > new iteration I need to change it to: > > if (WARN_ON_ONCE(order > MB_NUM_ORDERS(ac->ac_sb) + 1)) > order = MB_NUM_ORDERS(ac->ac_sb) - 1; > > In addition, when the block size is 4k, there are these limitations: > > 1) Limit the maximum size of the data allocation estimate to 8M in > ext4_mb_normalize_request(). > 2) #define MAX_WRITEPAGES_EXTENT_LEN 2048 > 3) #define DIO_MAX_BLOCKS 4096 > 4) Metadata is generally not allocated in many blocks at a time > > So it seems that only group_prealloc will allocate more than 2048 > blocks at a time. > > And I've tried removing those 8M/2048/4096 limits before, but the > performance of DIO write barely changed, and it doesn't look like > the performance bottleneck is here in the number of blocks allocated > at a time at the moment. Ohh that's interesting, on paper I think it does seem like it should improve the performance. I think if CR_GOAL_LEN_FAST can start including blocks which are completely empty, and lift those restrictions then we might see better performance. I'll try to play around a bit with this as well. Regards, ojaswin > > Thanks, > Baokun > > I think something more optimal would be to: > > > > 1. Add another entry to average fragment lists for completely empty > > groups. (As a sidenote i think we should use something like MB_NUM_FRAG_ORDER > > instead of MB_NUM_ORDERS in calculating limits related to average > > fragment lists since the NUM_ORDERS seems to be the buddy max order ie > > 8192 blocks only valid for CR_POWER2 and shouldn't really limit the > > fragment size lists) > > > > 2. If we don't want to go with 1 (maybe there's some history for that), > > then probably should exit early from CR_GOAL_LEN_FAST so that we don't > > iterate there. > > > > Would like to hear your thoughts on it Baokun, Jan. > > > > Regards, > > ojaswin > >