Re: [PATCH v3 4/9] ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists()

Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> · Wed, 20 Mar 2024 10:49:41 +0530

On Wed, Mar 20, 2024 at 09:30:57AM +0800, Baokun Li wrote:
> On 2024/3/20 2:25, Ojaswin Mujoo wrote:
> > On Tue, Mar 19, 2024 at 06:05:53PM +0800, Baokun Li wrote:
> > > On 2024/3/18 20:39, Ojaswin Mujoo wrote:
> > > > On Thu, Mar 14, 2024 at 10:09:01PM +0800, Baokun Li wrote:
> > > > > --- a/fs/ext4/mballoc.c
> > > > > +++ b/fs/ext4/mballoc.c
> > > > > @@ -831,6 +831,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
> > > > >       return 0;
> > > > >     if (order == MB_NUM_ORDERS(sb))
> > > > >       order--;
> > > > > + if (WARN_ON_ONCE(order > MB_NUM_ORDERS(sb)))
> > > > > +   order = MB_NUM_ORDERS(sb) - 1;
> > > > Hey Baokun,
> > > Hi Ojaswin,
> > > > Thanks for fixing this. This patch looks good to me, feel free to add:
> > > > 
> > > > Reviewed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > > Thanks for the review!
> > > > my comments after this are less about the patch and more about some
> > > > thoughts on the working of average fragment lists.
> > > > 
> > > > So going through the v2 and this patch got me thinking about what really
> > > > is going to happen when a user tries to allocate 32768 blocks which is also
> > > > the maximum value we could have in say ac->ac_g_ex.fe_len.
> > > > 
> > > > When this happens, ext4_mb_regular_allocator() will directly set the
> > > > criteria as CR_GOAL_LEN_FAST. Now, we'll follow:
> > > > 
> > > > ext4_mb_choose_next_group_goal_fast()
> > > >     for (i=mb_avg_fragment_size_order(); i < MB_NUM_ORDERS; i++) { .. }
> > > > 
> > > > Here, mb_avg_fragment_siz_order() will do something like:
> > > > 
> > > >     order = fls(32768) - 2 = 14
> > > >     ...
> > > >     if (order == MB_NUM_ORDERS(sb))
> > > >       order--;
> > > > 
> > > >     return order;
> > > > 
> > > > And we'll look in the fragment list[13] and since none of the groups
> > > > there would have 32768 blocks free (since we dont track it here) we'll
> > > > unnecessarily traverse the full list before falling to CR_BEST_AVAIL_LEN
> > > > (this will become a noop due to the way order and min_order
> > > > are calculated) and eventually to CR_GOAL_LEN_SLOW where we might get
> > > > something or end up splitting.
> > > That's not quite right, in ext4_mb_choose_next_group_goal_fast() even
> > > though we're looking for the group with order 13, the group with 32768
> > > free blocks is also in there. So after passing ext4_mb_good_group() in
> > > ext4_mb_find_good_group_avg_frag_lists(), we get a group with 32768
> > > free blocks. And in ext4_mb_choose_next_group_best_avail() we were
> > Hey Baokun,
> > 
> > So IIUC, a BG with 32768 blocks free will have bb_fragments = 0 and in
> > mb_update_avg_fragment_size() we exit early if a BG has bb_fragments = 0
> > hence it won't show up in the order 13 list.
> Hello Ojaswin,
> 
> This sounded strange, so I added the following debugging information:
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index c65fac9b8c72..c6ec423e2971 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1212,6 +1212,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>                         i = mb_find_next_zero_bit(bitmap, max, i);
>         }
>         grp->bb_fragments = fragments;
> +       pr_err(">>> greoup: %u, bb_free: %d, bb_fragments: %d\n",
> grp->bb_group, grp->bb_free, grp->bb_fragments);
> 
>         if (free != grp->bb_free) {
>                 ext4_grp_locked_error(sb, group, 0, 0,
> 
> 
> Then mount an ext4 image , wait for a moment , and got the
> following printout:
> 
> >>> greoup: 6, bb_free: 32768, bb_fragments: 1
> >>> greoup: 5, bb_free: 31741, bb_fragments: 1
> >>> greoup: 4, bb_free: 32768, bb_fragments: 1
> >>> greoup: 3, bb_free: 31741, bb_fragments: 1
> >>> greoup: 2, bb_free: 32768, bb_fragments: 1
> >>> greoup: 1, bb_free: 31741, bb_fragments: 1
> >>> greoup: 0, bb_free: 23511, bb_fragments: 1

Ahh right, the fragments would be 1 and not zero, my bad! Thanks
for testing this.

> > > supposed to allocate blocks quickly by trim order, so it's necessary
> > > here too. So there are no unnecessary loops here.
> > > 
> > > But this will trigger the freshly added WARN_ON_ONCE, so in the
> > > new iteration I need to change it to:
> > > 
> > > if (WARN_ON_ONCE(order > MB_NUM_ORDERS(ac->ac_sb) + 1))
> > >          order = MB_NUM_ORDERS(ac->ac_sb) - 1;
> > > 
> > > In addition, when the block size is 4k, there are these limitations:
> > > 
> > > 1) Limit the maximum size of the data allocation estimate to 8M in
> > >      ext4_mb_normalize_request().
> > > 2) #define MAX_WRITEPAGES_EXTENT_LEN 2048
> > > 3) #define DIO_MAX_BLOCKS 4096
> > > 4) Metadata is generally not allocated in many blocks at a time
> > > 
> > > So it seems that only group_prealloc will allocate more than 2048
> > > blocks at a time.
> > > 
> > > And I've tried removing those 8M/2048/4096 limits before, but the
> > > performance of DIO write barely changed, and it doesn't look like
> > > the performance bottleneck is here in the number of blocks allocated
> > > at a time at the moment.
> > Ohh that's interesting, on paper I think it does seem like it should
> > improve the performance. I think if CR_GOAL_LEN_FAST can start including
> > blocks which are completely empty, and lift those restrictions then we
> > might see better performance. I'll try to play around a bit with this as
> > well.
> > 
> > Regards,
> > ojaswin
> > 
> OK, waiting for your good news.

:) 
> 
> -- 
> With Best Regards,
> Baokun Li
> .