Re: [PATCH v3 4/9] ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists()

On 2024/3/20 2:25, Ojaswin Mujoo wrote:
> On Tue, Mar 19, 2024 at 06:05:53PM +0800, Baokun Li wrote:
>> On 2024/3/18 20:39, Ojaswin Mujoo wrote:
>>> On Thu, Mar 14, 2024 at 10:09:01PM +0800, Baokun Li wrote:
>>>> --- a/fs/ext4/mballoc.c
>>>> +++ b/fs/ext4/mballoc.c
>>>> @@ -831,6 +831,8 @@ static int mb_avg_fragment_size_order(struct super_block *sb, ext4_grpblk_t len)
>>>>  		return 0;
>>>>  	if (order == MB_NUM_ORDERS(sb))
>>>>  		order--;
>>>> +	if (WARN_ON_ONCE(order > MB_NUM_ORDERS(sb)))
>>>> +		order = MB_NUM_ORDERS(sb) - 1;
>>> Hey Baokun,
>> Hi Ojaswin,
>>> Thanks for fixing this. This patch looks good to me, feel free to add:
>>>
>>> Reviewed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
>> Thanks for the review!
>>> my comments after this are less about the patch and more about some
>>> thoughts on the working of average fragment lists.
>>>
>>> So going through the v2 and this patch got me thinking about what really
>>> is going to happen when a user tries to allocate 32768 blocks, which is also
>>> the maximum value we could have in, say, ac->ac_g_ex.fe_len.
>>>
>>> When this happens, ext4_mb_regular_allocator() will directly set the
>>> criteria as CR_GOAL_LEN_FAST. Now, we'll follow:
>>>
>>> ext4_mb_choose_next_group_goal_fast()
>>>     for (i = mb_avg_fragment_size_order(); i < MB_NUM_ORDERS; i++) { .. }
>>>
>>> Here, mb_avg_fragment_size_order() will do something like:
>>>
>>>     order = fls(32768) - 2 = 14
>>>     ...
>>>     if (order == MB_NUM_ORDERS(sb))
>>>       order--;
>>>
>>>     return order;
>>>
>>> And we'll look in fragment list[13], and since none of the groups
>>> there would have 32768 blocks free (since we don't track it here) we'll
>>> unnecessarily traverse the full list before falling back to CR_BEST_AVAIL_LEN
>>> (this will become a noop due to the way order and min_order
>>> are calculated) and eventually to CR_GOAL_LEN_SLOW where we might get
>>> something or end up splitting.
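
To make the arithmetic above concrete, here is a small self-contained
sketch of that mapping (my illustration, not the kernel code; fls_like()
and the constants are stand-ins, and it assumes a 4K block size, for
which MB_NUM_ORDERS(sb) works out to s_blocksize_bits + 2 = 14):

#include <stdio.h>

#define BLOCKSIZE_BITS  12                      /* 4K blocks */
#define MB_NUM_ORDERS   (BLOCKSIZE_BITS + 2)    /* 14 for 4K */

/* Behaves like the kernel's fls(): fls(32768) == 16 */
static int fls_like(unsigned int x)
{
        int r = 0;

        while (x) {
                r++;
                x >>= 1;
        }
        return r;
}

/* Mirrors the order computation quoted above */
static int avg_fragment_size_order(int len)
{
        int order = fls_like(len) - 2;  /* no lists for the first two orders */

        if (order < 0)
                return 0;
        if (order == MB_NUM_ORDERS)     /* 14: clamp into the last list */
                order--;
        return order;
}

int main(void)
{
        /* a goal of 32768 blocks (a whole group) maps to list[13] */
        printf("list index for 32768 blocks: %d\n",
               avg_fragment_size_order(32768));
        return 0;
}

So a request for a whole group starts the CR_GOAL_LEN_FAST scan at
list[13], exactly as in the walkthrough above.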
>> That's not quite right: in ext4_mb_choose_next_group_goal_fast(), even
>> though we're looking in the order 13 list, the groups with 32768 free
>> blocks are also in there. So after passing ext4_mb_good_group() in
>> ext4_mb_find_good_group_avg_frag_lists(), we get a group with 32768
>> free blocks. And in ext4_mb_choose_next_group_best_avail() we were
> Hey Baokun,
>
> So IIUC, a BG with 32768 blocks free will have bb_fragments = 0, and in
> mb_update_avg_fragment_size() we exit early if a BG has bb_fragments = 0,
> hence it won't show up in the order 13 list.
Hello Ojaswin,

This sounded strange, so I added the following debugging information:

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c65fac9b8c72..c6ec423e2971 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1212,6 +1212,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
                        i = mb_find_next_zero_bit(bitmap, max, i);
        }
        grp->bb_fragments = fragments;
+       pr_err(">>> group: %u, bb_free: %d, bb_fragments: %d\n", grp->bb_group, grp->bb_free, grp->bb_fragments);

        if (free != grp->bb_free) {
                ext4_grp_locked_error(sb, group, 0, 0,


Then I mounted an ext4 image, waited for a moment, and got the
following printout:

>>> group: 6, bb_free: 32768, bb_fragments: 1
>>> group: 5, bb_free: 31741, bb_fragments: 1
>>> group: 4, bb_free: 32768, bb_fragments: 1
>>> group: 3, bb_free: 31741, bb_fragments: 1
>>> group: 2, bb_free: 32768, bb_fragments: 1
>>> group: 1, bb_free: 31741, bb_fragments: 1
>>> group: 0, bb_free: 23511, bb_fragments: 1
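
So a completely free group is not skipped: all of its free space is one
contiguous extent, which is why the printout shows bb_fragments as 1
rather than 0. Feeding those numbers through the same arithmetic as the
sketch above (again assuming 4K blocks): the average fragment size is
bb_free / bb_fragments = 32768 / 1 = 32768, and fls(32768) - 2 = 14,
which gets clamped to 13, so the fully free groups really do sit on the
order-13 list.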
>> supposed to allocate blocks quickly by trim order, so it's necessary
>> here too. So there are no unnecessary loops here.
>>
>> But this will trigger the freshly added WARN_ON_ONCE, so in the
>> new iteration I need to change it to:
>>
>> if (WARN_ON_ONCE(order > MB_NUM_ORDERS(ac->ac_sb) + 1))
>>         order = MB_NUM_ORDERS(ac->ac_sb) - 1;
>>
>> In addition, when the block size is 4K, there are these limitations:
>>
>> 1) ext4_mb_normalize_request() caps the estimated size of a data
>>    allocation at 8M.
>> 2) #define MAX_WRITEPAGES_EXTENT_LEN 2048
>> 3) #define DIO_MAX_BLOCKS 4096
>> 4) Metadata is generally not allocated many blocks at a time.
>>
>> So it seems that only group_prealloc will allocate more than 2048
>> blocks at a time.
>>
>> And I've tried removing those 8M/2048/4096 limits before, but the
>> performance of DIO writes barely changed, and it doesn't look like
>> the number of blocks allocated at a time is the performance
>> bottleneck at the moment.
> Ohh, that's interesting; on paper it does seem like it should improve
> the performance. I think if CR_GOAL_LEN_FAST can start including block
> groups which are completely empty, and we lift those restrictions, then
> we might see better performance. I'll try to play around a bit with
> this as well.
>
> Regards,
> ojaswin

OK, waiting for your good news.

--
With Best Regards,
Baokun Li