Re: [linux-next:master] [mm/readahead] 13da30d6f9: BUG:soft_lockup-CPU##stuck_for#s![usemem:#]

Yafang Shao <laoar.shao@xxxxxxxxx> · Fri, 6 Dec 2024 13:38:53 +0800

On Fri, Dec 6, 2024 at 10:27 AM Oliver Sang <oliver.sang@xxxxxxxxx> wrote:
>
> hi, Yafang,
>
> On Tue, Dec 03, 2024 at 05:33:16PM +0800, Yafang Shao wrote:
> > On Tue, Dec 3, 2024 at 11:04 AM Oliver Sang <oliver.sang@xxxxxxxxx> wrote:
> > >
> > > hi, Yafang,
> > >
> > > On Tue, Dec 03, 2024 at 10:14:50AM +0800, Yafang Shao wrote:
> > > > On Fri, Nov 29, 2024 at 11:19 PM kernel test robot
> > > > <oliver.sang@xxxxxxxxx> wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![usemem:#]" on:
> > > > >
> > > > > commit: 13da30d6f9150dff876f94a3f32d555e484ad04f ("mm/readahead: fix large folio support in async readahead")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > > >
> > > > > [test failed on linux-next/master cfba9f07a1d6aeca38f47f1f472cfb0ba133d341]
> > > > >
> > > > > in testcase: vm-scalability
> > > > > version: vm-scalability-x86_64-6f4ef16-0_20241103
> > > > > with following parameters:
> > > > >
> > > > >         runtime: 300s
> > > > >         test: mmap-xread-seq-mt
> > > > >         cpufreq_governor: performance
> > > > >
> > > > >
> > > > >
> > > > > config: x86_64-rhel-9.4
> > > > > compiler: gcc-12
> > > > > test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
> > > > >
> > > > > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > > > >
> > > > >
> > > > >
> > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > > the same patch/commit), kindly add following tags
> > > > > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > > > | Closes: https://lore.kernel.org/oe-lkp/202411292300.61edbd37-lkp@xxxxxxxxx
> > > > >
> > > > >
> > >
> > > [...]
> > >
> > > >
> > > > Is this issue consistently reproducible?
> > > > I attempted to reproduce it using the mmap-xread-seq-mt test case but
> > > > was unsuccessful.
> > >
> > > in our tests, the issue is quite persistent. as below, 100% reproduced in all
> > > 8 runs, keeps clean on parent.
> > >
> > > d1aa0c04294e2988 13da30d6f9150dff876f94a3f32
> > > ---------------- ---------------------------
> > >        fail:runs  %reproduction    fail:runs
> > >            |             |             |
> > >            :8          100%           8:8     dmesg.BUG:soft_lockup-CPU##stuck_for#s![usemem:#]
> > >            :8          100%           8:8     dmesg.Kernel_panic-not_syncing:softlockup:hung_tasks
> > >
> > > to avoid any env issue, we rebuild kernel and rerun more to check. if still
> > > consistently reproduced, we will follow your further requests. thanks
> >
> > Although I’ve made extensive attempts, I haven’t been able to
> > reproduce the issue. My best guess is that, in the non-MADV_HUGEPAGE
> > case, ra->size might be increasing to an unexpectedly large value. If
> > that’s the case, I believe the issue can be resolved with the
> > following additional change:
> >
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 9b8a48e736c6..e30132bc2593 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -385,8 +385,6 @@ static unsigned long get_next_ra_size(struct
> > file_ra_state *ra,
> >                 return 4 * cur;
> >         if (cur <= max / 2)
> >                 return 2 * cur;
> > -       if (cur > max)
> > -               return cur;
> >         return max;
> >  }
> >
> > @@ -644,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
> >                         1UL << order);
> >         if (index == expected) {
> >                 ra->start += ra->size;
> > -               ra->size = get_next_ra_size(ra, max_pages);
> > +               /*
> > +                * For the MADV_HUGEPAGE case, the ra->size might be larger than
> > +                * the max_pages.
> > +                */
> > +               ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
> >                 ra->async_size = ra->size;
> >                 goto readit;
> >         }
> >
> > Could you please test this if you can consistently reproduce the bug?
>
> by this patch, we confirmed the issue gone on both platforms.
>
> Tested-by: kernel test robot <oliver.sang@xxxxxxxxx>

Great! Thanks for your work. I'll send a new version.

--
Regards
Yafang