Re: [linux-next:master] [mm/readahead] 13da30d6f9: BUG:soft_lockup-CPU##stuck_for#s![usemem:#]

hi, Yafang,

On Tue, Dec 03, 2024 at 05:33:16PM +0800, Yafang Shao wrote:
> On Tue, Dec 3, 2024 at 11:04 AM Oliver Sang <oliver.sang@xxxxxxxxx> wrote:
> >
> > hi, Yafang,
> >
> > On Tue, Dec 03, 2024 at 10:14:50AM +0800, Yafang Shao wrote:
> > > On Fri, Nov 29, 2024 at 11:19 PM kernel test robot
> > > <oliver.sang@xxxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > > Hello,
> > > >
> > > > kernel test robot noticed "BUG:soft_lockup-CPU##stuck_for#s![usemem:#]" on:
> > > >
> > > > commit: 13da30d6f9150dff876f94a3f32d555e484ad04f ("mm/readahead: fix large folio support in async readahead")
> > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > >
> > > > [test failed on linux-next/master cfba9f07a1d6aeca38f47f1f472cfb0ba133d341]
> > > >
> > > > in testcase: vm-scalability
> > > > version: vm-scalability-x86_64-6f4ef16-0_20241103
> > > > with following parameters:
> > > >
> > > >         runtime: 300s
> > > >         test: mmap-xread-seq-mt
> > > >         cpufreq_governor: performance
> > > >
> > > >
> > > >
> > > > config: x86_64-rhel-9.4
> > > > compiler: gcc-12
> > > > test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
> > > >
> > > > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > > >
> > > >
> > > >
> > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > the same patch/commit), kindly add following tags
> > > > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > > | Closes: https://lore.kernel.org/oe-lkp/202411292300.61edbd37-lkp@xxxxxxxxx
> > > >
> > > >
> >
> > [...]
> >
> > >
> > > Is this issue consistently reproducible?
> > > I attempted to reproduce it using the mmap-xread-seq-mt test case but
> > > was unsuccessful.
> >
> > In our tests the issue is quite persistent. As shown below, it reproduced in
> > all 8 runs (100%), while the parent commit stays clean.
> >
> > d1aa0c04294e2988 13da30d6f9150dff876f94a3f32
> > ---------------- ---------------------------
> >        fail:runs  %reproduction    fail:runs
> >            |             |             |
> >            :8          100%           8:8     dmesg.BUG:soft_lockup-CPU##stuck_for#s![usemem:#]
> >            :8          100%           8:8     dmesg.Kernel_panic-not_syncing:softlockup:hung_tasks
> >
> > To rule out any environment issue, we will rebuild the kernel and rerun more
> > tests. If it still reproduces consistently, we will follow up on your further
> > requests. Thanks.
> 
> Although I’ve made extensive attempts, I haven’t been able to
> reproduce the issue. My best guess is that, in the non-MADV_HUGEPAGE
> case, ra->size might be increasing to an unexpectedly large value. If
> that’s the case, I believe the issue can be resolved with the
> following additional change:
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9b8a48e736c6..e30132bc2593 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -385,8 +385,6 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra,
>                 return 4 * cur;
>         if (cur <= max / 2)
>                 return 2 * cur;
> -       if (cur > max)
> -               return cur;
>         return max;
>  }
> 
> @@ -644,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl,
>                         1UL << order);
>         if (index == expected) {
>                 ra->start += ra->size;
> -               ra->size = get_next_ra_size(ra, max_pages);
> +               /*
> +                * For the MADV_HUGEPAGE case, the ra->size might be larger than
> +                * the max_pages.
> +                */
> +               ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
>                 ra->async_size = ra->size;
>                 goto readit;
>         }
> 
> Could you please test this if you can consistently reproduce the bug?
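
For readers following the diff, here is a rough userspace sketch of what the
two hunks change. It is not kernel code: every name ending in _model is
hypothetical, and the first threshold (max / 16) is not visible in the diff
context, so it is an assumption about the surrounding function. The sketch
only models the window-size arithmetic, not the readahead I/O itself.

```c
#include <assert.h>

/* Hypothetical stand-in for the one field of struct file_ra_state used here. */
struct ra_model {
	unsigned long size;	/* current readahead window, in pages */
};

/* Behaviour with 13da30d6f9: a window already above max is returned as-is. */
static unsigned long next_ra_size_with_bypass_model(const struct ra_model *ra,
						    unsigned long max)
{
	unsigned long cur = ra->size;

	if (cur < max / 16)	/* assumed threshold, not shown in the diff */
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	if (cur > max)		/* the two lines the proposed patch removes */
		return cur;
	return max;
}

/* Behaviour after the proposed patch: the result never exceeds max. */
static unsigned long next_ra_size_clamped_model(const struct ra_model *ra,
						unsigned long max)
{
	unsigned long cur = ra->size;

	if (cur < max / 16)	/* assumed threshold, not shown in the diff */
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;
}

/* Async path after the patch: keep an over-sized window instead of shrinking. */
static unsigned long async_next_size_model(const struct ra_model *ra,
					   unsigned long max)
{
	unsigned long next = next_ra_size_clamped_model(ra, max);

	return ra->size > next ? ra->size : next;
}

/* Small self-check of the model, using max = 32 pages as an example value. */
static void ra_model_self_check(void)
{
	struct ra_model small = { .size = 8 };
	struct ra_model huge = { .size = 512 };

	assert(next_ra_size_clamped_model(&small, 32) == 16);
	assert(next_ra_size_with_bypass_model(&huge, 32) == 512);
	assert(next_ra_size_clamped_model(&huge, 32) == 32);
	assert(async_next_size_model(&huge, 32) == 512);
}
```

In this model, a 512-page window with max = 32 stays at 512 on the async path
both before and after the patch. The difference is that any other caller of
the helper now gets 32 back instead of 512, which would be consistent with the
guess above that ra->size was growing unexpectedly in the non-MADV_HUGEPAGE
case.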

With this patch applied, we confirmed that the issue is gone on both platforms.

Tested-by: kernel test robot <oliver.sang@xxxxxxxxx>

In the tables below, d18114f8dcb33d7ed6216673903 is 13da30d6f9 with only your patch applied on top.

On the Cooper Lake machine from our original report:

d1aa0c04294e2988 13da30d6f9150dff876f94a3f32 d18114f8dcb33d7ed6216673903
---------------- --------------------------- ---------------------------
       fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
           |             |             |             |             |
           :20          75%          15:20           0%            :20    dmesg.BUG:soft_lockup-CPU##stuck_for#s![usemem:#]
           :20          75%          15:20           0%            :20    dmesg.Kernel_panic-not_syncing:softlockup:hung_tasks

On another platform (Ice Lake):

d1aa0c04294e2988 13da30d6f9150dff876f94a3f32 d18114f8dcb33d7ed6216673903
---------------- --------------------------- ---------------------------
       fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
           |             |             |             |             |
           :10          50%           5:10           0%            :20    dmesg.BUG:soft_lockup-CPU##stuck_for#s![usemem:#]
           :10          50%           5:10           0%            :20    dmesg.Kernel_panic-not_syncing:softlockup:hung_tasks


> 
> --
> Regards
> Yafang
