Re: kernel BUG at mm/huge_memory.c:2736 (linux 5.10.29)


 



On Fri, Apr 23, 2021 at 10:28 PM Wang Yugui <wangyugui@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> > On Fri, Apr 23, 2021 at 1:07 AM Wang Yugui <wangyugui@xxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > > With this patch, the problem has not yet happened after 4 tests (5.10.x).
> > >
> > > With this patch, another problem happened at the 6th test.
> > >
> > > kernel BUG at mm/huge_memory.c:2343!
> > > static void unmap_page(struct page *page)
> > > {
> > >     enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
> > >         TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> > >     bool unmap_success;
> > >
> > >     VM_BUG_ON_PAGE(!PageHead(page), page);
> > >
> > >     if (PageAnon(page))
> > >         ttu_flags |= TTU_SPLIT_FREEZE;
> > >
> > >     unmap_success = try_to_unmap(page, ttu_flags);
> > >     VM_BUG_ON_PAGE(!unmap_success, page);    /* <-- line 2343 */
> >
> > Thanks for running the test. This is what I expected from the debug
> > patch. It means try_to_unmap() didn't unmap the huge page
> > successfully. The huge page is PTE-mapped, try_to_unmap() is supposed
> > to unmap every mapped subpage. But it seems it didn't unmap any
> > subpage at all (the refcount of the huge page is 512 per the log from
> > earlier email).
> >
> > From reading the code, I haven't figured out what went wrong yet. You
> > mentioned that the 5.4.x kernel is fine, so could you try bisecting
> > between the two?
>
> This may happen on some memory reclaim path.

Yes, it does. The stack trace already showed so.

>
> Our application needs to process files of about 300G-400G.
>
> We have 4 servers: two have 192G of memory, one has 512G, and one
> has 768G.
>
> If the memory budget (total memory * 10 / 12 - 120G) is enough to process
> the files, no temp file is needed. Otherwise, we write the buffer to a
> temp file and continue processing the next part.
>
> This problem happened on the server with 192G memory and kernel 5.10.x,
> but has not yet happened on the servers with kernel 5.4.x or with
> total memory >= 512G.
>
> So this may be a timing problem too. Would debug code be more useful than a code bisect?

If you want to add debug code, there are a lot of places where it could go.

I'd suggest adding some to page_vma_mapped_walk() first, particularly
in check_pte(). I suspect it didn't find valid PTEs, since the unmap
itself is quite simple. (I'm assuming CONFIG_MIGRATION is enabled.)

Then you can try to add debug code in try_to_unmap_one().

And I'm not sure whether khugepaged could race with the split. It
sounds unlikely, but support for collapsing PTE-mapped THP was added in
v5.8, so you could try to reproduce this on v5.7 to narrow it down.

>
> Fedora configures its newer kernels with CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y,
> so maybe newer kernels with CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y are not well
> tested?
>
> Best Regards
> Wang Yugui (wangyugui@xxxxxxxxxxxx)
> 2021/04/24
>



