On Fri, Apr 23, 2021 at 10:28 PM Wang Yugui <wangyugui@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> > On Fri, Apr 23, 2021 at 1:07 AM Wang Yugui <wangyugui@xxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > > With this patch, the problem yet not happen after 4 tests(5.10.x).
> > >
> > > With this patch , another problem happened at 6th test.
> > >
> > > kernel BUG at mm/huge_memory.c:2343!
> > > static void unmap_page(struct page *page)
> > > {
> > > 	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
> > > 		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> > > 	bool unmap_success;
> > >
> > > 	VM_BUG_ON_PAGE(!PageHead(page), page);
> > >
> > > 	if (PageAnon(page))
> > > 		ttu_flags |= TTU_SPLIT_FREEZE;
> > >
> > > 	unmap_success = try_to_unmap(page, ttu_flags);
> > > L2343:VM_BUG_ON_PAGE(!unmap_success,page);
> >
> > Thanks for running the test. This is what I expected from the debug
> > patch. It means try_to_unmap() didn't unmap the huge page
> > successfully. The huge page is PTE-mapped, try_to_unmap() is supposed
> > to unmap every mapped subpage. But it seems it didn't unmap any
> > subpage at all (the refcount of the huge page is 512 per the log from
> > earlier email).
> >
> > By reading the code, I didn't figure out what went wrong yet. You
> > mentioned that the 5.4.x kernel is fine, so may you try to do some
> > bisect?
>
> This maybe happen on some memory reclaim path.

Yes, it does. The stack trace already showed so.

>
> Our application need to process the file about 300G-400G.
>
> We have 4 servers, two servers have 192G memory, 1 server has 512G
> memory, 1 server has 768G memory.
>
> If the memory(total memory * 10 / 12 - 120G) is enough to process the
> files, no temp file is needed. else, we will write the buffer to temp
> file, and continue to process another part.
>
> this problem happened on the server with 192G memory && kernel 5.10.x,
> but yet not happen on the server with kernel 5.4.x ||
> total memory>=512G.
>
> so this maybe a timing problem too.
> debug code maybe useful than code bisect?

If you want to add debug code, there are a lot of places it could go.
I'd suggest starting with page_vma_mapped_walk(), particularly
check_pte(). I suspect it didn't find valid PTEs, since the unmap
itself should be quite simple. (I assume CONFIG_MIGRATION is enabled.)
After that you can try adding debug code to try_to_unmap_one().

I'm also not sure whether khugepaged could race with the split. It
sounds unlikely, but support for collapsing PTE-mapped THP was added in
v5.8, so you could try to reproduce this on v5.7 to narrow it down.

>
> fedora with new linux kernel configured with CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y,
> so new linux kernel with CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y maybe not well
> tested?
>
> Best Regards
> Wang Yugui (wangyugui@xxxxxxxxxxxx)
> 2021/04/24
>