Hi,

> On Sat, Apr 17, 2021 at 1:33 AM Wang Yugui <wangyugui@xxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > > On Mon, Apr 12, 2021 at 3:07 AM Wang Yugui <wangyugui@xxxxxxxxxxxx> wrote:
> > > >
> > > > Hi,
> > > >
> > > > kernel BUG at mm/huge_memory.c:2736(linux 5.10.29) is triggered
> > > > by some files write test.
> > > >
> > > > mm/huge_memory.c:
> > > >	if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
> > > >		pr_alert("total_mapcount: %u, page_count(): %u\n",
> > > >				mapcount, count);
> > > >		if (PageTail(page))
> > > >			dump_page(head, NULL);
> > > >		dump_page(page, "total_mapcount(head) > 0");
> > > > L2736:	BUG();
> > > >	}
> > >
> > > We just can tell the mapcount of the page is not zero from the current
> > > log, it might mean the unmap_page() call is failed. It seems you have
> > > CONFIG_DEBUG_VM enabled, could you please paste more log? There is
> > > "VM_BUG_ON_PAGE(!unmap_success, page)" in unmap_page(). It should be
> > > able to tell us if unmap_page() is failed or not, or something else
> > > happened.
> >
> > This is the full dmesg output
> >
> > [63080.331513] huge_memory: total_mapcount: 511, page_count(): 512
> > [63080.332167] page:00000000d2e1a982 refcount:512 mapcount:0 mapping:0000000000000000 index:0x7fe260582 pfn:0x676a00
> > [63080.332167] head:00000000d2e1a982 order:9 compound_mapcount:0 compound_pincount:0
> > [63080.332167] anon flags: 0x17ffffc009001d(locked|uptodate|dirty|lru|head|swapbacked)
> > [63080.332167] raw: 0017ffffc009001d ffffc93cda0d0008 ffffc93cd9ab0008 ffff8f21be9f0cb9
> > [63080.332167] raw: 00000007fe260582 0000000000000000 00000200ffffffff ffff8f1021810000
> > [63080.332167] page->mem_cgroup:ffff8f1021810000
> > [63080.332167] page:00000000bc78ac24 refcount:512 mapcount:1 mapping:0000000000000000 index:0x7fe260584 pfn:0x676a02
> > [63080.332167] head:00000000d2e1a982 order:9 compound_mapcount:0 compound_pincount:0
> > [63080.332167] anon flags: 0x17ffffc009001d(locked|uptodate|dirty|lru|head|swapbacked)
> > [63080.332167] raw: 0017ffffc0000000 ffffc93cd9da8001 dead000000000000 ffffc93d428d0098
> > [63080.332167] raw: ffffa002cd183bf0 0000000000000000 0000000000000000 0000000000000000
> > [63080.332167] head: 0017ffffc009001d ffffc93cda0d0008 ffffc93cd9ab0008 ffff8f21be9f0cb9
> > [63080.332167] head: 00000007fe260582 0000000000000000 00000200ffffffff ffff8f1021810000
> > [63080.332167] page dumped because: total_mapcount(head) > 0
>
> Added Kirill in this loop too, he may have some insights.
>
> Thanks a lot for pasting the full log. It seems the BUG_ON in
> unmap_page() and VM_BUG_ON_PAGE(compound_mapcount(head), head) were
> not triggered. But the dumped page shows its total_mapcount is 511. It
> means 511 subpages of the huge page are PTE mapped. It seems all tail
> pages are PTE mapped. It may be because unmap_page() is failed or they
> are mapped again after unmap_page().
>
> But the VM_BUG_ON_PAGE just checks compound_mapcount, and it seems
> page_mapcount() call in unmap_page() also just checks
> compound_mapcount and the mapcount of the head page. If the mapcount
> of the head page is 0 and compound_mapcount is also 0, try_to_unmap()
> considers unmap is successful.
>
> So we can't tell which case it is although I don't think of how
> unmap_page() could fail for this case. I think we should check the
> total mapcount in try_to_unmap() instead.
>
> Can you please try the below debug patch (untested) to help narrow
> down the problem?
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ae907a9c2050..c10e89be1c99 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2726,7 +2726,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  	}
>
>  	unmap_page(head);
> -	VM_BUG_ON_PAGE(compound_mapcount(head), head);
> +	VM_BUG_ON_PAGE(total_mapcount(head), head);
>
>  	/* block interrupt reentry in xa_lock and spinlock */
>  	local_irq_disable();
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b0fc27e77d6d..537dfc557744 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1777,7 +1777,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
>  	else
>  		rmap_walk(page, &rwc);
>
> -	return !page_mapcount(page) ? true : false;
> +	return !total_mapcount(page) ? true : false;
>  }
>
>  /**
>

With this patch, the problem has not happened yet after 4 tests (5.10.x).

By the way, the problem does not happen in 5.4.x (more than about 120 tests).
Does this match the code version?

Best Regards
Wang Yugui (wangyugui@xxxxxxxxxxxx)
2021/04/23
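
For readers following the thread, the difference between the two checks touched by the debug patch can be illustrated with a small user-space model. This is only a sketch under assumptions: `struct model_thp` and every `model_*` name are hypothetical and are not kernel APIs, the model ignores PageDoubleMap and the -1 bias of the kernel's atomic counters, and the page state is a guess based on the dump (head subpage unmapped, compound_mapcount 0, every tail subpage PTE mapped once).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_HPAGE_NR 512 /* subpages in an order-9 huge page */

/* Hypothetical user-space model; NOT the kernel's struct page. */
struct model_thp {
	int compound_mapcount;            /* PMD mappings of the whole page */
	int sub_mapcount[MODEL_HPAGE_NR]; /* PTE mappings of each subpage */
};

/* Roughly what a page_mapcount(head)-style check sees:
 * the head subpage plus the compound mapcount only. */
static int model_page_mapcount_head(const struct model_thp *p)
{
	return p->sub_mapcount[0] + p->compound_mapcount;
}

/* Roughly what a total_mapcount(head)-style check sees:
 * the compound mapcount plus every subpage. */
static int model_total_mapcount(const struct model_thp *p)
{
	int total = p->compound_mapcount;

	for (size_t i = 0; i < MODEL_HPAGE_NR; i++)
		total += p->sub_mapcount[i];
	return total;
}

/* Old return expression: success if the head-only count is zero. */
static bool model_unmap_ok_old(const struct model_thp *p)
{
	return !model_page_mapcount_head(p);
}

/* Patched return expression: success only if nothing is mapped. */
static bool model_unmap_ok_new(const struct model_thp *p)
{
	return !model_total_mapcount(p);
}

/* Assumed state from the dump: head unmapped, each tail PTE mapped once. */
static struct model_thp model_from_dump(void)
{
	struct model_thp p = { 0 };

	for (size_t i = 1; i < MODEL_HPAGE_NR; i++)
		p.sub_mapcount[i] = 1;
	assert(p.sub_mapcount[0] == 0);
	return p;
}
```

In this model the head-only count is 0, so the old expression reports a successful unmap, while the summed count is 511 and the patched expression reports failure. That is the gap the debug patch is meant to expose; it does not say whether unmap_page() failed or the subpages were mapped again afterwards.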