Re: VM_BUG_ON(!tlb->end) on munmap() with CONT hugetlb pages

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Mon, 7 Mar 2022 15:06:07 -0800

On 2/28/22 16:26, Mike Kravetz wrote:
> On 2/28/22 07:39, David Hildenbrand wrote:
>> Hi,
>>
>> playing with anonymous CONT hugetlb pages on aarch64, I stumbled over the following VM_BUG_ON:
>>
>> [  124.770288] ------------[ cut here ]------------
>> [  124.774899] kernel BUG at mm/mmu_gather.c:70!
>> [  124.779244] Internal error: Oops - BUG: 0 [#1] SMP
>> [  124.784022] Modules linked in: mlx4_ib ib_uverbs ib_core mlx4_en rfkill vfat fat acpi_ipmi joydev ipmi_ssif igb mlx4_core ipmi_devintf ipmi_msghandler cppc_cpufreq fuse zram ip_tables xfs uas usb_storage dwc3 ulpi ast udc_core i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm_ttm_helper ttm crct10dif_ce drm ghash_ce sbsa_gwdt i2c_xgene_slimpro xgene_hwmon ahci_platform gpio_dwapb xhci_plat_hcd
>> [  124.823045] CPU: 16 PID: 1160 Comm: test Not tainted 5.16.11-200.fc35.aarch64 #1
>> [  124.830428] Hardware name: Lenovo HR350A            7X35CTO1WW    /HR350A     , BIOS hve104r-1.15 02/26/2021
>> [  124.840240] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [  124.847189] pc : __tlb_remove_page_size+0x88/0xe4
>> [  124.851885] lr : __unmap_hugepage_range+0x260/0x504
>> [  124.856751] sp : ffff80000f6f3ae0
>> [  124.860053] x29: ffff80000f6f3ae0 x28: ffff00080b639d24 x27: ffff000802504080
>> [  124.867176] x26: fffffc00210f8000 x25: 0000000000000000 x24: ffff80000a9e8750
>> [  124.874299] x23: 0000ffff8da20000 x22: ffff000804f0c190 x21: 0000000000010000
>> [  124.881423] x20: ffff80000f6f3cb0 x19: ffff80000f6f3cb0 x18: 0000000000000000
>> [  124.888545] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>> [  124.895668] x14: 0000000000000000 x13: 0008000000000000 x12: 0008000000000080
>> [  124.902791] x11: 0008000000000000 x10: 00f80008c3e00f43 x9 : ffff800008404e60
>> [  124.909914] x8 : 0846000000000000 x7 : 0000000000000000 x6 : ffff80000a8a4000
>> [  124.917038] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000001000
>> [  124.924161] x2 : 0000000000010000 x1 : fffffc00210f8000 x0 : 0000000000000000
>> [  124.931284] Call trace:
>> [  124.933718]  __tlb_remove_page_size+0x88/0xe4
>> [  124.938062]  __unmap_hugepage_range+0x260/0x504
>> [  124.942580]  __unmap_hugepage_range_final+0x24/0x40
>> [  124.947445]  unmap_single_vma+0x100/0x11c
>> [  124.951443]  unmap_vmas+0x7c/0xf4
>> [  124.954746]  unmap_region+0xa4/0xf0
>> [  124.958222]  __do_munmap+0x1b8/0x50c
>> [  124.961785]  __vm_munmap+0x74/0x120
>> [  124.965261]  __arm64_sys_munmap+0x40/0x54
>> [  124.969257]  invoke_syscall+0x50/0x120
>> [  124.972995]  el0_svc_common.constprop.0+0x4c/0x100
>> [  124.977774]  do_el0_svc+0x34/0xa0
>> [  124.981077]  el0_svc+0x30/0xd0
>> [  124.984120]  el0t_64_sync_handler+0xa4/0x130
>> [  124.988377]  el0t_64_sync+0x1a4/0x1a8
>> [  124.992028] Code: b4000140 f9001660 29410402 17fffff4 (d4210000) 
>> [  124.998109] ---[ end trace a74a76b89c9f2d88 ]---
>> [  125.002900] ------------[ cut here ]------------
>>
>>
>> I'm running with 64k hugetlb on 4k aarch64. Reproducer:
>>
>> #define _GNU_SOURCE
>> #include <string.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>> #include <linux/memfd.h>
>>
>> void main(void)
>> {
>>         const size_t size = 64*1024;
>>         unsigned long cur;
>>         char *area;
>>         int fd;
>>
>>         fd = memfd_create("test", MFD_HUGETLB | MFD_HUGE_64KB);
>>         ftruncate(fd, size);
>>         area = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
>>
>>         memset(area, 0, size);
>>
>>         munmap(area, size);
>> }
>>
>>
>>
>> I assume __unmap_hugepage_range() does a
>>
>> a) tlb_remove_huge_tlb_entry()
>>
>> -> for sz != PMD_SIZE and sz != PUD_SIZE, this calls __tlb_remove_tlb_entry()
>>
>> -> __tlb_remove_tlb_entry() is a NOP on aarch64. __tlb_adjust_range() isn't called.
>>
>> b) tlb_remove_page_size()
>>
>> -> __tlb_remove_page_size() runs into VM_BUG_ON(!tlb->end);
>>
>>
>> Not sure if this is just "ok" and we don't have to adjust the range or if there is
>> some tlb range adjustment missing.
>>
> 
> To me, it looks like we are missing range adjustment in the case where
> hugetlb page size != PMD_SIZE and != PUD_SIZE.  Not sure how those ranges
> are being flushed because as you note tlb_remove_huge_tlb_entry is pretty
> much a NOP in this case on aarch64.
> 
> Cc'ing Will and Peter as they most recently changed this code.  Commit
> 2631ed00b049 "tlb: mmu_gather: add tlb_flush_*_range APIs" removed an
> unconditional call to __tlb_adjust_range() in tlb_remove_huge_tlb_entry.
> That might have taken care of range adjustments in earlier versions of
> the code.  Not exactly sure what is needed now.

I verified that commit 2631ed00b049 caused the VM_BUG when it removed the
unconditional call to __tlb_adjust_range().  However, I need some assistance
on the proper solution.

Just adding the __tlb_adjust_range() call to tlb_remove_huge_tlb_entry in
the case where size != PMD_SIZE and != PUD_SIZE will silence the BUG.
However, one outcome of 2631ed00b049 is that cleared_p* is set if
__tlb_adjust_range is ever called.

It 'seems' that tlb_flush_pte_range() should be called for the CONT PTE case
on arm64, and tlb_flush_pmd_range() should be called for CONT PMD.  But, this
would require an arch specific version of tlb_remove_huge_tlb_entry.

FYI - This same issue should exist on other architectures that support
hugetlb page sizes != PMD_SIZE and != PUD_SIZE.

Suggestions on how to proceed?
-- 
Mike Kravetz