On 22/03/2022 17:56, Catalin Marinas wrote:
Adding Steve as well who wrote the initial hugetlb code for arm64 (and
not trimming the quoted text).
Hey Catalin,
On Mon, Mar 21, 2022 at 05:34:18PM +0100, David Hildenbrand wrote:
On 08.03.22 00:06, Mike Kravetz wrote:
On 2/28/22 16:26, Mike Kravetz wrote:
On 2/28/22 07:39, David Hildenbrand wrote:
playing with anonymous CONT hugetlb pages on aarch64, I stumbled over the following VM_BUG_ON:
[ 124.770288] ------------[ cut here ]------------
[ 124.774899] kernel BUG at mm/mmu_gather.c:70!
[ 124.779244] Internal error: Oops - BUG: 0 [#1] SMP
[ 124.784022] Modules linked in: mlx4_ib ib_uverbs ib_core mlx4_en rfkill vfat fat acpi_ipmi joydev ipmi_ssif igb mlx4_core ipmi_devintf ipmi_msghandler cppc_cpufreq fuse zram ip_tables xfs uas usb_storage dwc3 ulpi ast udc_core i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm_ttm_helper ttm crct10dif_ce drm ghash_ce sbsa_gwdt i2c_xgene_slimpro xgene_hwmon ahci_platform gpio_dwapb xhci_plat_hcd
[ 124.823045] CPU: 16 PID: 1160 Comm: test Not tainted 5.16.11-200.fc35.aarch64 #1
[ 124.830428] Hardware name: Lenovo HR350A 7X35CTO1WW /HR350A , BIOS hve104r-1.15 02/26/2021
[ 124.840240] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 124.847189] pc : __tlb_remove_page_size+0x88/0xe4
[ 124.851885] lr : __unmap_hugepage_range+0x260/0x504
[ 124.856751] sp : ffff80000f6f3ae0
[ 124.860053] x29: ffff80000f6f3ae0 x28: ffff00080b639d24 x27: ffff000802504080
[ 124.867176] x26: fffffc00210f8000 x25: 0000000000000000 x24: ffff80000a9e8750
[ 124.874299] x23: 0000ffff8da20000 x22: ffff000804f0c190 x21: 0000000000010000
[ 124.881423] x20: ffff80000f6f3cb0 x19: ffff80000f6f3cb0 x18: 0000000000000000
[ 124.888545] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 124.895668] x14: 0000000000000000 x13: 0008000000000000 x12: 0008000000000080
[ 124.902791] x11: 0008000000000000 x10: 00f80008c3e00f43 x9 : ffff800008404e60
[ 124.909914] x8 : 0846000000000000 x7 : 0000000000000000 x6 : ffff80000a8a4000
[ 124.917038] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000001000
[ 124.924161] x2 : 0000000000010000 x1 : fffffc00210f8000 x0 : 0000000000000000
[ 124.931284] Call trace:
[ 124.933718] __tlb_remove_page_size+0x88/0xe4
[ 124.938062] __unmap_hugepage_range+0x260/0x504
[ 124.942580] __unmap_hugepage_range_final+0x24/0x40
[ 124.947445] unmap_single_vma+0x100/0x11c
[ 124.951443] unmap_vmas+0x7c/0xf4
[ 124.954746] unmap_region+0xa4/0xf0
[ 124.958222] __do_munmap+0x1b8/0x50c
[ 124.961785] __vm_munmap+0x74/0x120
[ 124.965261] __arm64_sys_munmap+0x40/0x54
[ 124.969257] invoke_syscall+0x50/0x120
[ 124.972995] el0_svc_common.constprop.0+0x4c/0x100
[ 124.977774] do_el0_svc+0x34/0xa0
[ 124.981077] el0_svc+0x30/0xd0
[ 124.984120] el0t_64_sync_handler+0xa4/0x130
[ 124.988377] el0t_64_sync+0x1a4/0x1a8
[ 124.992028] Code: b4000140 f9001660 29410402 17fffff4 (d4210000)
[ 124.998109] ---[ end trace a74a76b89c9f2d88 ]---
[ 125.002900] ------------[ cut here ]------------
I'm running with 64k hugetlb on 4k aarch64. Reproducer:
#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/memfd.h>
void main(void)
{
const size_t size = 64*1024;
unsigned long cur;
char *area;
int fd;
fd = memfd_create("test", MFD_HUGETLB | MFD_HUGE_64KB);
ftruncate(fd, size);
area = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
memset(area, 0, size);
munmap(area, size);
}
I assume __unmap_hugepage_range() does a
a) tlb_remove_huge_tlb_entry()
-> for sz != PMD_SIZE and sz != PUD_SIZE, this calls __tlb_remove_tlb_entry()
-> __tlb_remove_tlb_entry() is a NOP on aarch64. __tlb_adjust_range() isn't called.
b) tlb_remove_page_size()
-> __tlb_remove_page_size() runs into VM_BUG_ON(!tlb->end);
Not sure if this is just "ok" and we don't have to adjust the range or if there is
some tlb range adjustment missing.
To me, it looks like we are missing range adjustment in the case where
hugetlb page size != PMD_SIZE and != PUD_SIZE. Not sure how those ranges
are being flushed because as you note tlb_remove_huge_tlb_entry is pretty
much a NOP in this case on aarch64.
Cc'ing Will and Peter as they most recently changed this code. Commit
2631ed00b049 "tlb: mmu_gather: add tlb_flush_*_range APIs" removed an
unconditional call to __tlb_adjust_range() in tlb_remove_huge_tlb_entry.
That might have taken care of range adjustments in earlier versions of
the code. Not exactly sure what is needed now.
I verified that commit 2631ed00b049 caused the VM_BUG when it removed the
unconditional call to __tlb_adjust_range(). However, I need some assistance
on the proper solution.
Just adding the __tlb_adjust_range() call to tlb_remove_huge_tlb_entry in
the case where size != PMD_SIZE and != PUD_SIZE will silence the BUG.
However, one outcome of 2631ed00b049 is that cleared_p* is set if
__tlb_adjust_range is ever called.
It 'seems' that tlb_flush_pte_range() should be called for the CONT PTE case
on arm64, and tlb_flush_pmd_range() should be called for CONT PMD. But, this
would require an arch specific version of tlb_remove_huge_tlb_entry.
FYI - This same issue should exist on other architectures that support
hugetlb page sizes != PMD_SIZE and != PUD_SIZE.
Suggestions on how to proceed?
Unfortunately, I have absolutely no clue what would be the right thing
to do. Any aarch64 CONT experts?
At a quick look, we wouldn't have a problem with missing TLB flushing
since huge_ptep_get_and_clear() does this for contiguous PTEs. Not sure
why it needs this though, Steve added it in commit d8bdcff28764. I think
we can defer this flushing to tlb_remove_page_size().
The TLB flush in huge_ptep_get_and_clear() was added because it was
called by hugetlb_change_protection() without any flushing. The concern
was that, without the flush, it would be possible to get to different
views of the same contiguous huge page. (Being contiguous they were not
changed en masse atomically).
As Mike noted, tlb_remove_huge_tlb_entry() could call
tlb_flush_pte_range() and I think this would work even when dealing with
a CONT PMD case (just more flushing at page granularity). I need to
check the range TLBI ops in the arm64 __flush_tlb_range() but if they
are fine, we can do this as a quick fix.
A better solution is probably to allow arch-specific
tlb_remove_huge_tlb_entry() that understands whether it's a contiguous
pmd or pte and sets the clear_p* accordingly.
Yeah an arch specific tlb_remove_huge_tlb_entry() would make sense. The
current one assumes huge ptes can either be PMD_SIZE or PUD_SIZE.
Cheers,
--
Steve