On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:
> We recently made GUP's common page table walking code also walk hugetlb
> VMAs without most hugetlb special-casing, preparing for the future of
> having less hugetlb-specific page table walking code in the codebase.
> Turns out that we missed one page table locking detail: page table
> locking for hugetlb folios that are not mapped using a single PMD/PUD.
>
> Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB
> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks
> the page tables, will perform a pte_offset_map_lock() to grab the PTE
> table lock.
>
> However, hugetlb code that concurrently modifies these page tables
> would actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS,
> the locks would differ. Something similar can happen right now with
> hugetlb folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>
> This issue can be reproduced [1], for example triggering:
>
> [ 3105.936100] ------------[ cut here ]------------
> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
> [ 3105.944634] Modules linked in: [...]
> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 3105.991108] pc : try_grab_folio+0x11c/0x188
> [ 3105.994013] lr : follow_page_pte+0xd8/0x430
> [ 3105.996986] sp : ffff80008eafb8f0
> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
> [ 3106.047957] Call trace:
> [ 3106.049522] try_grab_folio+0x11c/0x188
> [ 3106.051996] follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
> [ 3106.055527] follow_page_mask+0x1a0/0x2b8
> [ 3106.058118] __get_user_pages+0xf0/0x348
> [ 3106.060647] faultin_page_range+0xb0/0x360
> [ 3106.063651] do_madvise+0x340/0x598
>
> Let's make huge_pte_lockptr() effectively use the same PT locks as any
> core-mm page table walker would. Add ptep_lockptr() to obtain the PTE
> page table lock using a pte pointer -- unfortunately we cannot convert
> pte_lockptr() because virt_to_page() doesn't work with kmap'ed page
> tables we can have with CONFIG_HIGHPTE.
>
> Take care of PTE tables possibly spanning multiple pages, and take care
> of CONFIG_PGTABLE_LEVELS complexity when e.g., PMD_SIZE == PUD_SIZE.
> For example, with CONFIG_PGTABLE_LEVELS == 2, core-mm would detect
> pmd_leaf() with hugepagesize==PMD_SIZE and use pmd_lockptr(), which
> would end up just mapping to the per-MM PT lock.
>
> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
> folio being mapped using two PTE page tables. While hugetlb wants to
> take the PMD table lock, core-mm would grab the PTE table lock of one
> of the two PTE page tables. In such corner cases, we have to make sure
> that both locks match, which is (fortunately!) currently guaranteed for
> 8xx as it does not support SMP and consequently doesn't use split PT
> locks.
>
> [1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@xxxxxxxxxx/
>
> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
> Reviewed-by: James Houghton <jthoughton@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Cc: Peter Xu <peterx@xxxxxxxxxx>
> Cc: Oscar Salvador <osalvador@xxxxxxx>
> Cc: Muchun Song <muchun.song@xxxxxxxxx>
> Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

Nitpick: I wonder whether some of the lines can be simplified if we
write it downwards from PUD, like:

	huge_pte_lockptr()
	{
		if (size >= PUD_SIZE)
			return pud_lockptr(...);
		if (size >= PMD_SIZE)
			return pmd_lockptr(...);
		/* Sub-PMD only applies to !CONFIG_HIGHPTE, see pte_alloc_huge() */
		WARN_ON(IS_ENABLED(CONFIG_HIGHPTE));
		return ptep_lockptr(...);
	}

The ">=" checks should avoid having to check against the pgtable
levels (e.g., PMD_SIZE == PUD_SIZE when levels are folded), iiuc.

The other nitpick is, I didn't yet find any arch that uses
non-zero-order pages for pte pgtables. I would give it a shot at
dropping the mask and see what explodes (I don't expect anything to,
per my read..), but yeah, I understand we already saw surprises due to
other things, so I think it's fine if this hugetlb path (which we're
removing anyway) does a bit more math, if you think that's easier for
you.
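To make that concrete: below is a rough, untested sketch of the shape
I mean. The ptep_lockptr() body is my assumption of what the masking
could look like -- align the pte pointer down to the naturally aligned
start of the PTE table (so that tables spanning multiple pages still
resolve to a single lock) and then do the usual ptdesc lookup; the
!USE_SPLIT_PTE_PTLOCKS fallback to &mm->page_table_lock is omitted:

	static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
	{
		/*
		 * We need virt_to_page() to work on the PTE table, so no
		 * kmap'ed page tables; sub-PMD hugetlb only applies to
		 * !CONFIG_HIGHPTE, see pte_alloc_huge().
		 */
		BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
		/*
		 * PTE tables might be allocated with a non-zero order and
		 * thus span multiple pages. Mask the pointer down to the
		 * start of the table, assuming the table is naturally
		 * aligned to its size, so that every PTE of one table maps
		 * to the same lock.
		 */
		pte = (pte_t *)((unsigned long)pte &
				~(PTRS_PER_PTE * sizeof(pte_t) - 1));
		return ptlock_ptr(virt_to_ptdesc(pte));
	}

	static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
						   struct mm_struct *mm,
						   pte_t *pte)
	{
		const unsigned long size = huge_page_size(h);

		/* Walking downwards avoids caring about folded levels. */
		if (size >= PUD_SIZE)
			return pud_lockptr(mm, (pud_t *)pte);
		if (size >= PMD_SIZE)
			return pmd_lockptr(mm, (pmd_t *)pte);
		return ptep_lockptr(mm, pte);
	}

If it turns out no arch really uses non-zero-order pte tables, the
masking in ptep_lockptr() would be the only part to drop again.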
Acked-by: Peter Xu <peterx@xxxxxxxxxx>

Thanks,

-- 
Peter Xu