huge_pte_offset() walks the page table to the find the pte associated with the huge page. the PMD page can be shared with another process and pmd unsharing is possible (so the PUD-ranged pgtable page can go away while the page table walk is under way. It can be done by a pmd unshare with a follow up munmap() on the other process). Protect against this race by taking i_mmap_rwsem while the page table walk is going on and till the pointer to the PMD is being used. The upstream kernel has a new lock [1] for fixing this issue and backporting the whole series is not trivial. This patch is my attempt at fixing this issue and I am sending this as an RFC to receive feedback if we can fix it using another method. Once I receive the feedback and we have a path forward, I will send that patch to all stable branches that have this issue. [1] https://lwn.net/Articles/908092/ Here is an example kernel panic due to the issue being fixed in this patch: Unable to handle kernel paging request at virtual address ffffffffc0000698 Mem abort info: ESR = 0x96000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000729ec000 [ffffffffc0000698] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] SMP Modules linked in: xt_tcpmss ip6table_filter tcp_diag [.....] CPU: 3 PID: 62456 Comm: postgres Not tainted 5.10.184-175.731.amzn2.aarch64 #1 Hardware name: Amazon EC2 caspianr1g.16xlarge/, BIOS 1.0 11/1/2018 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--) pc : huge_pte_offset+0x88/0x118 lr : hugetlb_fault+0x60/0x5f0 sp : ffff80001c6d3d00 x29: ffff80001c6d3d00 x28: ffff0003cdfa0000 x27: 0000000000000000 x26: ffff0003ce90d6a8 x25: 0000000000000007 x24: 00000a60da660000 x23: ffff0003ce90d640 x22: ffff800012256ed8 x21: 00000a60da600000 x20: ffff00040388d130 x19: ffff00040388d130 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800010345880 x8 : 0000000000000000 x7 : 0000000040000000 x6 : 0000000040000000 x5 : 0000000000000183 x4 : 00000000000000d3 x3 : ffffffffc0000000 x2 : 0000000000200000 x1 : 00000a60da600000 x0 : ffffffffc0000698 Call trace: huge_pte_offset+0x88/0x118 handle_mm_fault+0x1b0/0x240 do_page_fault+0x150/0x420 do_translation_fault+0xb8/0xf4 do_mem_abort+0x48/0xa8 el0_da+0x44/0x80 el0_sync_handler+0xe0/0x120 Code: eb00005f 54000380 d3557424 8b040c60 (f8647863) ---[ end trace 6cffaf3375de3ad9 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x0804800e,7a00a238 Memory Limit: 2048 MB ---[ end Kernel panic - not syncing: Oops: Fatal exception ]--- Signed-off-by: Puranjay Mohan <pjy@xxxxxxxxxx> --- mm/hugetlb.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 02b7c8f9b0e87..a991b62afac4e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4545,7 +4545,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, struct address_space *mapping; int need_wait_lock = 0; unsigned long haddr = address & huge_page_mask(h); + mapping = vma->vm_file->f_mapping; + i_mmap_lock_read(mapping); ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); if (ptep) { /* @@ -4556,10 +4558,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, entry = huge_ptep_get(ptep); if (unlikely(is_hugetlb_entry_migration(entry))) { migration_entry_wait_huge(vma, mm, ptep); + i_mmap_unlock_read(mapping); return 0; - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) + } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) { + i_mmap_unlock_read(mapping); return VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); + } } /* @@ -4573,8 +4578,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, * is OK, as huge_pte_alloc will return the same value unless * something has changed. */ - mapping = vma->vm_file->f_mapping; - i_mmap_lock_read(mapping); ptep = huge_pte_alloc(mm, haddr, huge_page_size(h)); if (!ptep) { i_mmap_unlock_read(mapping); -- 2.40.1