[RFC PATCH 5.10.y] mm: hugetlb: call huge_pte_offset with i_mmap_rwsem held

Puranjay Mohan <pjy@xxxxxxxxxx> · Tue, 12 Nov 2024 14:16:01 +0000

huge_pte_offset() walks the page table to the find the pte associated
with the huge page. the PMD page can be shared with another process and
pmd unsharing is possible (so the PUD-ranged pgtable page can go away
while the page table walk is under way. It can be done by a pmd unshare
with a follow up munmap() on the other process).

Protect against this race by taking i_mmap_rwsem while the page table
walk is going on and till the pointer to the PMD is being used.

The upstream kernel has a new lock [1] for fixing this issue and
backporting the whole series is not trivial. This patch is my attempt at
fixing this issue and I am sending this as an RFC to receive feedback if we
can fix it using another method.

Once I receive the feedback and we have a path forward, I will send that
patch to all stable branches that have this issue.

[1] https://lwn.net/Articles/908092/

Here is an example kernel panic due to the issue being fixed in this
patch:

 Unable to handle kernel paging request at virtual address ffffffffc0000698
 Mem abort info:
 ESR = 0x96000004
 EC = 0x25: DABT (current EL), IL = 32 bits
 SET = 0, FnV = 0
 EA = 0, S1PTW = 0
 Data abort info:
 ISV = 0, ISS = 0x00000004
 CM = 0, WnR = 0
 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000729ec000
 [ffffffffc0000698] pgd=0000000000000000, p4d=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 Modules linked in: xt_tcpmss ip6table_filter tcp_diag [.....]
 CPU: 3 PID: 62456 Comm: postgres Not tainted 5.10.184-175.731.amzn2.aarch64 #1
 Hardware name: Amazon EC2 caspianr1g.16xlarge/, BIOS 1.0 11/1/2018
 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
 pc : huge_pte_offset+0x88/0x118
 lr : hugetlb_fault+0x60/0x5f0
 sp : ffff80001c6d3d00
 x29: ffff80001c6d3d00 x28: ffff0003cdfa0000
 x27: 0000000000000000 x26: ffff0003ce90d6a8
 x25: 0000000000000007 x24: 00000a60da660000
 x23: ffff0003ce90d640 x22: ffff800012256ed8
 x21: 00000a60da600000 x20: ffff00040388d130
 x19: ffff00040388d130 x18: 0000000000000000
 x17: 0000000000000000 x16: 0000000000000000
 x15: 0000000000000000 x14: 0000000000000000
 x13: 0000000000000000 x12: 0000000000000000
 x11: 0000000000000000 x10: 0000000000000000
 x9 : ffff800010345880 x8 : 0000000000000000
 x7 : 0000000040000000 x6 : 0000000040000000
 x5 : 0000000000000183 x4 : 00000000000000d3
 x3 : ffffffffc0000000 x2 : 0000000000200000
 x1 : 00000a60da600000 x0 : ffffffffc0000698
 Call trace:
 huge_pte_offset+0x88/0x118
 handle_mm_fault+0x1b0/0x240
 do_page_fault+0x150/0x420
 do_translation_fault+0xb8/0xf4
 do_mem_abort+0x48/0xa8
 el0_da+0x44/0x80
 el0_sync_handler+0xe0/0x120
 Code: eb00005f 54000380 d3557424 8b040c60 (f8647863)
 ---[ end trace 6cffaf3375de3ad9 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: disabled
 CPU features: 0x0804800e,7a00a238
 Memory Limit: 2048 MB
 ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

Signed-off-by: Puranjay Mohan <pjy@xxxxxxxxxx>
---
 mm/hugetlb.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 02b7c8f9b0e87..a991b62afac4e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4545,7 +4545,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct address_space *mapping;
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
+	mapping = vma->vm_file->f_mapping;
 
+	i_mmap_lock_read(mapping);
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
 		/*
@@ -4556,10 +4558,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
+			i_mmap_unlock_read(mapping);
 			return 0;
-		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
+			i_mmap_unlock_read(mapping);
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
+		}
 	}
 
 	/*
@@ -4573,8 +4578,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * is OK, as huge_pte_alloc will return the same value unless
 	 * something has changed.
 	 */
-	mapping = vma->vm_file->f_mapping;
-	i_mmap_lock_read(mapping);
 	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
 	if (!ptep) {
 		i_mmap_unlock_read(mapping);
-- 
2.40.1