After enabling khugepaged to handle VMAs of any size, the process may
fault on a VMA other than the one under collapse while both VMAs span
the same PTE table. The fault handler would then install a new PTE
table after khugepaged has isolated the old one. Therefore, scan the
PTE table's address range, retrieve all VMAs mapping it, and
write-lock them.

Note that rmap can still reach the PTE table from folios not under
collapse; this is fine, since it interferes neither with the PTEs
under collapse nor with the folios under collapse, and rmap cannot
fill the PMD.

Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
---
 mm/khugepaged.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 048f990d8507..e1c2c5b89f6d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1139,6 +1139,23 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 	return SCAN_SUCCEED;
 }
 
+static void take_vma_locks_per_pte(struct mm_struct *mm, unsigned long haddress)
+{
+	struct vm_area_struct *vma;
+	unsigned long start = haddress;
+	unsigned long end = haddress + HPAGE_PMD_SIZE;
+
+	while (start < end) {
+		vma = vma_lookup(mm, start);
+		if (!vma) {
+			start += PAGE_SIZE;
+			continue;
+		}
+		vma_start_write(vma);
+		start = vma->vm_end;
+	}
+}
+
 static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
 		struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
 		struct folio *folio)
@@ -1270,7 +1287,9 @@ static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-	vma_start_write(vma);
+	/* Faulting may fill the PMD after flush; lock all VMAs mapping this PTE table */
+	take_vma_locks_per_pte(mm, haddress);
+
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, haddress,
-- 
2.30.2
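
For illustration, here is a minimal userspace sketch of the walk that
take_vma_locks_per_pte() performs. Everything in it (the vma struct,
vma_lookup(), vma_start_write(), and both constants) is a stub I am
introducing to mimic the kernel interfaces, not kernel code:

/*
 * Userspace illustration only -- not kernel code. Shows how the walk
 * visits every VMA overlapping one PMD-sized (here 2 MiB) range,
 * skipping unmapped holes one page at a time.
 */
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define HPAGE_PMD_SIZE	(512 * PAGE_SIZE)	/* 2 MiB */

struct vma { unsigned long vm_start, vm_end; };

/* Two VMAs inside one PMD range, with a one-page hole between them. */
static struct vma vmas[] = {
	{ 0x200000, 0x300000 },
	{ 0x301000, 0x400000 },
};

/* Stub for the kernel's vma_lookup(): VMA containing addr, or NULL. */
static struct vma *vma_lookup(unsigned long addr)
{
	for (unsigned i = 0; i < sizeof(vmas) / sizeof(vmas[0]); i++)
		if (addr >= vmas[i].vm_start && addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}

/* Stub for vma_start_write(): just report which VMA would be locked. */
static void vma_start_write(struct vma *vma)
{
	printf("write-lock VMA [%#lx, %#lx)\n", vma->vm_start, vma->vm_end);
}

int main(void)
{
	unsigned long start = 0x200000;		/* haddress */
	unsigned long end = start + HPAGE_PMD_SIZE;

	while (start < end) {
		struct vma *vma = vma_lookup(start);

		if (!vma) {			/* hole: try the next page */
			start += PAGE_SIZE;
			continue;
		}
		vma_start_write(vma);
		start = vma->vm_end;		/* jump past this VMA */
	}
	return 0;
}

With haddress = 0x200000 the loop write-locks both VMAs, stepping page
by page over the one-page hole at 0x300000. This is the property the
patch relies on: every VMA that can map the PTE table is write-locked
before the table is isolated.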