The patch titled
     Subject: mm/khugepaged: allow pte_offset_map[_lock]() to fail
has been added to the -mm mm-unstable branch.  Its filename is
     mm-khugepaged-allow-pte_offset_map-to-fail.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-khugepaged-allow-pte_offset_map-to-fail.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Hugh Dickins <hughd@xxxxxxxxxx>
Subject: mm/khugepaged: allow pte_offset_map[_lock]() to fail
Date: Thu, 8 Jun 2023 18:42:40 -0700 (PDT)

__collapse_huge_page_swapin(): don't drop the map after every pte, it only
has to be dropped by do_swap_page(); give up if pte_offset_map() fails;
trace_mm_collapse_huge_page_swapin() at the end, with result; fix comment
on returned result; fix vmf.pgoff, though it's not used.

collapse_huge_page(): use pte_offset_map_lock() on the _pmd returned from
clearing; allow failure, but it should be impossible there.
hpage_collapse_scan_pmd() and collapse_pte_mapped_thp() allow for
pte_offset_map_lock() failure.

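For readers unfamiliar with the pattern, the control flow described above for __collapse_huge_page_swapin() can be sketched in userspace C. This is only an illustration under stated assumptions, not kernel code: every mock_* name, struct mock_table and the cut-down enum are hypothetical stand-ins, and the sketch uses an explicit if/else where the patch itself writes "if (!pte++)".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are kernel APIs. */
enum scan_result { SCAN_SUCCEED, SCAN_PMD_NULL };

#define NR_ENTRIES 8

struct mock_table {
	int entries[NR_ENTRIES];	/* nonzero models a swap entry */
	int present;			/* 0 models a page table that went away */
	int map_calls;			/* successful mappings of the table */
	int unmap_calls;		/* times the mapping was dropped */
	int traced;			/* result seen by the single trace point */
};

/* Models pte_offset_map(): may fail and return NULL. */
static int *mock_map(struct mock_table *t, int idx)
{
	if (!t->present)
		return NULL;
	t->map_calls++;
	return &t->entries[idx];
}

static void mock_unmap(struct mock_table *t)
{
	t->unmap_calls++;
}

/* Models do_swap_page(): handles the entry and drops the mapping itself. */
static void mock_swap_in(struct mock_table *t, int *pte)
{
	*pte = 0;
	mock_unmap(t);
}

static enum scan_result mock_swapin_all(struct mock_table *t, int *swapped_in)
{
	enum scan_result result;
	int *pte = NULL;
	int idx;

	*swapped_in = 0;
	for (idx = 0; idx < NR_ENTRIES; idx++) {
		if (pte) {
			pte++;		/* mapping still held: step to next entry */
		} else {
			pte = mock_map(t, idx);
			if (!pte) {	/* give up if the map fails */
				result = SCAN_PMD_NULL;
				goto out;
			}
		}

		if (!*pte)
			continue;	/* not a swap entry: keep the mapping */

		mock_swap_in(t, pte);
		pte = NULL;		/* the swap-in helper unmapped it */
		(*swapped_in)++;
	}

	if (pte)
		mock_unmap(t);
	result = SCAN_SUCCEED;
out:
	t->traced = result;		/* one trace point, carrying the result */
	return result;
}
```

In this sketch the table is mapped once per run of non-swap entries rather than once per entry, and every exit passes through the single "out" label.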
Link: https://lkml.kernel.org/r/6513e85-d798-34ec-3762-7c24ffb9329@xxxxxxxxxx
Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
Reviewed-by: Yang Shi <shy828301@xxxxxxxxx>
Cc: Alistair Popple <apopple@xxxxxxxxxx>
Cc: Anshuman Khandual <anshuman.khandual@xxxxxxx>
Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
Cc: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
Cc: Ira Weiny <ira.weiny@xxxxxxxxx>
Cc: Jason Gunthorpe <jgg@xxxxxxxx>
Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Lorenzo Stoakes <lstoakes@xxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Miaohe Lin <linmiaohe@xxxxxxxxxx>
Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
Cc: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
Cc: Peter Xu <peterx@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
Cc: Ralph Campbell <rcampbell@xxxxxxxxxx>
Cc: Ryan Roberts <ryan.roberts@xxxxxxx>
Cc: SeongJae Park <sj@xxxxxxxxxx>
Cc: Song Liu <song@xxxxxxxxxx>
Cc: Steven Price <steven.price@xxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
Cc: Zack Rusin <zackr@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/khugepaged.c |   72 +++++++++++++++++++++++++++++++---------------
 1 file changed, 49 insertions(+), 23 deletions(-)

--- a/mm/khugepaged.c~mm-khugepaged-allow-pte_offset_map-to-fail
+++ a/mm/khugepaged.c
@@ -991,9 +991,8 @@ static int check_pmd_still_valid(struct
  * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
- * Note that if false is returned, mmap_lock will be released.
+ * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
  */
-
 static int __collapse_huge_page_swapin(struct mm_struct *mm,
 				       struct vm_area_struct *vma,
 				       unsigned long haddr, pmd_t *pmd,
@@ -1002,23 +1001,35 @@ static int __collapse_huge_page_swapin(s
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
 	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+	int result;
+	pte_t *pte = NULL;
 
 	for (address = haddr; address < end; address += PAGE_SIZE) {
 		struct vm_fault vmf = {
 			.vma = vma,
 			.address = address,
-			.pgoff = linear_page_index(vma, haddr),
+			.pgoff = linear_page_index(vma, address),
 			.flags = FAULT_FLAG_ALLOW_RETRY,
 			.pmd = pmd,
 		};
 
-		vmf.pte = pte_offset_map(pmd, address);
-		vmf.orig_pte = *vmf.pte;
-		if (!is_swap_pte(vmf.orig_pte)) {
-			pte_unmap(vmf.pte);
-			continue;
+		if (!pte++) {
+			pte = pte_offset_map(pmd, address);
+			if (!pte) {
+				mmap_read_unlock(mm);
+				result = SCAN_PMD_NULL;
+				goto out;
+			}
 		}
+
+		vmf.orig_pte = *pte;
+		if (!is_swap_pte(vmf.orig_pte))
+			continue;
+
+		vmf.pte = pte;
 		ret = do_swap_page(&vmf);
+		/* Which unmaps pte (after perhaps re-checking the entry) */
+		pte = NULL;
 
 		/*
 		 * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
@@ -1027,24 +1038,29 @@ static int __collapse_huge_page_swapin(s
 		 * resulting in later failure.
 		 */
 		if (ret & VM_FAULT_RETRY) {
-			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
 			/* Likely, but not guaranteed, that page lock failed */
-			return SCAN_PAGE_LOCK;
+			result = SCAN_PAGE_LOCK;
+			goto out;
 		}
 		if (ret & VM_FAULT_ERROR) {
 			mmap_read_unlock(mm);
-			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
-			return SCAN_FAIL;
+			result = SCAN_FAIL;
+			goto out;
 		}
 		swapped_in++;
 	}
 
+	if (pte)
+		pte_unmap(pte);
+
 	/* Drain LRU add pagevec to remove extra pin on the swapped in pages */
 	if (swapped_in)
 		lru_add_drain();
 
-	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
-	return SCAN_SUCCEED;
+	result = SCAN_SUCCEED;
+out:
+	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+	return result;
 }
 
 static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
@@ -1144,9 +1160,6 @@ static int collapse_huge_page(struct mm_
 				address + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 
-	pte = pte_offset_map(pmd, address);
-	pte_ptl = pte_lockptr(mm, pmd);
-
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * This removes any huge TLB entry from the CPU so we won't allow
@@ -1161,13 +1174,18 @@ static int collapse_huge_page(struct mm_
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	spin_lock(pte_ptl);
-	result = __collapse_huge_page_isolate(vma, address, pte, cc,
-					      &compound_pagelist);
-	spin_unlock(pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	if (pte) {
+		result = __collapse_huge_page_isolate(vma, address, pte, cc,
+						      &compound_pagelist);
+		spin_unlock(pte_ptl);
+	} else {
+		result = SCAN_PMD_NULL;
+	}
 
 	if (unlikely(result != SCAN_SUCCEED)) {
-		pte_unmap(pte);
+		if (pte)
+			pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
@@ -1251,6 +1269,11 @@ static int hpage_collapse_scan_pmd(struc
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte) {
+		result = SCAN_PMD_NULL;
+		goto out;
+	}
+
 	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
@@ -1620,8 +1643,10 @@ int collapse_pte_mapped_thp(struct mm_st
 	 * lockless_pages_from_mm() and the hardware page walker can access page
 	 * tables while all the high-level locks are held in write mode.
 	 */
-	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
 	result = SCAN_FAIL;
+	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+	if (!start_pte)
+		goto drop_immap;
 
 	/* step 1: check all mapped PTEs are to the right huge page */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1695,6 +1720,7 @@ drop_hpage:
 
 abort:
 	pte_unmap_unlock(start_pte, ptl);
+drop_immap:
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	goto drop_hpage;
 }
_

Patches currently in -mm which might be from hughd@xxxxxxxxxx are

arm-allow-pte_offset_map-to-fail.patch
arm64-allow-pte_offset_map-to-fail.patch
arm64-hugetlb-pte_alloc_huge-pte_offset_huge.patch
ia64-hugetlb-pte_alloc_huge-pte_offset_huge.patch
m68k-allow-pte_offset_map-to-fail.patch
microblaze-allow-pte_offset_map-to-fail.patch
mips-update_mmu_cache-can-replace-__update_tlb.patch
mips-update_mmu_cache-can-replace-__update_tlb-fix.patch
parisc-add-pte_unmap-to-balance-get_ptep.patch
parisc-unmap_uncached_pte-use-pte_offset_kernel.patch
parisc-hugetlb-pte_alloc_huge-pte_offset_huge.patch
powerpc-kvmppc_unmap_free_pmd-pte_offset_kernel.patch
powerpc-allow-pte_offset_map-to-fail.patch
powerpc-hugetlb-pte_alloc_huge.patch
riscv-hugetlb-pte_alloc_huge-pte_offset_huge.patch
s390-allow-pte_offset_map_lock-to-fail.patch
s390-gmap-use-pte_unmap_unlock-not-spin_unlock.patch
sh-hugetlb-pte_alloc_huge-pte_offset_huge.patch
sparc-hugetlb-pte_alloc_huge-pte_offset_huge.patch
sparc-allow-pte_offset_map-to-fail.patch
sparc-iounit-and-iommu-use-pte_offset_kernel.patch
x86-allow-get_locked_pte-to-fail.patch
x86-sme_populate_pgd-use-pte_offset_kernel.patch
xtensa-add-pte_unmap-to-balance-pte_offset_map.patch
mm-use-pmdp_get_lockless-without-surplus-barrier.patch
mm-migrate-remove-cruft-from-migration_entry_waits.patch
mm-pgtable-kmap_local_page-instead-of-kmap_atomic.patch
mm-pgtable-allow-pte_offset_map-to-fail.patch
mm-filemap-allow-pte_offset_map_lock-to-fail.patch
mm-page_vma_mapped-delete-bogosity-in-page_vma_mapped_walk.patch
mm-page_vma_mapped-reformat-map_pte-with-less-indentation.patch
mm-page_vma_mapped-pte_offset_map_nolock-not-pte_lockptr.patch
mm-pagewalkers-action_again-if-pte_offset_map_lock-fails.patch
mm-pagewalk-walk_pte_range-allow-for-pte_offset_map.patch
mm-vmwgfx-simplify-pmd-pud-mapping-dirty-helpers.patch
mm-vmalloc-vmalloc_to_page-use-pte_offset_kernel.patch
mm-hmm-retry-if-pte_offset_map-fails.patch
mm-userfaultfd-retry-if-pte_offset_map-fails.patch
mm-userfaultfd-allow-pte_offset_map_lock-to-fail.patch
mm-debug_vm_pgtablepage_table_check-warn-pte-map-fails.patch
mm-various-give-up-if-pte_offset_map-fails.patch
mm-mprotect-delete-pmd_none_or_clear_bad_unless_trans_huge.patch
mm-mremap-retry-if-either-pte_offset_map_lock-fails.patch
mm-madvise-clean-up-pte_offset_map_lock-scans.patch
mm-madvise-clean-up-force_shm_swapin_readahead.patch
mm-swapoff-allow-pte_offset_map-to-fail.patch
mm-mglru-allow-pte_offset_map_nolock-to-fail.patch
mm-migrate_device-allow-pte_offset_map_lock-to-fail.patch
mm-gup-remove-foll_split_pmd-use-of-pmd_trans_unstable.patch
mm-huge_memory-split-huge-pmd-under-one-pte_offset_map.patch
mm-khugepaged-allow-pte_offset_map-to-fail.patch
mm-memory-allow-pte_offset_map-to-fail.patch
mm-memory-handle_pte_fault-use-pte_offset_map_nolock.patch
mm-pgtable-delete-pmd_trans_unstable-and-friends.patch
mm-swap-swap_vma_readahead-do-the-pte_offset_map.patch
perf-core-allow-pte_offset_map-to-fail.patch