The patch titled
     Subject: mm/hugetlb: document huge_pte_offset usage
has been added to the -mm mm-unstable branch.  Its filename is
     mm-hugetlb-document-huge_pte_offset-usage.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-hugetlb-document-huge_pte_offset-usage.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Peter Xu <peterx@xxxxxxxxxx>
Subject: mm/hugetlb: document huge_pte_offset usage
Date: Tue, 29 Nov 2022 14:35:19 -0500

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk a generic pgtable as long as we're with
the mmap lock held for either read or write, because that guarantees the
pgtable pages will always be valid during the process.

But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
pgtable freed by pmd unsharing, it means that even with mmap lock held for
current mm, the PMD pgtable page can still go away from under us if pmd
unsharing is possible during the walk.

So we have two ways to make it safe even for a shared mapping:

(1) If we're with the hugetlb vma lock held for either read/write, it's
    okay because pmd unshare cannot happen at all.

(2) If we're with the i_mmap_rwsem lock held for either read/write, it's
    okay because even if pmd unshare can happen, the pgtable page cannot
    be freed from under us.

Document it.

Link: https://lkml.kernel.org/r/20221129193526.3588187-4-peterx@xxxxxxxxxx
Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: James Houghton <jthoughton@xxxxxxxxxx>
Cc: Jann Horn <jannh@xxxxxxxxxx>
Cc: Miaohe Lin <linmiaohe@xxxxxxxxxx>
Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
Cc: Nadav Amit <nadav.amit@xxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 include/linux/hugetlb.h |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

--- a/include/linux/hugetlb.h~mm-hugetlb-document-huge_pte_offset-usage
+++ a/include/linux/hugetlb.h
@@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * Since this function will walk all the pgtable pages (including not only
+ * high-level pgtable page, but also PUD entry that can be unshared
+ * concurrently for VM_SHARED), the caller of this function should be
+ * responsible of its thread safety.  One can follow this rule:
+ *
+ *  (1) For private mappings: pmd unsharing is not possible, so it'll
+ *      always be safe if we're with the mmap sem for either read or write.
+ *      This is normally always the case, IOW we don't need to do anything
+ *      special.
+ *
+ *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ *      pgtable page can go away from under us!  It can be done by a pmd
+ *      unshare with a follow up munmap() on the other process), then we
+ *      need either:
+ *
+ *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ *           won't happen upon the range (it also makes sure the pte_t we
+ *           read is the right and stable one), or,
+ *
+ *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
+ *           sure even if unshare happened the racy unmap() will wait until
+ *           i_mmap_rwsem is released.
+ *
+ * Option (2.1) is the safest, which guarantees pte stability from pmd
+ * sharing pov, until the vma lock released.  Option (2.2) doesn't protect
+ * a concurrent pmd unshare, but it makes sure the pgtable page is safe to
+ * access.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
_

Patches currently in -mm which might be from peterx@xxxxxxxxxx are

mm-migrate-fix-read-only-page-got-writable-when-recover-pte.patch
mm-always-compile-in-pte-markers.patch
mm-use-pte-markers-for-swap-errors.patch
mm-uffd-sanity-check-write-bit-for-uffd-wp-protected-ptes.patch
selftests-vm-use-memfd-for-hugepage-mmap-test.patch
mm-thp-re-apply-mkdirty-for-small-pages-after-split.patch
mm-hugetlb-let-vma_offset_start-to-return-start.patch
mm-hugetlb-dont-wait-for-migration-entry-during-follow-page.patch
mm-hugetlb-document-huge_pte_offset-usage.patch
mm-hugetlb-move-swap-entry-handling-into-vma-lock-when-faulted.patch
mm-hugetlb-make-userfaultfd_huge_must_wait-safe-to-pmd-unshare.patch
mm-hugetlb-make-hugetlb_follow_page_mask-safe-to-pmd-unshare.patch
mm-hugetlb-make-follow_hugetlb_page-safe-to-pmd-unshare.patch
mm-hugetlb-make-walk_hugetlb_range-safe-to-pmd-unshare.patch
mm-hugetlb-make-page_vma_mapped_walk-safe-to-pmd-unshare.patch
mm-hugetlb-introduce-hugetlb_walk.patch