On 11/29/22 14:35, Peter Xu wrote:
> In hugetlb_fault(), there used to be a special path to handle swap entries
> at the entrance using huge_pte_offset().  That's unsafe because
> huge_pte_offset() for a pmd sharable range can access freed pgtables
> without any lock to protect the pgtable from being freed after pmd unshare.

Thanks.  That special path has been there for a long time.  I should have
given it more thought when adding the lock.  However, I was only considering
the 'stale' pte case, as opposed to a freed one.

> Here the simplest solution to make it safe is to move the swap handling
> until after the vma lock is held.  We may need to take the fault mutex on
> either migration or hwpoison entries now (also the vma lock, but that's
> really needed); however, neither of them is a hot path.
>
> Note that the vma lock cannot be released in hugetlb_fault() when the
> migration entry is detected, because in migration_entry_wait_huge() the
> pgtable page will be used again (by taking the pgtable lock), so that also
> needs to be protected by the vma lock.  Modify migration_entry_wait_huge()
> so that it must be called with the vma read lock held, and properly release
> the lock in __migration_entry_wait_huge().
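To spell out the race being closed here (the interleaving below is my own
sketch, not taken from the patch):

	CPU 0: hugetlb_fault()                  CPU 1
	ptep = huge_pte_offset(mm, haddr, sz);
	                                        huge_pmd_unshare()
	                                        /* shared pgtable page freed */
	entry = huge_ptep_get(ptep);
	/* reads a freed pgtable page */

Holding the vma read lock across the huge_ptep_get() prevents
huge_pmd_unshare() from freeing the page table underneath us.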
>
> Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> ---
>  include/linux/swapops.h |  6 ++++--
>  mm/hugetlb.c            | 32 +++++++++++++++----------------
>  mm/migrate.c            | 25 +++++++++++++++++++++----
>  3 files changed, 40 insertions(+), 23 deletions(-)

Looks good with one small change noted below,

Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>

>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 27ade4f22abb..09b22b169a71 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -335,7 +335,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  					unsigned long address);
>  #ifdef CONFIG_HUGETLB_PAGE
> -extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
> +extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
> +					pte_t *ptep, spinlock_t *ptl);
>  extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
>  #endif /* CONFIG_HUGETLB_PAGE */
>  #else  /* CONFIG_MIGRATION */
> @@ -364,7 +365,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  					unsigned long address) { }
>  #ifdef CONFIG_HUGETLB_PAGE
> -static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
> +static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
> +					       pte_t *ptep, spinlock_t *ptl) { }
>  static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
>  #endif /* CONFIG_HUGETLB_PAGE */
>  static inline int is_writable_migration_entry(swp_entry_t entry)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dfe677fadaf8..776e34ccf029 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5826,22 +5826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int need_wait_lock = 0;
>  	unsigned long haddr = address & huge_page_mask(h);
>
> -	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> -	if (ptep) {
> -		/*
> -		 * Since we hold no locks, ptep could be stale.  That is
> -		 * OK as we are only making decisions based on content and
> -		 * not actually modifying content here.
> -		 */
> -		entry = huge_ptep_get(ptep);
> -		if (unlikely(is_hugetlb_entry_migration(entry))) {
> -			migration_entry_wait_huge(vma, ptep);
> -			return 0;
> -		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> -			return VM_FAULT_HWPOISON_LARGE |
> -				VM_FAULT_SET_HINDEX(hstate_index(h));
> -	}
> -

Before acquiring the vma_lock, there is this comment:

	/*
	 * Acquire vma lock before calling huge_pte_alloc and hold
	 * until finished with ptep.  This prevents huge_pmd_unshare from
	 * being called elsewhere and making the ptep no longer valid.
	 *
	 * ptep could have already be assigned via hugetlb_walk().  That
	 * is OK, as huge_pte_alloc will return the same value unless
	 * something has changed.
	 */

The second sentence in that comment, about ptep already being assigned, no
longer applies and can be deleted.
--
Mike Kravetz