huge_pmd_share() is normally called with the vma lock already held, which
makes it tempting to assume the walker lock is unnecessary.  That is not
true: we only took the vma lock for the "current" vma, not for all the
vmas whose pgtables we are about to walk.  Taking each of those vma locks
could deadlock, and the lock ordering would be hard to define.  The safe
approach is to take the walker lock, which guarantees that the pgtable
page stays alive, and then use get_page_unless_zero() rather than a plain
get_page(), to make sure the pgtable page is not freed in the meantime.

Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
 mm/hugetlb.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61a1fa678172..5ef883184885 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7008,6 +7008,13 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 
 	i_mmap_lock_read(mapping);
+
+	/*
+	 * NOTE: even if we've got the vma read lock, here we still need to
+	 * take the walker lock, because we're not walking the current vma,
+	 * but some other mm's!
+	 */
+	hugetlb_walker_lock();
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -7016,12 +7023,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (saddr) {
 			spte = huge_pte_offset(svma->vm_mm, saddr,
 					       vma_mmu_pagesize(svma));
-			if (spte) {
-				get_page(virt_to_page(spte));
+			/*
+			 * When page ref==0, it means it's probably being
+			 * freed; continue with the next one.
+			 */
+			if (spte && get_page_unless_zero(virt_to_page(spte)))
 				break;
-			}
 		}
 	}
+	hugetlb_walker_unlock();
 
 	if (!spte)
 		goto out;
-- 
2.37.3
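
For reference, below is one plausible shape for the walker lock this patch
depends on.  This is a hypothetical sketch only: hugetlb_walker_lock() /
hugetlb_walker_unlock() are introduced elsewhere in this series, and the
sketch assumes an RCU-based scheme in which pgtable pages are only fully
freed after an RCU grace period:

	#include <linux/rcupdate.h>

	/*
	 * Hypothetical sketch: treat the walker lock as an RCU read-side
	 * section.  If pgtable pages are freed via RCU, any page observed
	 * inside this section cannot be fully freed until we are done,
	 * although its refcount may already have dropped to zero.
	 */
	static inline void hugetlb_walker_lock(void)
	{
		rcu_read_lock();
	}

	static inline void hugetlb_walker_unlock(void)
	{
		rcu_read_unlock();
	}

Under such a scheme the walker lock keeps the pgtable page from being
fully freed while it is inspected, but its refcount can still have
dropped to zero in the meantime, which is exactly why the loop in the
patch must use get_page_unless_zero() instead of get_page().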