On 06/12/23 12:04, Peter Xu wrote: > Huge pmd sharing operates on PUD not PMD, huge_pte_lock() is not suitable > in this case because it should only work for last level pte changes, while > pmd sharing is always one level higher. Right! That lock does not prevent someone else from concurrently modifying the PUD. > Meanwhile, here we're locking over the spte pgtable lock which is even not > a lock for current mm but someone else's. > > It seems even racy on operating on the lock, as after put_page() of the > spte pgtable page logically the page can be released, so at least the > spin_unlock() needs to be done after the put_page(). Agree. > No report I am aware, I'm not even sure whether it'll just work on taking > the spte pmd lock, because while we're holding i_mmap read lock it probably > means the vma interval tree is frozen, all pte allocators over this pud > entry could always find the specific svma and spte page, so maybe they'll > serialize on this spte page lock? It seems they would serialize IF they were trying to instantiate the same shared page. However, I suppose another thread could be trying to instantiate something totally different in the VA range represented by that PUD. In this case, it seems like there would be no synchronization. > Even so, doesn't seem to be expected. > It just seems to be an accident of cb900f412154. > > Fix it with the proper pud lock (which is the mm's page_table_lock). > > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > Cc: Naoya Horiguchi <naoya.horiguchi@xxxxxxx> > Fixes: cb900f412154 ("mm, hugetlb: convert hugetlbfs to use split pmd lock") > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > --- > mm/hugetlb.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) Agree with this change. But, it does make one wonder if the pud_clear() in huge_pmd_unshare is safe. Like here, we really should be holding the higher level lock but are holding the PMD lock. -- Mike Kravetz > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index dfa412d8cb30..270ec0ecd5a1 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -7133,7 +7133,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long saddr; > pte_t *spte = NULL; > pte_t *pte; > - spinlock_t *ptl; > > i_mmap_lock_read(mapping); > vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { > @@ -7154,7 +7153,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma, > if (!spte) > goto out; > > - ptl = huge_pte_lock(hstate_vma(vma), mm, spte); > + spin_lock(&mm->page_table_lock); > if (pud_none(*pud)) { > pud_populate(mm, pud, > (pmd_t *)((unsigned long)spte & PAGE_MASK)); > @@ -7162,7 +7161,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma, > } else { > put_page(virt_to_page(spte)); > } > - spin_unlock(ptl); > + spin_unlock(&mm->page_table_lock); > out: > pte = (pte_t *)pmd_alloc(mm, pud, addr); > i_mmap_unlock_read(mapping); > -- > 2.40.1 >