Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page

Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> · Thu, 25 Aug 2022 18:54:07 +0800

On 8/25/2022 3:25 PM, David Hildenbrand wrote:
Is the primary concern the locking?  If so, I am not sure we have an issue.
As mentioned in your commit message, current code will use
pte_offset_map_lock().  pte_offset_map_lock uses pte_lockptr, and pte_lockptr
will either be the mm wide lock or pmd_page lock.  To me, it seems that
either would provide correct synchronization for CONT-PTE entries.  Am I
missing something or misreading the code?

I started looking at code cleanup suggested by David.  Here is a quick
patch (not tested and likely containing errors) to see if this is a step
in the right direction.

I like it because we get rid of/combine all those follow_huge_p*d
routines.


Yes, see comments below.

 From 35d117a707c1567ddf350554298697d40eace0d7 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Date: Wed, 24 Aug 2022 15:59:15 -0700
Subject: [PATCH] hugetlb: call hugetlb_follow_page_mask for hugetlb pages in
  follow_page_mask

At the beginning of follow_page_mask, there currently is a call to
follow_huge_addr which 'may' handle hugetlb pages.  ia64 is the only
architecture which (incorrectly) provides a follow_huge_addr routine
that does not return error.  Instead, at each level of the page table a
check is made for a hugetlb entry.  If a hugetlb entry is found, a call
to a routine associated with that page table level such as
follow_huge_pmd is made.

All the follow_huge_p*d routines are basically the same.  In addition
huge page size can be derived from the vma, so we know where in the page
table a huge page would reside.  So, replace follow_huge_addr with a
new architecture independent routine which will provide the same
functionality as the follow_huge_p*d routines.  We can then eliminate
the p*d_huge checks in follow_page_mask page table walking as well as
the follow_huge_p*d routines themselves.>
follow_page_mask still has is_hugepd hugetlb checks during page table
walking.  This is due to these checks and follow_huge_pd being
architecture specific.  These can be eliminated if
hugetlb_follow_page_mask can be overwritten by architectures (powerpc)
that need to do follow_huge_pd processing.

But won't the

+	/* hugetlb is special */
+	if (is_vm_hugetlb_page(vma))
+		return hugetlb_follow_page_mask(vma, address, flags);

code route everything via hugetlb_follow_page_mask() and all these
(beloved) hugepd checks would essentially be unreachable?

At least my understanding is that hugepd only applies to hugetlb.

Can't we move the hugepd handling code into hugetlb_follow_page_mask()
as well?

I mean, doesn't follow_hugetlb_page() also have to handle that hugepd
stuff already ... ?

Yes, I also think about this, and I did a simple patch (without testing) 
based on Mike's patch to make it more clean.

diff --git a/mm/gup.c b/mm/gup.c
index d3239ea63159..1003c03dcf78 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -626,14 +626,7 @@ static struct page *follow_pmd_mask(struct 
vm_area_struct *vma,
        pmdval = READ_ONCE(*pmd);
        if (pmd_none(pmdval))
                return no_page_table(vma, flags);
-       if (is_hugepd(__hugepd(pmd_val(pmdval)))) {
-               page = follow_huge_pd(vma, address,
-                                     __hugepd(pmd_val(pmdval)), flags,
-                                     PMD_SHIFT);
-               if (page)
-                       return page;
-               return no_page_table(vma, flags);
-       }
+
 retry:
        if (!pmd_present(pmdval)) {
                /*
@@ -723,14 +716,6 @@ static struct page *follow_pud_mask(struct 
vm_area_struct *vma,
        pud = pud_offset(p4dp, address);
        if (pud_none(*pud))
                return no_page_table(vma, flags);
-       if (is_hugepd(__hugepd(pud_val(*pud)))) {
-               page = follow_huge_pd(vma, address,
-                                     __hugepd(pud_val(*pud)), flags,
-                                     PUD_SHIFT);
-               if (page)
-                       return page;
-               return no_page_table(vma, flags);
-       }
        if (pud_devmap(*pud)) {
                ptl = pud_lock(mm, pud);
                page = follow_devmap_pud(vma, address, pud, flags, 
&ctx->pgmap);
@@ -759,14 +744,6 @@ static struct page *follow_p4d_mask(struct 
vm_area_struct *vma,
        if (unlikely(p4d_bad(*p4d)))
                return no_page_table(vma, flags);

-       if (is_hugepd(__hugepd(p4d_val(*p4d)))) {
-               page = follow_huge_pd(vma, address,
-                                     __hugepd(p4d_val(*p4d)), flags,
-                                     P4D_SHIFT);
-               if (page)
-                       return page;
-               return no_page_table(vma, flags);
-       }
        return follow_pud_mask(vma, address, p4d, flags, ctx);
 }

@@ -813,15 +790,6 @@ static struct page *follow_page_mask(struct 
vm_area_struct *vma,
        if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
                return no_page_table(vma, flags);

-       if (is_hugepd(__hugepd(pgd_val(*pgd)))) {
-               page = follow_huge_pd(vma, address,
-                                     __hugepd(pgd_val(*pgd)), flags,
-                                     PGDIR_SHIFT);
-               if (page)
-                       return page;
-               return no_page_table(vma, flags);
-       }
-
        return follow_p4d_mask(vma, address, pgd, flags, ctx);
 }

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c107e7ebd66..848b4fb7a05d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6118,6 +6118,81 @@ static inline bool 
__follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
        return false;
 }

+static struct page *hugetlb_follow_hugepd(struct vm_area_struct *vma,
+                                         unsigned long address,
+                                         unsigned int flags)
+{
+       struct mm_struct *mm = vma->vm_mm;
+       struct page *page;
+       pgd_t *pgd;
+       p4d_t *p4d;
+       pud_t *pud;
+       pmd_t *pmd;
+
+       pgd = pgd_offset(mm, address);
+       if (pgd_none(*pgd) || pgd_bad(*pgd))
+               return ERR_PTR(-EFAULT);
+
+       if (pgd_huge(*pgd))
+               return NULL;
+
+       if (is_hugepd(__hugepd(pgd_val(*pgd)))) {
+               page = follow_huge_pd(vma, address,
+                                     __hugepd(pgd_val(*pgd)), flags,
+                                     PGDIR_SHIFT);
+               if (page)
+                       return page;
+               return ERR_PTR(-EFAULT);
+       }
+
+       p4d = p4d_offset(pgd, address);
+       if (p4d_none(*p4d) || p4d_bad(*p4d))
+               return ERR_PTR(-EFAULT);
+
+       if (is_hugepd(__hugepd(p4d_val(*p4d)))) {
+               page = follow_huge_pd(vma, address,
+                                     __hugepd(p4d_val(*p4d)), flags,
+                                     P4D_SHIFT);
+               if (page)
+                       return page;
+               return ERR_PTR(-EFAULT);
+       }
+
+       pud = pud_offset(p4d, address);
+       if (pud_none(*pud) || pud_bad(*pud))
+               return ERR_PTR(-EFAULT);
+
+       if (pud_huge(*pud))
+               return NULL;
+
+       if (is_hugepd(__hugepd(pud_val(*pud)))) {
+               page = follow_huge_pd(vma, address,
+                                     __hugepd(pud_val(*pud)), flags,
+                                     PUD_SHIFT);
+               if (page)
+                       return page;
+               return ERR_PTR(-EFAULT);
+       }
+
+       pmd = pmd_offset(pud, address);+       if (pmd_none(*pmd) || 
pmd_bad(*pmd))
+                return ERR_PTR(-EFAULT);
+
+       if (pmd_huge(*pmd))
+               return NULL;
+
+       if (is_hugepd(__hugepd(pmd_val(*pmd)))) {
+               page = follow_huge_pd(vma, address,
+                                     __hugepd(pmd_val(*pmd)), flags,
+                                     PMD_SHIFT);
+               if (page)
+                       return page;
+               return ERR_PTR(-EFAULT);
+       }
+
+       return NULL;
+}
+
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
                                unsigned long address, unsigned int flags)
 {
@@ -6135,6 +6210,10 @@ struct page *hugetlb_follow_page_mask(struct 
vm_area_struct *vma,
        if (WARN_ON_ONCE(flags & FOLL_PIN))
                return NULL;

+       page = hugetlb_follow_hugepd(vma, address, flags);
+       if (page)
+               return page;
+
        pte = huge_pte_offset(mm, haddr, huge_page_size(h));
        if (!pte)
                return NULL;


  
+struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
+				unsigned long address, unsigned int flags)
+{
+	struct hstate *h = hstate_vma(vma);
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & huge_page_mask(h);
+	struct page *page = NULL;
+	spinlock_t *ptl;
+	pte_t *pte, entry;
+
+	/*
+	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
+	 * follow_hugetlb_page().
+	 */
+	if (WARN_ON_ONCE(flags & FOLL_PIN))
+		return NULL;
+
+	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
+	if (!pte)
+		return NULL;
+
+retry:
+	ptl = huge_pte_lock(h, mm, pte);
+	entry = huge_ptep_get(pte);
+	if (pte_present(entry)) {
+		page = pte_page(entry);
+		/*
+		 * try_grab_page() should always succeed here, because we hold
+		 * the ptl lock and have verified pte_present().
+		 */
+		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+			page = NULL;
+			goto out;
+		}
+	} else {
+		if (is_hugetlb_entry_migration(entry)) {
+			spin_unlock(ptl);
+			__migration_entry_wait_huge(pte, ptl);
+			goto retry;
+		}
+		/*
+		 * hwpoisoned entry is treated as no_page_table in
+		 * follow_page_mask().
+		 */
+	}
+out:
+	spin_unlock(ptl);
+	return page;


This is neat and clean enough to not reuse follow_hugetlb_page(). I
wonder if we want to add some comment to the function how this differs
to follow_hugetlb_page().

... or do we maybe want to rename follow_hugetlb_page() to someting like
__hugetlb_get_user_pages() to make it clearer in which context it will
get called?

Sounds reasonable to me.

I guess it might be feasible in the future to eliminate
follow_hugetlb_page() and centralizing the faulting code. For now, this
certainly improves the situation.