On 10/21/22 16:36, James Houghton wrote:
> These functions are used to allocate new PTEs below the hstate PTE. This
> will be used by hugetlb_walk_step, which implements stepping forwards in
> a HugeTLB high-granularity page table walk.
>
> The reasons that we don't use the standard pmd_alloc/pte_alloc*
> functions are:
>  1) This prevents us from accidentally overwriting swap entries or
>     attempting to use swap entries as present non-leaf PTEs (see
>     pmd_alloc(); we assume that !pte_none means pte_present and
>     non-leaf).
>  2) Locking hugetlb PTEs can be different than regular PTEs. (Although,
>     as implemented right now, locking is the same.)
>  3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
>     HGM won't use HIGHPTE, but the kernel can still be built with it,
>     and other mm code will use it.
>
> When GENERAL_HUGETLB supports P4D-based hugepages, we will need to
> implement hugetlb_pud_alloc to implement hugetlb_walk_step.
>
> Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
> ---
>  include/linux/hugetlb.h |  5 +++
>  mm/hugetlb.c            | 94 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 99 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d30322108b34..003255b0e40f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -119,6 +119,11 @@ void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
>
>  bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
>
> +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr);
> +pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr);
> +
>  struct hugepage_subpool {
>  	spinlock_t lock;
>  	long count;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a0e46d35dabc..e3733388adee 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -341,6 +341,100 @@ static bool has_same_uncharge_info(struct file_region *rg,
>  #endif
>  }
>
> +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr)

A little confused as there are no users yet ...  Is hpte the 'hstate PTE'
that we are trying to allocate ptes under?  For example, in the case of a
hugetlb_pmd_alloc caller, hpte would be a PUD or CONT_PMD size pte?

> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +	pmd_t *new;
> +	pud_t *pudp;
> +	pud_t pud;
> +
> +	if (hpte->level != HUGETLB_LEVEL_PUD)
> +		return ERR_PTR(-EINVAL);

Ah yes, it is PUD level.  However, I guess CONT_PMD would also be valid
on arm64?

> +
> +	pudp = (pud_t *)hpte->ptep;
> +retry:
> +	pud = *pudp;

We might want to consider a READ_ONCE here.  I am not an expert on such
things, but recall a similar issue pointed out in the now-obsolete
commit 27ceae9833843.
--
Mike Kravetz

> +	if (likely(pud_present(pud)))
> +		return unlikely(pud_leaf(pud))
> +			? ERR_PTR(-EEXIST)
> +			: pmd_offset(pudp, addr);
> +	else if (!huge_pte_none(huge_ptep_get(hpte->ptep)))
> +		/*
> +		 * Not present and not none means that a swap entry lives here,
> +		 * and we can't get rid of it.
> +		 */
> +		return ERR_PTR(-EEXIST);
> +
> +	new = pmd_alloc_one(mm, addr);
> +	if (!new)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock(ptl);
> +	if (!pud_same(pud, *pudp)) {
> +		spin_unlock(ptl);
> +		pmd_free(mm, new);
> +		goto retry;
> +	}
> +
> +	mm_inc_nr_pmds(mm);
> +	smp_wmb(); /* See comment in pmd_install() */
> +	pud_populate(mm, pudp, new);
> +	spin_unlock(ptl);
> +	return pmd_offset(pudp, addr);
> +}
> +
> +pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> +		unsigned long addr)
> +{
> +	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +	pgtable_t new;
> +	pmd_t *pmdp;
> +	pmd_t pmd;
> +
> +	if (hpte->level != HUGETLB_LEVEL_PMD)
> +		return ERR_PTR(-EINVAL);
> +
> +	pmdp = (pmd_t *)hpte->ptep;
> +retry:
> +	pmd = *pmdp;
> +	if (likely(pmd_present(pmd)))
> +		return unlikely(pmd_leaf(pmd))
> +			? ERR_PTR(-EEXIST)
> +			: pte_offset_kernel(pmdp, addr);
> +	else if (!huge_pte_none(huge_ptep_get(hpte->ptep)))
> +		/*
> +		 * Not present and not none means that a swap entry lives here,
> +		 * and we can't get rid of it.
> +		 */
> +		return ERR_PTR(-EEXIST);
> +
> +	/*
> +	 * With CONFIG_HIGHPTE, calling `pte_alloc_one` directly may result
> +	 * in page tables being allocated in high memory, needing a kmap to
> +	 * access. Instead, we call __pte_alloc_one directly with
> +	 * GFP_PGTABLE_USER to prevent these PTEs being allocated in high
> +	 * memory.
> +	 */
> +	new = __pte_alloc_one(mm, GFP_PGTABLE_USER);
> +	if (!new)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock(ptl);
> +	if (!pmd_same(pmd, *pmdp)) {
> +		spin_unlock(ptl);
> +		pgtable_pte_page_dtor(new);
> +		__free_page(new);
> +		goto retry;
> +	}
> +
> +	mm_inc_nr_ptes(mm);
> +	smp_wmb(); /* See comment in pmd_install() */
> +	pmd_populate(mm, pmdp, new);
> +	spin_unlock(ptl);
> +	return pte_offset_kernel(pmdp, addr);
> +}
> +
>  static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
>  {
>  	struct file_region *nrg, *prg;
> --
> 2.38.0.135.g90850a2211-goog
>