On Tue, Dec 13, 2022 at 3:18 PM James Houghton <jthoughton@xxxxxxxxxx> wrote:
>
> On Tue, Dec 13, 2022 at 2:32 PM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> >
> > On 10/21/22 16:36, James Houghton wrote:
> > > These functions are used to allocate new PTEs below the hstate PTE. This
> > > will be used by hugetlb_walk_step, which implements stepping forwards in
> > > a HugeTLB high-granularity page table walk.
> > >
> > > The reasons that we don't use the standard pmd_alloc/pte_alloc*
> > > functions are:
> > >  1) This prevents us from accidentally overwriting swap entries or
> > >     attempting to use swap entries as present non-leaf PTEs (see
> > >     pmd_alloc(); we assume that !pte_none means pte_present and
> > >     non-leaf).
> > >  2) Locking hugetlb PTEs can be different than regular PTEs. (Although, as
> > >     implemented right now, locking is the same.)
> > >  3) We can maintain compatibility with CONFIG_HIGHPTE. That is, HugeTLB
> > >     HGM won't use HIGHPTE, but the kernel can still be built with it,
> > >     and other mm code will use it.
> > >
> > > When GENERAL_HUGETLB supports P4D-based hugepages, we will need to
> > > implement hugetlb_pud_alloc to implement hugetlb_walk_step.
> > >
> > > Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
> > > ---
> > >  include/linux/hugetlb.h |  5 +++
> > >  mm/hugetlb.c            | 94 +++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 99 insertions(+)
> > >
> > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > index d30322108b34..003255b0e40f 100644
> > > --- a/include/linux/hugetlb.h
> > > +++ b/include/linux/hugetlb.h
> > > @@ -119,6 +119,11 @@ void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > >
> > >  bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > >
> > > +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > > +		unsigned long addr);
> > > +pte_t *hugetlb_pte_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > > +		unsigned long addr);
> > > +
> > >  struct hugepage_subpool {
> > >  	spinlock_t lock;
> > >  	long count;
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index a0e46d35dabc..e3733388adee 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -341,6 +341,100 @@ static bool has_same_uncharge_info(struct file_region *rg,
> > >  #endif
> > >  }
> > >
> > > +pmd_t *hugetlb_pmd_alloc(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > > +		unsigned long addr)
> >
> > A little confused as there are no users yet ...  Is hpte the 'hstate PTE'
> > that we are trying to allocate ptes under?  For example, in the case of
> > a hugetlb_pmd_alloc caller hpte would be a PUD or CONT_PMD size pte?
>
> The hpte is the level above the level we're trying to allocate (not
> necessarily the 'hstate PTE'). I'll make that clear in the comments
> for both functions.
>
> So consider allocating 4K PTEs for a 1G HugeTLB page:
>  - With the hstate 'PTE' (PUD), we make a hugetlb_pte with that PUD
>    (let's call it 'hpte')
>  - We call hugetlb_pmd_alloc(hpte) which will leave 'hpte' the same,
>    but the pud_t that hpte->ptep points to is no longer a leaf.
>  - We call hugetlb_walk_step(hpte) to step down a level to get a PMD,
>    changing hpte. The hpte->ptep is now pointing to a blank pmd_t.
>  - We call hugetlb_pte_alloc(hpte) to allocate a bunch of PTEs and
>    populate the pmd_t.
>  - We call hugetlb_walk_step(hpte) to step down again.

Erm, actually this isn't entirely accurate. The general flow is about
right, but hugetlb_pmd_alloc/hugetlb_pte_alloc are actually part of
hugetlb_walk_step. (See hugetlb_hgm_walk for the ground truth :P)
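
To make the corrected flow concrete, here is a very rough caller-side
sketch. This is illustrative only -- hugetlb_hgm_walk in the next patch
is the real thing, and the helper name, the stopping condition, and the
exact hugetlb_walk_step() signature/return convention used below are
just shorthand for this example:

/*
 * Illustrative sketch: walk 'hpte' (initialized from the hstate-level
 * entry, e.g. the PUD for a 1G page) down to the requested shift.
 * hugetlb_walk_step() does any hugetlb_pmd_alloc()/hugetlb_pte_alloc()
 * internally when the next level doesn't exist yet.
 */
static int hgm_walk_example(struct mm_struct *mm, struct hugetlb_pte *hpte,
			    unsigned long addr, unsigned int target_shift)
{
	int ret;

	while (hpte->shift > target_shift) {
		/*
		 * Step down one level: hpte->shift and hpte->ptep are
		 * updated to refer to the entry covering 'addr' at the
		 * new, smaller level.
		 */
		ret = hugetlb_walk_step(mm, hpte, addr, 1UL << target_shift);
		if (ret)
			return ret;
	}
	return 0;
}

So for the 1G-to-4K example above, the caller initializes the hugetlb_pte
once and then just calls hugetlb_walk_step() repeatedly; the PMD/PTE table
allocation happens inside each step rather than as separate calls.
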
- James

>
> This is basically what hugetlb_hgm_walk does (in the next patch). We
> only change 'hpte' when we do a step, and that is when we populate
> 'shift'. The 'sz' parameter for hugetlb_walk_step is what
> architectures can use to populate hpte->shift appropriately (ignored
> for x86).
>
> For arm64, we can use 'sz' to populate hpte->shift with what the
> caller wants when we are free to choose (like if all the PTEs are
> none, we can do CONT_PTE_SHIFT). See [1]'s implementation of
> hugetlb_walk_step for what I *think* is correct for arm64.
>
> [1] https://github.com/48ca/linux/commit/bf3b8742e95c58c2431c80c5bed5cb5cb95885af
> >
> > > +{
> > > +	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > > +	pmd_t *new;
> > > +	pud_t *pudp;
> > > +	pud_t pud;
> > > +
> > > +	if (hpte->level != HUGETLB_LEVEL_PUD)
> > > +		return ERR_PTR(-EINVAL);
> >
> > Ah yes, it is PUD level.  However, I guess CONT_PMD would also be valid
> > on arm64?
>
> The level is always PGD, P4D, PUD, PMD, or PTE. CONT_PTE is on
> HUGETLB_LEVEL_PTE, CONT_PMD is on HUGETLB_LEVEL_PMD.
>
> These functions are supposed to be used for all architectures (in
> their implementations of 'hugetlb_walk_step'; that's why they're not
> static, actually. I'll make that clear in the commit description).
> >
> > > +
> > > +	pudp = (pud_t *)hpte->ptep;
> > > +retry:
> > > +	pud = *pudp;
> >
> > We might want to consider a READ_ONCE here.  I am not an expert on such
> > things, but recall a similar issue as pointed out in the now obsolete
> > commit 27ceae9833843.
>
> Agreed. Will try to change all PTE reading to use READ_ONCE, though
> they can be easy to miss... :(
>
> Thanks very much for the reviews so far, Mike!
>
> - James
>
> >
> > --
> > Mike Kravetz
> >
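
P.S. For the READ_ONCE change discussed above, I'm imagining something
along these lines in hugetlb_pmd_alloc (just a sketch of the idea, not
the actual next version of the patch, and only showing the one read Mike
pointed at; the rest of the function is elided here):

	pudp = (pud_t *)hpte->ptep;
retry:
	/*
	 * Snapshot the PUD once so the compiler can't re-read *pudp and
	 * let later checks in this function see a different value from
	 * the one validated here.
	 */
	pud = READ_ONCE(*pudp);
	/* ... rest of the function ... */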