On 05/31/22 10:00, Mike Kravetz wrote:
> On 5/30/22 12:56, Peter Xu wrote:
> > Hi, Mike,
> >
> > On Fri, May 27, 2022 at 03:58:47PM -0700, Mike Kravetz wrote:
> >> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
> >> +{
> >> +	unsigned long hp_size = huge_page_size(h);
> >> +
> >> +	if (hp_size == P4D_SIZE)
> >> +		return PGDIR_SIZE - P4D_SIZE;
> >> +	else if (hp_size == PUD_SIZE)
> >> +		return P4D_SIZE - PUD_SIZE;
> >> +	else if (hp_size == PMD_SIZE)
> >> +		return PUD_SIZE - PMD_SIZE;
> >> +
> >> +	return ~(0);
> >> +}
> >
> > How about:
> >
> > unsigned long hugetlb_mask_last_hp(struct hstate *h)
> > {
> > 	unsigned long hp_size = huge_page_size(h);
> >
> > 	return hp_size * (PTRS_PER_PTE - 1);
> > }
> >
> > ?

As mentioned in a followup e-mail, I am a little worried about this
calculation not being accurate for all configurations.  Today,
PTRS_PER_PTE == PTRS_PER_PMD == PTRS_PER_PUD == PTRS_PER_P4D in all
architectures that select CONFIG_ARCH_WANT_GENERAL_HUGETLB.  However,
if we code things as above and that ever changes, the resulting bug
might be hard to find.  In the next version, I will keep the behavior
as above but move to a switch statement for better readability.

> > This is definitely a good idea, though I'm wondering the possibility
> > to go one step further to make hugetlb pgtable walk just like the
> > normal pages.
> >
> > Say, would it be non-trivial to bring some of huge_pte_offset() into
> > the walker functions, so that we can jump over even larger than
> > PTRS_PER_PTE entries (e.g. when p4d==NULL for 2m huge pages)?  It's
> > very possible I overlooked something, though.

I briefly looked at this.  To make it work, the walker zapping
functions such as zap_*_range would need an is_vm_hugetlb_page(vma)
check and, if true, would have to use hugetlb specific page table
routines instead of the generic routines.  In many cases, the hugetlb
specific page table routines are the same as the generic routines.
But, there are a few exceptions.
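As a standalone sketch of the switch-statement rewrite mentioned above (this is not the kernel code: the level sizes are hard-coded to hypothetical x86_64 5-level paging values so it compiles in userspace, and the function takes the huge page size directly instead of a struct hstate *):

```c
#include <assert.h>

/* Hypothetical constants modeling x86_64 with 5-level paging and 4KB
 * base pages (512 entries per table); the kernel derives these from
 * the architecture's page table layout. */
#define PMD_SIZE	(1UL << 21)	/* 2 MB */
#define PUD_SIZE	(1UL << 30)	/* 1 GB */
#define P4D_SIZE	(1UL << 39)	/* 512 GB */
#define PGDIR_SIZE	(1UL << 48)	/* 256 TB */

/*
 * Switch-based variant of hugetlb_mask_last_hp(): for a huge page of
 * the given size, return the offset of the last huge page entry that
 * shares the same next-higher-level table entry, i.e. how far the
 * unmap walker may advance in one step when that upper-level entry
 * turns out to be empty.
 */
static unsigned long hugetlb_mask_last_hp(unsigned long hp_size)
{
	switch (hp_size) {
	case P4D_SIZE:
		return PGDIR_SIZE - P4D_SIZE;
	case PUD_SIZE:
		return P4D_SIZE - PUD_SIZE;
	case PMD_SIZE:
		return PUD_SIZE - PMD_SIZE;
	}
	return ~0UL;	/* unknown size: no skipping */
}
```

With these particular constants each difference equals hp_size * 511 (512 entries per table), matching the PTRS_PER_PTE formula; the point of the switch is that each level's size is spelled out explicitly, so the code stays correct even if the per-level table sizes ever diverge.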
IMO, it would be better to first try to clean up and unify those
routines.  That would make changes to the walker routines less
invasive and easier to maintain.  I believe there is other code that
would benefit from such a cleanup as well.  Unless there are strong
objections, I suggest we move forward with the optimization here and
move the cleanup and possible walker changes to a later series.
--
Mike Kravetz