Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

Peter Xu <peterx@xxxxxxxxxx> · Thu, 12 Jan 2023 10:29:08 -0500

On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > James,
> >
> > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > >               int mapcount = page_mapcount(page);
> > >
> > >               if (mapcount >= 2)
> > > -                     mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > +                     mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > >               else
> > > -                     mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > +                     mss->private_hugetlb += hugetlb_pte_size(hpte);
> > >       }
> > >       return 0;
> >
> > One thing interesting I found with hgm right now is mostly everything will
> > be counted as "shared" here, I think it's because mapcount is accounted
> > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > to a small page should be huge too because the head page mapcount should be
> > huge.  I'm curious the reasons behind the mapcount decision.
> >
> > For example, would that risk overflow with head_compound_mapcount?  One 1G
> > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > atomic_t.  Looks like it's possible.
> 
> The original mapcount approach was "if the hstate-level PTE is
> present, increment the compound_mapcount by 1" (basically "if any of
> the hugepage is mapped, increment the compound_mapcount by 1"), but
> this was painful to implement,

Any more info here on why it was painful?  What is the major blocker?

> so I changed it to what it is now (each new PT mapping increments the
> compound_mapcount by 1). I think you're right, there is absolutely an
> overflow risk. :( I'm not sure what the best solution is. I could just go
> back to the old approach.

No rush on that; let's discuss it thoroughly before doing anything.  We
have more context than when it was discussed initially in the calls, so
maybe a good time to revisit.

> 
> Regarding when things are accounted in private_hugetlb vs.
> shared_hugetlb, HGM shouldn't change that, because it only applies to
> shared mappings (right now anyway). It seems like "private_hugetlb"
> can include cases where the page is shared but has only one mapping,
> in which case HGM will change it from "private" to "shared".

The two fields are not defined against VM_SHARED, it seems.  At least not
with current code base.

Let me quote the code again just to be clear:

		int mapcount = page_mapcount(page);    <------------- [1]

		if (mapcount >= 2)
			mss->shared_hugetlb += hugetlb_pte_size(hpte);
		else
			mss->private_hugetlb += hugetlb_pte_size(hpte);

		smaps_hugetlb_hgm_account(mss, hpte);

So that information (for some reason) is only relevant to how many mapcount
is there.  If we have one 1G page and only two 4K mapped, with the existing
logic we should see 8K private_hugetlb while in fact I think it should be
8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings
(as they all goes back to the head).

I have no idea whether violating that will be a problem or not, I suppose
at least it needs justification if it will be violated, or hopefully it can
be kept as-is.

> 
> >
> > Btw, are the small page* pointers still needed in the latest HGM design?
> > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > HGM?  Or maybe it's not needed anymore for the current design?
> 
> The hugetlb vmemmap optimization can still be used with HGM, so there
> is no code to disable it. We don't need small page* pointers either,
> except for when we're dealing with mapping subpages, like in
> hugetlb_no_page. Everything else can handle the hugetlb page as a
> folio.
> 
> I hope we can keep compatibility with the vmemmap optimization while
> solving the mapcount overflow risk.

Yeh that'll be perfect if it works.  But afaiu even with your current
design (ignoring all the issues on either smaps accounting or overflow
risks), we already referenced the small pages, aren't we?  See:

static inline int page_mapcount(struct page *page)
{
	int mapcount = atomic_read(&page->_mapcount) + 1;  <-------- here

	if (likely(!PageCompound(page)))
		return mapcount;
	page = compound_head(page);
	return head_compound_mapcount(page) + mapcount;
}

Even if we assume small page->_mapcount should always be zero in this case,
we may need to take special care of hugetlb pages in page_mapcount() to not
reference it at all.  Or I think it's reading random values and some days
it can be non-zero.

The other approach is probably using the thp approach.  After Hugh's rework
on the thp accounting I assumed it would be even cleaner (at least no
DoubleMap complexity anymore.. even though I can't say I fully digested the
whole history of that).  It's all about what's the major challenges of
using the same approach there with thp.  You may have more knowledge on
that aspect so I'd be willing to know.

Thanks,

-- 
Peter Xu