Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Thu, 19 Jan 2023 14:00:32 -0800

On 01/19/23 11:42, James Houghton wrote:
> On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> > On 01/19/23 08:57, James Houghton wrote:
> >
> > OK, so we do not actually create HGM mappings until a uffd operation is
> > done at a less than huge page size granularity.  MADV_SPLIT just says
> > that HGM mappings are 'possible' for this vma.  Hopefully, my understanding
> > is correct.
> 
> Right, that's the current meaning of MADV_SPLIT for hugetlb.
> 
> > I was concerned about things like the page fault path, but in that case
> > we have already 'entered HGM mode' via a uffd operation.
> >
> > Both David and Peter have asked whether eliminating intermediate mapping
> > levels would be a simplification.  I trust your response that it would
> > not help much in the current design/implementation.  But, it did get me
> > thinking about something else.
> >
> > Perhaps we have discussed this before, and perhaps it does not meet all
> > user needs, but one way possibly simplify this is:
> >
> > - 'Enable HGM' via MADV_SPLIT.  Must be done at huge page (hstate)
> >   granularity.
> > - MADV_SPLIT implicitly unmaps everything with in the range.
> > - MADV_SPLIT says all mappings for this vma will now be done at a base
> >   (4K) page size granularity.  vma would be marked some way.
> > - I think this eliminates the need for hugetlb_pte's as we KNOW the
> >   mapping size.
> > - We still use huge pages to back 4K mappings, and we still have to deal
> >   with the ref/map_count issues.
> > - Code touching hugetlb page tables would KNOW the mapping size up front.
> >
> > Again, apologies if we talked about and previously dismissed this type
> > of approach.
> 
> I think Peter was the one who originally suggested an approach like
> this, and it meets my needs. However, I still think the way that
> things are currently implemented is the right way to go.
> 
> Assuming we want decent performance, we can't get away with the same
> strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE
> should be based on the PMD above the PTE, so we need to either pass
> around the PMD above our PTE, or we need to pass around the PTL. This
> is something that hugetlb_pte does for us, so, in some sense, even
> going with this simpler approach, we still need a hugetlb_pte-like
> construct.

Agree there is this performance hit.  However, the 'simplest' approach
would be to just use the page table lock as is done by default for 4K
PTEs.

I do not know much about the (primary) live migration use case.  My
guess is that page table lock contention may be an issue?  In this use
case, HGM is only enabled for the duration the live migration operation,
then a MADV_COLLAPSE is performed.  If contention is likely to be an
issue during this time, then yes we would need to pass around with
something like hugetlb_pte.

> Although most of the other complexity that HGM introduces would have
> to be introduced either way (like having to deal with putting
> compound_head()/page_folio() in more places and doing some
> per-architecture updates), there are some complexities that the
> simpler approach avoids:
> 
> - We avoid problems related to compound PTEs (the problem being: two
> threads racing to populate a contiguous and non-contiguous PTE that
> take up the same space could lead to user-detectable incorrect
> behavior. This isn't hard to fix; it will be when I send the arm64
> patches up.)
> 
> - We don't need to check if PTEs get split from under us in PT walks.
> (In a lot of cases, the appropriate action is just to treat the PTE as
> if it were pte_none().) In the same vein, we don't need
> hugetlb_pte_present_leaf() at all, because PTEs we find will always be
> leaves.
> 
> - We don't have to deal with sorting hstates or implementing
> for_each_hgm_shift()/hugetlb_alloc_largest_pte().
> 
> None of these complexities are particularly major in my opinion.

Perhaps not.  I was just thinking about the overall complexity of the
hugetlb code after HGM.  Currently, it is 'relatively simple' with
fixed huge page sizes.  IMO, much simpler than THP with two possible
mapping sizes.  With HGM and intermediate mapping sizes, it seems
things could get more complicated than THP.  Perhaps it is just me.
I am just too familiar with the current code and a bit anxious about
added complexity.  But, I felt the same about vmemmap optimizations. :)

> This might seem kind of contrived, but let's say you have a VM with 1T
> of memory, and you find 100 memory errors all in different 1G pages
> over the life of this VM (years, potentially). Having 10% of your
> memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
> (lost performance and increased memory overhead). There might be other
> cases in the future where being able to have intermediate mapping
> sizes could be helpful.

That may be a bit contrived.  We know memory error handling is a future
use case, but I believe there is work outside of HGM than needs to be
done to handle such situations.  For example, HGM will allow the 1G
mapping to isolate the 4K page with error.  This prevents errors if you
fault almost anywhere within the 1G page.  But, there still remains the
possibility of accessing that 4K page page with error.  IMO, it will
require user space/application intervention to address this as only the
application knows about the potentially lost data.  This is still something
that needs to be designed.  It would then makes sense for the application
to also determine how it wants to proceed WRT mapping the 1G area.
Perhaps they will want (and there will exist a mechanism) to migrate the
data to a new 1G page without error.

> > > > I think Peter mentioned it elsewhere, we should come up with a workable
> > > > scheme for HGM ref/map counting.  This can be done somewhat independently.
> > >
> > > FWIW, what makes the most sense to me right now is to implement the
> > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> > > optimization. We can later come up with a scheme that lets us retain
> > > compatibility. (Is that what you mean by "this can be done somewhat
> > > independently", Mike?)
> >
> > Sort of, I was only saying that getting the ref/map counting right seems
> > like a task than can be independently worked.  Using the THP-like scheme
> > is good.
> 
> Ok! So if you're ok with the intermediate mapping sizes, it sounds
> like I should go ahead and implement the THP-like scheme.

Yes, I am OK with it.  Just expressed a bit of concern above.
-- 
Mike Kravetz