On Thu, Jun 06, 2024 at 09:04:53PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 06, 2024 at 12:30:44PM -0700, James Houghton wrote:
> > Today the VM_HUGETLB flag tells the fault handler to call into
> > hugetlb_fault() (there are many other special cases, but this one is
> > probably the most important). How should faults on VMAs without
> > VM_HUGETLB that map HugeTLB folios be handled? If you handle faults
> > with the main mm fault handler without getting rid of hugetlb_fault(),
> > I think you're basically implementing a second, more tmpfs-like
> > hugetlbfs... right?
> >
> > I don't really have anything against this approach, but I think the
> > decision was to reduce the number of special cases as much as we can
> > first before attempting to rewrite hugetlbfs.
> >
> > Or maybe I've got something wrong and what you're asking doesn't
> > logically end up at a hugetlbfs v2.
>
> Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> Once there, we'll do:
>
>         vmf.pud = pud_alloc(mm, p4d, address);
>         if (pud_none(*vmf.pud) &&
>             thp_vma_allowable_order(vma, vm_flags,
>                                 TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
>                 ret = create_huge_pud(&vmf);
>
> which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
>
> So all we need to do is implement huge_fault in hugetlb_vm_ops.  I
> don't think that's the same as creating a hugetlbfs2 because it's just
> another entry point.  You can mmap() the same file both ways and it's
> all cache coherent.

Matthew, could you elaborate on how hugetlb_vm_ops.huge_fault() could start
injecting hugetlb pages without a hugetlb VMA?

I mean, at least currently a hugetlb VMA is always defined as:

	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
	vma->vm_ops = &hugetlb_vm_ops;

So any VMA that uses hugetlb_vm_ops will have VM_HUGETLB set for sure..

If you're talking about some other VMAs, it sounds to me like this
huge_fault() should belong to that new VMA's vm_ops?  Then it sounds like a
way for a non-hugetlb VMA to reuse the hugetlb allocator/pool of huge
pages.  I'm not sure I understand it right, though..

Regarding the 4k mapping plan on hugetlb: I talked to Michal, Yu and some
other people during LSF/MM, and so far it seems to me the best way to go is
to allow shmem to provide 1G pages.

Again, IMHO I'd be totally fine if we finish the cleanup and just add HGM
on top of hugetlb v1, but it looks like I'm the only person who thinks like
that..

If we can introduce 1G pages to shmem, then as long as those 1G pages can
be as stable as a hugetlb v1 1G page, it's good enough for the VM use case
(which I care about, and I believe so do most cloud providers who care
about postcopy), and adding 4k mappings on top of that can avoid all the
hugetlb concerns people have too (even though I think most of the logic
that HGM wants will still be there).

That also kind of matches TAO's design, where we may have a better chance
of having THPs allocated dynamically even at 1G size; however, for the VM
context we'll want reliable 1G pages, not split-able ones.  We may want
them split-able only in the pgtables, not the folios.

However, I don't yet think any of these are solid ideas.  It might be
interesting to know how your thoughts correlate to this too, since I think
you mentioned the 4k mapping somewhere.  I'm also taking the liberty of
copying relevant people just in case this discussion is relevant to them.

[PS: I will need to be off work tomorrow, so please expect a delay on
follow-up emails..]

Thanks,

-- 
Peter Xu
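
For concreteness, here is a minimal sketch of what "implement huge_fault in
hugetlb_vm_ops" could look like if the handler simply forwarded PUD-order
faults from create_huge_pud() into the existing hugetlb_fault() path.  The
hugetlb_huge_fault() helper is hypothetical (not from any posted series),
and locking, reservations, and hugetlb_fault()'s own expectations are
glossed over:

	/*
	 * Hypothetical sketch: serve faults arriving via the generic
	 * __handle_mm_fault() -> create_huge_pud() -> vm_ops->huge_fault()
	 * path by forwarding them to the existing hugetlb fault code.
	 */
	static vm_fault_t hugetlb_huge_fault(struct vm_fault *vmf,
					     unsigned int order)
	{
		struct vm_area_struct *vma = vmf->vma;
		struct hstate *h = hstate_vma(vma);

		/* Only serve faults whose order matches this mapping's hstate. */
		if (order != huge_page_order(h))
			return VM_FAULT_FALLBACK;

		/* Reuse the current entry point; a real series likely would not. */
		return hugetlb_fault(vma->vm_mm, vma, vmf->address, vmf->flags);
	}

	const struct vm_operations_struct hugetlb_vm_ops = {
		.fault		= hugetlb_vm_op_fault,
		.open		= hugetlb_vm_op_open,
		.close		= hugetlb_vm_op_close,
		.may_split	= hugetlb_vm_op_split,
		.pagesize	= hugetlb_vm_op_pagesize,
		.huge_fault	= hugetlb_huge_fault,	/* hypothetical addition */
	};

In that shape the generic fault path reaches hugetlb without going through
the VM_HUGETLB special case, which seems to be Matthew's point; whether it
can work for a VMA that isn't set up as a hugetlb VMA at all is exactly the
open question above.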