On Thu, Jun 6, 2024 at 2:21 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jun 06, 2024 at 01:23:08PM -0700, James Houghton wrote:
> > On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> > > Once there, we'll do:
> > >
> > >         vmf.pud = pud_alloc(mm, p4d, address);
> > >         if (pud_none(*vmf.pud) &&
> > >             thp_vma_allowable_order(vma, vm_flags,
> > >                                 TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
> > >                 ret = create_huge_pud(&vmf);
> > >
> > > which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
> > >
> > > So all we need to do is implement huge_fault in hugetlb_vm_ops.  I
> > > don't think that's the same as creating a hugetlbfs2 because it's just
> > > another entry point.  You can mmap() the same file both ways and it's
> > > all cache coherent.
> >
> > That makes a lot of sense. FWIW, this sounds good to me (though I'm
> > curious what Peter thinks :)).
> >
> > But I think you'll need to be careful to ensure that, for now anyway,
> > huge_fault() is always called with the exact same ptep/pmdp/pudp that
> > hugetlb_walk() would have returned (ignoring sharing). If you allow
> > PMD mapping of what would otherwise be PUD-mapped hugetlb pages right
> > now, you'll break the vmemmap optimization (and probably other
> > things).
>
> Why is that?  This sounds like you know something I don't ;-)
> Is it the mapcount issue?

Yeah, that's what I was thinking about. But I guess whether or not you
are compatible with the vmemmap optimization depends on what mapcounting
scheme you use. If you just use the THP one, you're going to end up
incrementing _mapcount on the subpages, and that won't work.

I'm not immediately thinking of other things that would break... need to
think some more.