On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> Once there, we'll do:
>
>         vmf.pud = pud_alloc(mm, p4d, address);
>         if (pud_none(*vmf.pud) &&
>             thp_vma_allowable_order(vma, vm_flags,
>                         TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
>                 ret = create_huge_pud(&vmf);
>
> which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
>
> So all we need to do is implement huge_fault in hugetlb_vm_ops. I
> don't think that's the same as creating a hugetlbfs2 because it's just
> another entry point. You can mmap() the same file both ways and it's
> all cache coherent.

That makes a lot of sense. FWIW, this sounds good to me (though I'm
curious what Peter thinks :)).

But I think you'll need to be careful to ensure that, for now anyway,
huge_fault() is always called with the exact same ptep/pmdp/pudp that
hugetlb_walk() would have returned (ignoring sharing). If you allow PMD
mapping of what would otherwise be PUD-mapped hugetlb pages right now,
you'll break the vmemmap optimization (and probably other things).

Also I'm not sure how this will interact with arm64's hugetlb pages
implemented with contiguous PTEs/PMDs. You might have to round
`address` down to make sure you've picked the first PTE/PMD in the
group.
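
To make the rounding concrete, here is a rough userspace sketch, not
kernel code. The SKETCH_* constants and both helper names are
hypothetical, and the sizes assume arm64 with a 4K granule (16
contiguous PTEs = 64K, 16 contiguous PMDs = 32M); other granules use
different group sizes. In the kernel, the arch's own constants
(CONT_PTE_MASK and friends on arm64) would presumably be used instead.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for arm64 with a 4K granule; hypothetical names. */
#define SKETCH_PAGE_SHIFT      12
#define SKETCH_PMD_SHIFT       21
#define SKETCH_CONT_PTE_SHIFT  (SKETCH_PAGE_SHIFT + 4)  /* 16 PTEs = 64K */
#define SKETCH_CONT_PMD_SHIFT  (SKETCH_PMD_SHIFT + 4)   /* 16 PMDs = 32M */

/*
 * Round a faulting address down to the start of its contiguous group,
 * i.e. the address covered by the first PTE/PMD in the group.
 */
static uint64_t round_down_to_group(uint64_t address, unsigned int group_shift)
{
	assert(group_shift < 64);
	return address & ~((UINT64_C(1) << group_shift) - 1);
}

/*
 * Which entry inside the group the address falls on (0 means the
 * address already points at the first entry).
 */
static unsigned int index_in_group(uint64_t address,
				   unsigned int entry_shift,
				   unsigned int group_shift)
{
	assert(entry_shift <= group_shift && group_shift < 64);
	uint64_t offset = address & ((UINT64_C(1) << group_shift) - 1);
	return (unsigned int)(offset >> entry_shift);
}
```

With these assumptions, a fault at 0x12345678 in a contiguous-PTE
mapping lands on entry 5 of its group, and the address to hand to the
handler would be 0x12340000.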