Matthew Wilcox wrote on Thu, Sep 05, 2019:
> On Thu, Sep 05, 2019 at 05:44:00PM +0200, Dominique Martinet wrote:
> > Question though - is it ok to insert small pages if the huge_fault
> > handler is called with PE_SIZE_PMD ?
> > (I think the pte insertion will automatically create the pmd, but would
> > be good to confirm)
>
> No, you need to return VM_FAULT_FALLBACK, at which point the generic code
> will create a PMD for you and then call your ->fault handler which can
> insert PTEs.

Hmm, that's a shame actually.
There is a rather costly round-trip between linux and mckernel to
determine what page size is used for this virtual address on the remote
side and to get the corresponding physical address, so basically when we
get the fault we do not know if this will be a PMD or a PTE.

I'd rather avoid having to do one round-trip at the PMD stage, get told
this is a PTE, temporarily give up and wait to be called again with
PE_SIZE_PTE, and then do a second round-trip in this case.

I didn't see anything in the vm_fault struct that I could piggy-back on
to remember something from the previous call, and I'm pretty sure it
would be a bad idea to use the vma's vm_private_data here because there
could be multiple faults in parallel on other threads.

Looking at vmf_insert_pfn(), it will allocate a pmd because of
insert_pfn's get_locked_pte, so it does end up working (I never return a
page - we always return VM_FAULT_NOPAGE on success, so I do not see the
harm in doing it early if we can).

Following the code in __handle_vm_fault, and assuming the pmd fault had
returned fallback, I do not see any harm here - the pmd has actually
already been allocated at this point (at the pmd level fault), it's just
set to none.
Not exactly pretty, though, and very definitely no guarantee it'll keep
working...
I'll stick a comment saying what we should do at least :P

> It works the same way from PUDs to PMDs by the way, in case you ever
> have a 1GB mapping ;-)

Yes, already returning fallback in this case - but I'm just assuming
that won't happen so no round-trip here :)

> > Now that I've set it as dax I think it actually makes sense as in
> > "there's memory here that points to something linux no longer manages
> > directly, just let it be" and we might benefit from the other exceptions
> > dax have, I'll need to look at what this implies in more details...
>
> I think that should be fine, but I don't really know RHEL 7.3 all that
> well ;-)

Good enough for me, tests will tell me what I broke :)

> No problem ... these APIs are relatively new and not necessarily all
> that intuitive.

Looking at a recent vanilla linux in the evening and rhel's kernel at
work didn't help on my side (some fun differences like the VM_HUGE_FAULT
flag in the vma, but now that I understand it was added for abi
compatibility it does make sense - on an older module the function
pointer could just have been left uninitialized and thus non-null yet
not valid).

Definitely did help to point at huge_fault() again.

Thanks,
-- 
Dominique