On Wed, Sep 22, 2021 at 01:49:42PM -0700, Axel Rasmussen wrote: > Sorry for missing the other thread. Heh, you didn't miss the other thread; I just posted both of the emails in a few hours. :) > > Unfortunately, I think shmem THP *doesn't* really work with minor > faults, and what's worse, just checking the VMA flag isn't enough. As I replied to myself (sorry to have done that), now I think minor mode is fine, but let's see what else I've missed, which is possible... Please see below. > > First, let me note the guarantee UFFD minor faults are trying to > provide: for a given mapping, any minor fault (that is, pte_none() but > a page is present in the page cache) must result in a minor userfault > event. Furthermore, the only way the fault may be resolved (i.e., a > PTE installed) is via a UFFDIO_CONTINUE ioctl from userspace. Yes. > > A typical use case for minor faults is, we have two mappings (i.e., > two VMAs), both pointing to the same underlying physical memory. It's > typical for both to have MAP_SHARED. It's typical for one of these > mappings to be fully faulted in (i.e., all of its PTEs exist), while > the other one has some missing PTEs. The problem is, khugepaged might > scan *either* of the two mappings. Say it picks the fully-faulted VMA: > even if we set khugepaged_max_ptes_none to zero, it will still go > ahead and collapse these pages - because *this* VMA has no missing > PTEs. Yes. > > Why is this a problem? When we collapse, we install a PMD, for *all* > VMAs which reference these pages. In other words, we might install > PTEs for the other, minor-fault-registered mapping, and therefore > userfaults will never trigger for some of those regions, even though > userspace never UFFDIO_CONTINUE-ed them. Nop - we don't install PMD for file-backed, do we? Please see khugepaged_scan_pmd() - that one installs PMDs indeed, but it's anonymous-only code. Then please also see khugepaged_scan_file() - that one handles file-backed (aka, shmem), and it does _not_ install pmd, afaict. The installation is lazy. Not installing pmd means uffd-minor can still trap any further faults just like before, afaiu. There's a very trivial detail that the pmd missing case will have a very slight code path change when the next page fault happens: in __handle_mm_fault() we'll first try to go into create_huge_pmd() once, however since shmem didn't provide huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like before when faulting on a small pte. The next UFFDIO_CONTINUE will allocate that missing pmd again, however it'll install a 4K page only. > > I *think* the right place to check for this and solve it is in > retract_page_tables(), and I have a patch which does this. I've been > hesitant to send it though, as due to a lack of time and the > complexity involved I haven't been able to write a clear reproducer > program, which my patch clearly fixes. :/ Yes retract_page_tables() could drop pmd pgtable for minor fault, but IMHO it's fine too as mentioned above. Minor mode should only care about trapping the page fault when the next access comes. retract_page_tables() will wipe the pmd pgtable page, that's not fine for uffd-wp, but IMHO that's still very fine for minor mode as it will keep trapping the old missing ptes; the difference is it'll just generate even more traps (rather than on the pte holes only, now it'll generate one message for each 4k over the merged 2M). As I mentioned in the other thread, I think that'll cause false positive minor fault messages, but IMHO that's fine, and minor fault userspace should always need to handle that. I fully agree with you that a reproducer would be very nice to try. So if my understanding is correct, the reproducer won't really fail on minor mode in any way, but it'll just need to be prepared to receive more messages than it should. Thanks, -- Peter Xu