> On Dec 22, 2020, at 12:34 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Sat, Dec 19, 2020 at 2:06 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>> [ I have in mind another solution, such as keeping in each page-table a
>>> “table-generation” which is the mm-generation at the time of the change,
>>> and only flushing if “table-generation” == “mm-generation”, but it requires
>>> some thought on how to avoid adding new memory barriers. ]
>>>
>>> IOW: I think the change that you suggest is insufficient, and a proper
>>> solution is too intrusive for “stable”.
>>>
>>> As for performance, I can add another patch later to remove the TLB flush
>>> that is unnecessarily performed during change_protection_range() that does
>>> permission promotion. I know that your concern is about the “protect” case,
>>> but I cannot think of a good immediate solution that avoids taking mmap_lock
>>> for write.
>>>
>>> Thoughts?
>>
>> On second thought (i.e., I don’t know what I was thinking), doing so -
>> checking mm_tlb_flush_pending() on every potentially dangerous PTE read
>> and flushing if needed - can lead to a huge number of TLB flushes and
>> shootdowns, as the counter might stay elevated for a considerable amount
>> of time.
>
> I've lost track as to whether we still think that this particular
> problem is really a problem,

If by “problem” you mean whether there is a correctness issue with the
userfaultfd and soft-dirty deferred flushes under mmap_read_lock() - yes,
there is a problem, and I reproduced these failures on upstream.

If by “problem” you mean performance - I do not know.

> but could we perhaps make the
> tlb_flush_pending field be per-ptl instead of per-mm?  Depending on
> how it gets used, it could plausibly be done without atomics or
> expensive barriers by using PTL to protect the field.
>
> FWIW, x86 has a mm generation counter, and I don't think it would be
> totally crazy to find a way to expose an mm generation to core code.
> I don't think we'd want to expose the specific data structures that
> x86 uses to track it -- they're very tailored to the oddities of x86
> TLB management.  x86 also doesn't currently have any global concept of
> which mm generation is guaranteed to have been propagated to all CPUs
> -- we track the generation in the pagetables and, per cpu, the
> generation that we know that CPU has seen.  x86 could offer a function
> "ensure that all CPUs catch up to mm generation G and don't return
> until this happens" and its relative "have all CPUs caught up to mm
> generation G", but these would need to look at data from multiple CPUs
> and would probably be too expensive on very large systems to use in
> normal page faults unless we were to cache the results somewhere.
> Making a nice cache for this is surely doable, but maybe more
> complexity than we'd want.

I had somewhat similar ideas - saving the generation in each page-table's
page-struct - which would allow us to: (1) extend pte_same() to detect
interim changes that were reverted (RO->RW->RO), and (2) track pending
flushes per PTE.

Obviously, I cannot do that as part of this fix. Moreover, it seems to me
that it would require a memory barrier after updating the PTEs and before
reading the current generation (which would then be saved per page-table).
I have tried to think of schemes that would use the per-CPU generation
instead, but I could not come up with one and have not had the time to
figure it out.
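
To make this a bit more concrete, here is a very rough, untested sketch of
the generation idea; mm_gen, flushed_gen and table_gen are made-up fields
(imagined in mm_struct and in the page-table's struct page), not anything
that exists today:

#include <linux/mm.h>
#include <asm/tlbflush.h>

/* Writer side, e.g. change_protection_range() deferring its flush: */
static void record_pending_flush(struct mm_struct *mm, struct page *pt_page)
{
        /*
         * The PTE updates done just before this must be ordered before we
         * sample the current generation - this is the extra barrier that
         * worries me.
         */
        smp_mb();
        WRITE_ONCE(pt_page->table_gen, (u64)atomic64_read(&mm->mm_gen));
}

/* Reader side, e.g. a page-fault path about to rely on a PTE it read: */
static void flush_if_table_pending(struct vm_area_struct *vma,
                                   struct page *pt_page, unsigned long addr)
{
        struct mm_struct *mm = vma->vm_mm;

        /*
         * Flush only if this page-table recorded changes newer than what
         * has already been flushed; flushed_gen would be advanced by
         * whoever eventually performs the deferred flush.
         */
        if (READ_ONCE(pt_page->table_gen) >
            (u64)atomic64_read(&mm->flushed_gen))
                flush_tlb_page(vma, addr);
}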
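
For comparison, a minimal sketch of the per-ptl tlb_flush_pending variant you
suggest might look as follows; pt_flush_pending is again a made-up counter,
imagined to live in the page-table's struct page next to the split PTL, so
that the PTL itself serializes all accesses and no atomics or extra barriers
are needed:

#include <linux/spinlock.h>
#include <linux/mm_types.h>

/* Before deferring a flush for PTEs covered by this page-table: */
static void pt_inc_flush_pending(struct page *pt_page, spinlock_t *ptl)
{
        spin_lock(ptl);
        pt_page->pt_flush_pending++;
        spin_unlock(ptl);
}

/* After the deferred TLB flush has actually been performed: */
static void pt_dec_flush_pending(struct page *pt_page, spinlock_t *ptl)
{
        spin_lock(ptl);
        pt_page->pt_flush_pending--;
        spin_unlock(ptl);
}

/*
 * The interesting readers (e.g. a fault handler deciding whether it must
 * flush before relying on a PTE) already hold the PTL, so a plain load is
 * enough.
 */
static bool pt_has_flush_pending(struct page *pt_page, spinlock_t *ptl)
{
        lockdep_assert_held(ptl);
        return pt_page->pt_flush_pending != 0;
}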
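
And for the “catch up to mm generation G” interface you describe: the per-CPU
bookkeeping below is invented for the example (the real x86 tracking lives in
cpu_tlbstate and is per mm-context), but it shows why such a function has to
look at every CPU and why it would be expensive on very large systems:

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Invented per-CPU "last generation this CPU has seen", for illustration. */
static DEFINE_PER_CPU(u64, cpu_seen_gen);

/* "Have all CPUs caught up to mm generation gen?"
 * (mm is unused in this oversimplified, not-per-mm version.) */
static bool mm_gen_caught_up(struct mm_struct *mm, u64 gen)
{
        int cpu;

        for_each_online_cpu(cpu) {
                if (per_cpu(cpu_seen_gen, cpu) < gen)
                        return false;
        }
        return true;
}

/* "Do not return until all CPUs have caught up to mm generation gen." */
static void mm_gen_wait(struct mm_struct *mm, u64 gen)
{
        while (!mm_gen_caught_up(mm, gen))
                cpu_relax();
}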