On Oct 30, 2022, at 11:19 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > And page_remove_rmap() could *almost* be called later, but it does > have code that also depends on the page table lock, although it looks > like realistically that's just because it "knows" that means that > preemption is disabled, so it uses non-atomic statistics update. > > I say "knows" in quotes, because that's what the comment says, but it > turns out that __mod_node_page_state() has to deal with CONFIG_RT > anyway and does that > > preempt_disable_nested(); > ... > preempt_enable_nested(); > > thing. > > And then it wants to see the vma, although that's actually only to see > if it's 'mlock'ed, so we could just squirrel that away. > > So we *could* move page_remove_rmap() later into the TLB flush region, > but then we would have lost the page table lock anyway, so then > folio_mkclean() can come in regardless. > > So that doesn't even help. Well, if you combine it with the per-page-table stale TLB detection mechanism that I proposed, I think this could work. Reminder (feel free to skip): you would have per-mm “completed TLB-generation” in addition to the current one, which would be renamed to “pending TLB-generation”. Whenever you update the page-tables in a manner that might require a TLB flush, you would increase the “pending TLB-generation” and save the pending TLB-generation in the page-table’s page-struct. All of that is done once under the page-table lock. When you finish a TLB-flush, you update the “completed TLB-generation”. Then on page_vma_mkclean_one(), you would check if the page-table’s TLB-generation is greater than the completed TLB-generation, which would indicate that TLB entries for PTEs in this table might be stale. In that case you would just flush the TLB. [ Of course you can instead just flush if mm_tlb_flush_pending(), but nobody likes this mechanism that has a very coarse granularity, and therefore can lead to many unnecessary TLB flushes. ] Indeed, there would be potentially some overhead in extreme cases, since mm's TLB-generation since its cache is already highly-contended in extreme cases. But I think it worth it to have simple logic that allows to reason about correctness. My intuition is that although you appear to be right that we can just mark this case as “extreme case nobody cares about”, it might have now or in the future some other implications that are hard to predict and prevent.