[ The arch list is going to be missing some of the emails in this thread, but they are all on lore: https://lore.kernel.org/linux-mm/20200727184239.GA21230@gaia/ and I think the context is probably sufficient even without that. ]

On Mon, Jul 27, 2020 at 11:42 AM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
>
> At least on arm64 (and arm32), old ptes are not cached in the TLB, so
> there is no need to flush if the only action was to make the pte young
> from old. However, this may not be the same on other architectures.
>
> Also not sure about races with making a pte old then young again, some
> CPU could get confused.

Hmm. I'd love to extend the interface (at the same time we fix the bogus traditional default) to show what the old pte state was, but the whole point is that we don't even know. It is, by definition, gone.

We got a fault for something that is no longer the case, and we didn't modify anything in the page tables. And so we know that the TLB didn't - at the time of the fault - match what we now see in the page tables.

So we don't even know if we "made the pte young from old". Somebody *else* did that part. Maybe two CPU's both hit the HW page table walk at roughly the same time, both saw an old entry and triggered a sw fault, one CPU then won the race to the page table spinlock, and marked it young.

And then the other CPU comes along, says "ok, nothing seems to have changed, it's a spurious fault as far as I can tell, now what?"

It *could* be that "old->young" transition. But it *could* also have been that the other CPU did a write access and turned things writable (in addition to turning it young). We don't even know.

So if arm doesn't cache old ptes in the TLB, then I guess for ARM, the "only try to flush for write faults" is fine, because the only possible stale bit in the TLB would be the dirty bit. And _maybe_ that ends up being true on other architectures too. But it does sound very very dodgy.
Maybe we should just pass in the fault information we have (ie just give flush_tlb_fix_spurious_fault() the whole vmf pointer), and then the architecture can make its own decision based on that.

So if the architecture can say "the only case that might be cached is a non-dirty old PTE that I need to flush now because it's a write fault, and not flushing it would possibly cause an endless loop", then that test for

	if (vmf->flags & FAULT_FLAG_WRITE)

is the right thing.

NOTE! The vmf does have a field that is called "orig_pte", and has the comment "Value of PTE at the time of fault" associated with it.

That comment is bogus. We don't _really_ know what the original pte was; that "orig_pte" is just the one we loaded fairly early, before we took the page table lock. We've made decisions based on that value, but we've also already checked that the pte still matches after taking the page table lock.

So that vmf structure may be less useful than you'd think. The only really useful information in there is likely just the address and the fault flags.

Even the vma is by definition not really useful. The vma we looked up may not be the same vma that the original hardware fault happened with. If we took a fault, and some other CPU got around to doing an mmap() before we got the mmap semaphore in the fault handler, we'll have the *new* vma, but the spurious fault might have come from another source entirely.

But again - any *serious* page table updates we should have synchronized against, and the other CPU will have done the TLB shootdown etc. So we shouldn't need to do anything. The only thing that matters is the trivial bits that _may_ have been changed without bothering with a cross-CPU TLB flush. So it's likely only dirty/accessed bits.

But I really do have this strong memory of us at least at some point deciding that we can avoid it for some other "this operation only ever _adds_ permissions, doesn't take them away" case.
I can't find that code, though, so it might be either early-onset Alzheimer's or some historical footnote that just isn't true any longer.

That said, I *can* find places where we delay TLB flushes a _lot_. So another CPU may be modifying the page tables, and the flush happens much much later.

For example: look at fork(). We'll mark the source page tables as read-only for COW purposes, but we'll delay the actual TLB flush to long long after we did so (but we'll do so with the mmap lock held for writing, to protect against the stack growing).

So it's not even like the page table lock really synchronizes the page table changes with the faults and the TLB flush. The mmap lock taken for writing may do that too.

So there's a fairly large window for these "spurious" faults, where the fault may have happened much earlier, and things have changed a *lot* by the time we actually got all our locks and saw "hey, I see nothing to change in the page tables, the fault is spurious".

Linus