Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault

[ The arch list is going to be missing some of the emails in this
thread, but they are all on lore:

      https://lore.kernel.org/linux-mm/20200727184239.GA21230@gaia/

   and I think the context is probably sufficient even without that. ]

On Mon, Jul 27, 2020 at 11:42 AM Catalin Marinas
<catalin.marinas@xxxxxxx> wrote:
>
> At least on arm64 (and arm32), old ptes are not cached in the TLB, so
> there is no need to flush if the only action was to make the pte young
> from old. However, this may not be the same on other architectures.
>
> Also not sure about races with making a pte old then young again, some
> CPU could get confused.

Hmm. I'd love to extend the interface (at the same time we fix the
bogus traditional default) to show what the old pte state was, but the
whole point is that we don't even know.

It is, by definition, gone.

We got a fault for something that is no longer the case, and we didn't
modify anything in the page tables. And so we know that the TLB didn't
- at the time of the fault - match what we now see in the page tables.

So we don't even know if we "made the pte young from old". Somebody
*else* did that part. Maybe two CPUs both hit the HW page table walk
at roughly the same time, both saw an old entry and triggered a
software fault, one CPU then won the race to the page table spinlock,
and marked the entry young.

And then the other CPU comes along, says "ok, nothing seems to have
changed, it's a spurious fault as far as I can tell, now what?"

It *could* be that "old->young" transition. But it *could* also have
been that the other CPU did a write access and turned things writable
(in addition to turning it young). We don't even know.

So if arm doesn't cache old ptes in the TLB, then I guess the "only
try to flush for write faults" approach is fine there, because the
only possible stale bit in the TLB would be the dirty bit.

And _maybe_ that ends up being true on other architectures too. But it
does sound very very dodgy.

Maybe we should just pass in the fault information we have (ie just
give flush_tlb_fix_spurious_fault() the whole vmf pointer), and then
the architecture can make their own decision based on that.

So if the architecture can say "the only case that might be cached is
a non-dirty old PTE that I need to flush now because it's a write
fault, and not flushing it would possibly cause an endless loop", then
that test for

        if (vmf->flags & FAULT_FLAG_WRITE)

is the right thing.
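
Something like this, maybe. Entirely untested sketch: the vmf-taking
signature is made up just to show the shape, it's not what is in the
tree today:

        /*
         * Hypothetical: pass the whole vmf down instead of (vma, address).
         * The generic fallback keeps the traditional "always flush"
         * behavior for architectures that don't know any better.
         */
        #ifndef flush_tlb_fix_spurious_fault
        #define flush_tlb_fix_spurious_fault(vmf) \
                flush_tlb_page((vmf)->vma, (vmf)->address)
        #endif

        /* An architecture that never caches old ptes could then do: */
        static inline void arch_flush_tlb_fix_spurious_fault(struct vm_fault *vmf)
        {
                /*
                 * The only stale TLB entry that can keep faulting forever
                 * is a read-only one on a write fault, so only flush then.
                 */
                if (vmf->flags & FAULT_FLAG_WRITE)
                        flush_tlb_page(vmf->vma, vmf->address);
        }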

NOTE! The vmf does have a field called "orig_pte", with the comment
"Value of PTE at the time of fault" attached to it. That comment is
bogus.

We don't _really_ know what the original pte was: that "orig_pte" is
just the one we loaded fairly early, before we took the page table
lock. We've made decisions based on that value, but we've also
already checked, after taking the page table lock, that the pte still
matches.
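
To be concrete, the shape of handle_pte_fault() is roughly this
(condensed and paraphrased, so don't hold me to the details):

        /* Racy read, no page table lock held yet */
        vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
        vmf->orig_pte = *vmf->pte;

        /* ... decisions (none/swap/numa..) based on that racy value ... */

        vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
        spin_lock(vmf->ptl);
        entry = vmf->orig_pte;
        if (unlikely(!pte_same(*vmf->pte, entry))) {
                /* Changed under us: back out, the fault just retries */
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                return 0;
        }
        /* only here do we know the pte still matches orig_pte */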

So that vmf structure may be less useful than you'd think. The only
really useful information in there is likely just the address and the
fault flags.

Even the vma is by definition not really useful. The vma we looked up
may not be the same vma that the original hardware fault happened
with. If we took a fault, and some other CPU got around to doing an
mmap()
before we got the mmap semaphore in the fault handler, we'll have the
*new* vma, but the spurious fault might have come from another source
entirely.

But again - any *serious* page table updates we should have
synchronized against, and the other CPU will have done the TLB
shootdown etc. So we shouldn't need to do anything. The only thing
that matters is the trivial bits that _may_ have been changed without
bothering with a cross-CPU TLB flush.
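
For reference, the tail of handle_pte_fault() that this patch is
touching looks roughly like this (paraphrased, comments mine):

        entry = pte_mkyoung(entry);
        if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
                                  vmf->flags & FAULT_FLAG_WRITE)) {
                /* the pte actually changed: update the local MMU state */
                update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
        } else {
                /*
                 * Nothing changed: spurious fault. Only a stale read-only
                 * TLB entry on a write fault can keep faulting forever,
                 * so that is the only case that gets flushed.
                 */
                if (vmf->flags & FAULT_FLAG_WRITE)
                        flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
        }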

So it's likely only dirty/accessed bits. But I really do have this
strong memory of us at least at some point deciding that we can avoid
it for some other "this operation only ever _adds_ permissions,
doesn't take them away" case.

I can't find that code, though, so it might be either early-onset
Alzheimer's, or some historical footnote that just isn't true any
longer.

That said, I *can* find places where we delay TLB flushes a _lot_. So
another CPU may be modifying the page tables, and the flush happens
much much later.

For example: look at fork(). We'll mark the source page tables as
read-only for COW purposes, but we'll delay the actual TLB flush
until long after we did so (though we do it with the mmap lock held
for writing, to protect against the stack growing).
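
Paraphrasing the shape of kernel/fork.c:dup_mmap(), with error
handling and details elided:

        mmap_write_lock(oldmm);  /* held for writing: fences stack growth */

        for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
                /* copy_page_range() write-protects the parent ptes for COW */
                retval = copy_page_range(mm, oldmm, mpnt);
        }

        /*
         * Only here, long after the ptes went read-only, do other CPUs
         * lose their stale writable TLB entries.
         */
        flush_tlb_mm(oldmm);
        mmap_write_unlock(oldmm);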

So it's not even like the page table lock really synchronizes the page
table changes with the faults and the TLB flush. The mmap lock for
writing may do that too.

So there's a fairly large window for these "spurious" faults, where
the fault may have happened relatively much earlier, and things have
changed a *lot* by the time we actually got all our locks, and saw
"hey, I see nothing to change in the page tables, the fault is
spurious".

                Linus



