[ Adding linux-arch, just to make other architectures aware of this issue too.

We have a "flush_tlb_fix_spurious_fault()" thing to take care of the "TLB may contain stale entries, we can't take the same fault over and over again" situation.

On x86, it's a no-op, because x86 doesn't do that. x86 will re-walk the page tables - or possibly just always invalidate the faulting TLB entry - before taking a fault, so there can be no long-term stale TLBs.

Other architectures may or may not need an explicit "invalidate this TLB entry, because if you make no changes to the page tables, I'll otherwise just take this fault again. Forever". That is what "flush_tlb_fix_spurious_fault()" does.

NOTE! One reason for a stale TLB entry is that another CPU has already done the change, and is just _about_ to flush the TLB, but the hardware took the fault before it did so. The code is under the page table lock, but the hardware fault handler doesn't know or care.

So by the time we get to "flush_tlb_fix_spurious_fault()", we _will_ have synchronized (because we took the page table lock), and it's entirely possible that the architecture thus has nothing to do. Make it a no-op.

The other reason for a stale TLB entry is if you don't do the cross-CPU flush for "minor" events that don't matter (ie turning things dirty, things like that). Rather than flush the TLB, you _want_ the other CPU to take the fault in the (presumably unlikely) case that it had that old TLB entry in the first place, and thought _it_ needed to mark it dirty.

Anyway, there's a reason for "flush_tlb_fix_spurious_fault()", but not all architectures need it.

HOWEVER.

On architectures that don't explicitly define it, it falls back to a default of "flush_tlb_page()", which sounds sane, but in fact is completely insane and horribly horribly wrong.

It's completely insane and horribly wrong, because that fallback predates the "everybody is SMP" days. On UP, it's fine and sane. But on SMP, it's absolutely horrendously bad, because flush_tlb_fix_spurious_fault() should not do any cross-CPU invalidates.

It looks like arm64 got this nasty performance problem because of all this, with the cross-CPU invalidates being insanely expensive, and completely pointless - and easy to hit in some circumstances.

It looks like powerpc people at least thought about this, and only do it if there is a coprocessor. Which sounds a bit confused, but I don't know the rules.

It looks like a lot of others are ok mainly because they don't do SMP, or they don't have the kinds of loads where this matters.

But I wanted to cc the arch mailing list, to make people more aware of it. And we *should* change the default. It shouldn't be "flush_tlb_page()". It _should_ be "local_flush_tlb_page()", but we don't have that, although many architectures implement something like that as part of their SMP invalidation support ]

On Mon, Jul 27, 2020 at 11:04 AM Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:
>
> It looks Linus's patch has better data. It seems sane to me since
> Catalin's patch still needs flush TLB in the shared domain.

Well, my patch as posted never built at all, I think.

Looking back at that patch, I used FAULT_FLAG_RETRY. But that's not the correct name for any of the bits, so you must have fixed it.

Did you make it use "FAULT_FLAG_TRIED"? Because that's the right bit - don't flush if this is actually the second (or more) attempt.

But I'm a bit worried that you would have used one of the other bits (FAULT_FLAG_ALLOW_RETRY or FAULT_FLAG_RETRY_NOWAIT), and that would be wrong.
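A minimal sketch of the distinction being made about those bits, loosely following the pte fault slow path in mm/memory.c (the "unlock" label and vmf naming come from that function; this is illustrative, not the patch as actually posted):

	/* In the "ptep_set_access_flags() changed nothing" path: */

	/*
	 * FAULT_FLAG_TRIED is only set when we come back for a second
	 * (or later) attempt, so it can safely mean "we already went
	 * through this once, skip the spurious-fault flush".
	 *
	 * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT are set on
	 * the *first* attempt (they only say "you _may_ retry"), so
	 * testing either of them here would skip the flush on ordinary
	 * first-time faults - the wrong condition.
	 */
	if (vmf->flags & FAULT_FLAG_TRIED)
		goto unlock;		/* retried fault: skip the flush */

	/* the test we do today */
	if (vmf->flags & FAULT_FLAG_WRITE)
		flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);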
FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT get set on the first attempt to say "you _may_ retry" - they are already set the first time through, not on the retry.

That just shows how much I tested the patch I sent out. It was whitespace-damaged on purpose, but I still want to check.

The "FAULT_FLAG_TRIED" bit I believe is reasonable to test. That one literally says "I've gone through this once already, don't bother with spurious faults". But I don't think it triggers much in practice. We seldom actually retry faults - it needs a page that we actually start IO on (and dropped the mmap lock for) to happen. It wouldn't happen on the "turn existing page dirty" case, for example.

The "FAULT_FLAG_WRITE" bit is what we test right now. I think it's wrong. I think it is a "this happens to work" bit, and it cuts down on a lot of common cases by simply skipping something that might be needed but basically never is.

So I think a lot of this is dodgy. It doesn't matter on x86, and nobody cared, because x86 will always re-walk the page tables before taking an architectural fault (the same way it walks them for dirty/accessed bit updates - you could think of x86 as doing all the things everybody else does in software, it just does them in the hw walker micro-fault logic instead).

A local TLB invalidate of a single virtual address should be basically free. We're talking single-cycles kind of free.

The problem here isn't flush_tlb_fix_spurious_fault() itself, the problem here is that arm64 (and pretty much everybody else who uses the default fallback) does something horribly horribly wrong, and doesn't do the free version.

Linus
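As an illustration of the default-versus-override distinction described in the bracketed note above: the first definition below is the generic fallback that architectures inherit today (from the generic pgtable header), while the second is a hypothetical per-arch override - local_flush_tlb_page() is a placeholder name, since as noted there is no such generic helper, only arch-internal equivalents:

	/* The generic fallback that SMP architectures inherit today: */
	#ifndef flush_tlb_fix_spurious_fault
	#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
	#endif

	/*
	 * What an arch override along the lines suggested above might look
	 * like in its asm/pgtable.h: invalidate only the local CPU's TLB
	 * entry for the faulting address instead of broadcasting.
	 * local_flush_tlb_page() is a stand-in for whatever local-only
	 * single-page invalidate the architecture already has.
	 */
	#define flush_tlb_fix_spurious_fault(vma, address) \
		local_flush_tlb_page(vma, address)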