Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault

Yu Xu <xuyu@xxxxxxxxxxxxxxxxx> · Tue, 28 Jul 2020 18:07:44 +0800

On 7/28/20 5:39 PM, Catalin Marinas wrote:
On Tue, Jul 28, 2020 at 10:22:20AM +0100, Will Deacon wrote:
On Sat, Jul 25, 2020 at 04:58:43PM +0100, Catalin Marinas wrote:
On Fri, Jul 24, 2020 at 06:29:43PM -0700, Linus Torvalds wrote:
For any architecture that guarantees that a page fault will always
flush the old TLB entry for this kind of situation, that
flush_tlb_fix_spurious_fault() thing can be a no-op.

So that's why on x86, we just do

   #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)

and have no issues.

Note that it does *not* need to do any cross-CPU flushing or anything
like that. So it's actually wrong (I think) to have that default
fallback for

    #define flush_tlb_fix_spurious_fault(vma, address)
flush_tlb_page(vma, address)

because flush_tlb_page() is the serious "do cross CPU etc".

Does the arm64 flush_tlb_page() perhaps do the whole expensive
cross-CPU thing rather than the much cheaper "just local invalidate"
version?

I think it makes sense to have a local-only
flush_tlb_fix_spurious_fault(), but with ptep_set_access_flags() updated
to still issue the full broadcast TLBI. In addition, I added a minor
optimisation to avoid the TLB flush if the old pte was not accessible.
In a read-access fault case (followed by mkyoung), the TLB wouldn't have
cached a non-accessible pte (not sure it makes much difference to Yang's
case). Anyway, from ARMv8.1 onwards, the hardware handles the access
flag automatically.

I'm not sure the first dsb(nshst) below is of much use in this case. If
we got a spurious fault, the write to the pte happened on a different
CPU (IIUC, we shouldn't return to user with updated ptes without a TLB
flush on the same CPU). Anyway, we can refine this if it solves Yang's
performance regression.

-------------8<-----------------------

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index d493174415db..d1401cbad7d4 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -268,6 +268,20 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
  	dsb(ish);
  }
  
+static inline void local_flush_tlb_page(struct vm_area_struct *vma,
+					unsigned long uaddr)
+{
+	unsigned long addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+
+	dsb(nshst);
+	__tlbi(vale1, addr);
+	__tlbi_user(vale1, addr);
+	dsb(nsh);
+}
+
+#define flush_tlb_fix_spurious_fault(vma, address) \
+	local_flush_tlb_page(vma, address)

Why can't we just have flush_tlb_fix_spurious_fault() be a NOP on arm64?

Possibly, as long as any other optimisations only defer the TLB flushing
for relatively short time (the fault is transient, it will get a
broadcast TLBI eventually).

Either way, it's worth benchmarking the above patch but with
flush_tlb_fix_spurious_fault() a no-op (we still need flush_tlb_page()
in ptep_set_access_flags()). Xu, Yang, could you please give it a try?

If I understand correctly, this should do as good as the patch of Linux
or Yang in will-it-scale page_fault3 testcase, which both avoid doing
flush_tlb_fix_spurious_fault(), in case of FAULT_FLAG_TRIED.

Shouldn't we be concerned about data integrity if to have
flush_tlb_fix_spurious_fault() be a nop on arm64?

Thanks,
Yu


Given that the architecture prohibits the TLB from caching invalid entries,
then software access/dirty is fine without additional flushing.

The access fault is fine, the TLB has not cached the entry. For a dirty
fault, however, the TLB could cache a read-only mapping, so it does need
flushing. Question is, do we make the pte dirty anywhere without a
subsequent (broadcast) TLBI?

The only
problematic case I can think of is on the invalid->valid (i.e. map) path,
where we elide the expensive DSB instruction because (a) most CPUs have a
walker that can snoop the store buffer and (b) even if they don't, the
store buffer tends to drain by the time we get back to userspace. Even
if that was a problem, flush_tlb_fix_spurious_fault() wouldn't be the
right hook, since the DSB must occur on the CPU that did the pte update.

I guess the best a CPU can do is attempt the page table walk again, in
the hope that the write buffer on the other CPU eventually drains.