On 7/25/20 4:22 AM, Linus Torvalds wrote:
> On Fri, Jul 24, 2020 at 12:27 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>> It *may* make sense to say "ok, don't bother flushing the TLB if this
>> is a retry, because we already did that originally". MAYBE.
> That sounds wrong to me too.
>
> Maybe a *BIG* and understandable comment about why the
> FAULT_FLAG_TRIED case shouldn't actually do anything at all.
>
> But it does smell like the real issue is that the "else" case for
> ptep_set_access_flags() is simply wrong.
>
> Or maybe something else. But this thing from the changelog really
> raises my hackles:
>
>    "But it seems not necessary to modify those
>     bits again for #3 since they should be already set by the first page fault
>     try"
>
> since we just verified that we know _exactly_ what the pte is:
>
>         if (unlikely(!pte_same(*vmf->pte, entry))) {
>                 update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>                 goto unlock;
>         }
>
> so there is no "should be already set" case. We have 100% information
> about what the current state is.
>
> And if we don't make any changes, then that's exactly when
> ptep_set_access_flags() returns zero.
>
> So the real question is "why do we need the
> flush_tlb_fix_spurious_fault() thing".
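For context, the generic ptep_set_access_flags() in mm/pgtable-generic.c
returns whether the pte actually changed, so a zero return means the fault
left the pte exactly as it was. Roughly (a from-memory sketch; architectures
such as x86 and arm64 override this with their own versions):

        int ptep_set_access_flags(struct vm_area_struct *vma,
                                  unsigned long address, pte_t *ptep,
                                  pte_t entry, int dirty)
        {
                /* zero exactly when the new entry equals the old one */
                int changed = !pte_same(*ptep, entry);

                if (changed) {
                        set_pte_at(vma->vm_mm, address, ptep, entry);
                        flush_tlb_fix_spurious_fault(vma, address);
                }
                return changed;
        }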
Yes, some debugging information shows that the TLB flush does happen in
flush_tlb_fix_spurious_fault(), i.e., in the "else" case of
ptep_set_access_flags().
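One hypothetical way to observe that (not necessarily the instrumentation
actually used here) is a throwaway ratelimited printk next to the flush in
handle_pte_fault():

        } else {
                if (vmf->flags & FAULT_FLAG_WRITE) {
                        /* debug only: confirm the spurious-fault flush fires */
                        pr_info_ratelimited("spurious fault flush at %lx\n",
                                            vmf->address);
                        flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
                }
        }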
> We could say that we never need it at all for FAULT_FLAG_RETRY. That
> makes a lot of sense to me.
>
> So a patch that does something like the appended (intentionally
> whitespace-damaged) seems sensible.
I tested your patch on our aarch64 box with 128 online CPUs.

The testcase is page_fault3 from the will-it-scale suite
(https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c),
run with the number of processes or threads as the test parameter.
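In essence, each worker repeatedly maps a shared file, writes one byte to
every page, and unmaps, so each iteration takes one write fault (and, for a
MAP_SHARED file page, one ->page_mkwrite call) per page. A simplified sketch
from memory of the test source, with illustrative names, not the exact code:

        #include <assert.h>
        #include <sys/mman.h>

        #define MEMSIZE (128UL * 1024 * 1024)

        /*
         * fd is assumed to be an unlinked temp file already sized to
         * MEMSIZE; pagesize comes from sysconf(_SC_PAGESIZE).
         */
        static void worker(unsigned long long *iterations, int fd, long pagesize)
        {
                for (;;) {
                        char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
                        assert(c != MAP_FAILED);

                        /* one write fault per page: this is the
                         * mmap-write-munmap loop discussed below */
                        for (unsigned long i = 0; i < MEMSIZE; i += pagesize)
                                c[i] = 0;

                        munmap(c, MEMSIZE);
                        (*iterations)++;
                }
        }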
test        vanilla           patched
parameter   (89b15332af7c)    (Linus's patch)
1p              829299            772028  ( 94.44 %)
1t              998007            925583  ( 91.89 %)
32p           18916718          18712348  ( 98.86 %)
32t            2020918           1687744  ( 69.43 %)
64p           18965168          18982148  (100.05 %)
64t            1415404           1649234  ( 72.42 %)
96p           18949438          18866126  ( 99.68 %)
96t            1622876           1448309  ( 73.08 %)
128p          18926813          18423990  ( 97.53 %)
128t           1643109                 0  (  0.00 %)
There are two points to sum up.

1) The performance of page_fault3_process is restored, while the
performance of page_fault3_thread is about 80% of vanilla, except in
the 128-thread case.

2) In the 128-thread case, the test worker threads seem to get stuck,
making no progress in the mmap-write-munmap iterations until some time
later. The test result is 0 because only the first 16 samples are
counted, and they are all 0. This situation is easy to reproduce with a
large number of threads (not necessarily 128), and the stack of one
stuck thread is shown below.
[<0>] __switch_to+0xdc/0x150
[<0>] wb_wait_for_completion+0x84/0xb0
[<0>] __writeback_inodes_sb_nr+0x9c/0xe8
[<0>] try_to_writeback_inodes_sb+0x6c/0x88
[<0>] ext4_nonda_switch+0x90/0x98 [ext4]
[<0>] ext4_page_mkwrite+0x248/0x4c0 [ext4]
[<0>] do_page_mkwrite+0x4c/0x100
[<0>] do_fault+0x2ac/0x3e0
[<0>] handle_pte_fault+0xb4/0x258
[<0>] __handle_mm_fault+0x1d8/0x3a8
[<0>] handle_mm_fault+0x104/0x1d0
[<0>] do_page_fault+0x16c/0x490
[<0>] do_translation_fault+0x60/0x68
[<0>] do_mem_abort+0x58/0x100
[<0>] el0_da+0x24/0x28
[<0>] 0xffffffffffffffff
It seems quite normal, right? And I've run out of ideas.
Thanks,
Yu
> But note the XYZ in that commit. When do we actually have stale TLB
> entries? Do we actually do the lazy "avoid TLB flushing when loosening
> the rules" anywhere?
>
> I think that "when it's a write fault" is actually bogus. I could
> imagine that code pages could get the same issue. So the
> "FAULT_FLAG_RETRY" part of the check makes perfect sense to me, but
> the legacy "FAULT_FLAG_WRITE" case I'd actually want to document more.
>
> On x86, we never care about lazy faults. Page faulting will always
> update the TLB.
>
> On other architectures, I can see spurious faults happening either due
> to lazy reasons, or simply because another core is modifying the page
> table right now (ie the concurrent fault thing), but hasn't actually
> flushed yet.
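For reference, the generic fallback really does flush, while x86 defines the
hook away; a from-memory sketch of the relevant headers around v5.8 (treat
the exact file locations as assumptions):

        /* generic fallback (include/linux/pgtable.h): a spurious-fault
         * fixup is a real single-page TLB flush */
        #ifndef flush_tlb_fix_spurious_fault
        #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
        #endif

        /* arch/x86/include/asm/pgtable.h: x86 re-walks the page tables on
         * the fault itself, so the fixup can be a no-op */
        #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)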
> Can somebody flesh out the comment about the
> "spurious_protection_fault()" thing? Because something like this I
> wouldn't mind, but I'd like that comment to explain the
> FAULT_FLAG_WRITE part too.
>
>                   Linus
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ecad55103ad..9994c98d88c3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4163,6 +4163,26 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
>          return VM_FAULT_FALLBACK;
>  }
>  
> +/*
> + * If ptep_set_access_flags() returns zero, that means that
> + * it made no changes. Why did we get a fault?
> + *
> + * It might be a spurious protection fault because we at
> + * some point lazily didn't flush a TLB when we only loosened
> + * the protection rules. But it might also be because a
> + * concurrent fault on another CPU had already marked things
> + * young, and our young/dirty changes didn't change anything.
> + *
> + * The lazy TLB optimization only happens when we make things
> + * writable. See XYZ.
> + */
> +static inline bool spurious_protection_fault(unsigned int flags)
> +{
> +        if (flags & FAULT_FLAG_RETRY)
> +                return false;
> +        return flags & FAULT_FLAG_WRITE;
> +}
> +
>  /*
>   * These routines also need to handle stuff like marking pages dirty
>   * and/or accessed for architectures that don't do it in hardware (most
> @@ -4247,15 +4267,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>          if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
>                                    vmf->flags & FAULT_FLAG_WRITE)) {
>                  update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
> -        } else {
> -                /*
> -                 * This is needed only for protection faults but the arch code
> -                 * is not yet telling us if this is a protection fault or not.
> -                 * This still avoids useless tlb flushes for .text page faults
> -                 * with threads.
> -                 */
> -                if (vmf->flags & FAULT_FLAG_WRITE)
> -                        flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
> +        } else if (spurious_protection_fault(vmf->flags)) {
> +                flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
>          }
>  unlock:
>          pte_unmap_unlock(vmf->pte, vmf->ptl);