Re: Potential race in TLB flush batching?

On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@xxxxxxx> wrote:
> 
> >>> That is the same to a race whereby there is no batching mechanism and the
> >>> racing operation happens between a pte clear and a flush as ptep_clear_flush
> >>> is not atomic. All that differs is that the race window is a different size.
> >>> The application on CPU1 is buggy in that it may or may not succeed the write
> >>> but it is buggy regardless of whether a batching mechanism is used or not.
> >> 
> >> Thanks for your quick and detailed response, but I fail to see how it can
> >> happen without batching. Indeed, the PTE clear and flush are not "atomic",
> >> but without batching they are both performed under the page table lock
> >> (which is acquired in page_vma_mapped_walk and released in
> >> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
> >> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
> >> and change_pte_range, acquire the lock before accessing the PTEs.
> > 
> > I was primarily thinking in terms of memory corruption or data loss.
> > However, we are still protected although it's not particularly obvious why.
> > 
> > On the reclaim side, we are either reclaiming clean pages (which ignore
> > the accessed bit) or normal reclaim. If it's clean pages then any parallel
> > write must update the dirty bit at minimum. If it's normal reclaim then
> > the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
> > ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
> > the page in either case as part of page_referenced or try_to_unmap_one,
> > but clearing the accessed bit flushes the TLB.
> 
> Wait. Are you looking at the x86 arch function? The TLB is not flushed when
> the access bit is cleared:
> 
> int ptep_clear_flush_young(struct vm_area_struct *vma,
>                            unsigned long address, pte_t *ptep)
> {
>         /*
>          * On x86 CPUs, clearing the accessed bit without a TLB flush
>          * doesn't cause data corruption. [ It could cause incorrect
>          * page aging and the (mistaken) reclaim of hot pages, but the
>          * chance of that should be relatively low. ]
>          *                 
>          * So as a performance optimization don't flush the TLB when
>          * clearing the accessed bit, it will eventually be flushed by
>          * a context switch or a VM operation anyway. [ In the rare
>          * event of it not getting flushed for a long time the delay
>          * shouldn't really matter because there's no real memory
>          * pressure for swapout to react to. ]
>          */
>         return ptep_test_and_clear_young(vma, address, ptep);
> }
> 

I forgot this detail, thanks for correcting me.

> > 
> > On the mprotect side then, as the page was first accessed, clearing the
> > accessed bit incurs a TLB flush on the reclaim side before the second write.
> > That means any TLB entry that exists cannot have the accessed bit set so
> > a second write needs to update it.
> > 
> > While it's not clearly documented, I checked with hardware engineers
> > at the time that an update of the accessed or dirty bit even with a TLB
> > entry will check the underlying page tables and trap if it's not present
> > and the subsequent fault will then fail on sigsegv if the VMA protections
> > no longer allow the write.
> > 
> > So, on one side, if reclaim ignores the accessed bit, the pages are clean,
> > so any access will set the dirty bit and trap if the page was unmapped in
> > parallel. On the other side, if the accessed bit was set, clearing it
> > flushed the TLB; if it was not set, then the hardware needs to update it
> > and again will trap if the page was unmapped in parallel.
> 
> 
> Yet, even regardless to the TLB flush it seems there is still a possible
> race:
> 
> CPU0				CPU1
> ----				----
> ptep_clear_flush_young_notify
> ==> PTE.A==0
> 				access PTE
> 				==> PTE.A=1
> ptep_get_and_clear
> 				change mapping (and PTE)
> 				Use stale TLB entry

So I think you're right and this is a potential race. The first access can
be a read or a write as it's a problem if the mprotect call restricts
access.
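
For illustration only (this sketch is not from the original thread), the
user-visible symptom could be pictured with a minimal userspace program.
It assumes reclaim happens to unmap the page and batch the flush inside
the marked window, so it shows the expected behaviour rather than being
a reliable reproducer:

#include <sys/mman.h>

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 1;	/* populate a writable TLB entry */

	/*
	 * Window: reclaim clears the PTE via ptep_get_and_clear() and
	 * defers the flush with set_tlb_ubc_flush_pending().
	 */

	/* In the racy case mprotect() sees no present PTE and skips the flush */
	if (mprotect(p, 4096, PROT_READ))
		return 1;

	p[0] = 2;	/*
			 * Expected: SIGSEGV. With the race: the write can
			 * still complete through the stale RW TLB entry.
			 */
	return 0;
}

Either way there is no data corruption, but the application gets no
indication that the protection change has not fully taken effect on the
local CPU.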

> > If this guarantee from hardware was ever shown to be wrong or another
> > architecture wanted to add batching without the same guarantee then mprotect
> > would need to do a local_flush_tlb if no pages were updated by the mprotect
> > but right now, this should not be necessary.
> > 
> >> Can you please explain why you consider the application to be buggy?
> > 
> > I considered it a bit dumb to mprotect for READ/NONE and then try writing
> > the same mapping. However, it will behave as expected.
> 
> I don't think that this is the only scenario. For example, the application
> may create a new memory mapping of a different file using mmap at the same
> memory address that was used before, just as that memory is reclaimed.

That requires the existing mapping to be unmapped, which will flush the
TLB, and parallel mmap/munmap serialises on mmap_sem. The race appears to
be specific to mprotect, which avoids the TLB flush if no pages were updated.

> The
> application can (inadvertently) cause such a scenario by using MAP_FIXED.
> But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
> virtual address.
> 

With flushes in between.

> > Such applications are safe due to how the accessed bit is handled by the
> > software (flushes TLB if clearing young) and hardware (traps if updating
> > the accessed or dirty bit and the underlying PTE was unmapped even if
> > there is a TLB entry).
> 
> I don't think it is so. And I also think there are many additional
> potentially problematic scenarios.
> 

I believe it's specific to mprotect but can be handled by flushing the
local TLB when mprotect updates no pages. Something like this:

---8<---
mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim

Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held. He described
the race as follows:

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE.
This is not a data integrity issue as the TLB is always flushed before any
IO is queued or a page is freed, but it is a correctness issue as a process
restricting access with mprotect() may still be able to access the data
after the syscall returns due to a stale TLB entry. Handle this issue by
flushing the local TLB if reclaim is potentially batching TLB flushes and
mprotect altered no pages.

Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx # v4.4+
---
 mm/internal.h |  5 ++++-
 mm/mprotect.c | 12 ++++++++++--
 mm/rmap.c     | 20 ++++++++++++++++++++
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..9b7d1a597816 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void batched_unmap_protection_update(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void batched_unmap_protection_update(void)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..3de353d4b5fb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 				 dirty_accountable, prot_numa);
 	} while (pgd++, addr = next, addr != end);
 
-	/* Only flush the TLB if we actually modified any entries: */
-	if (pages)
+	/*
+	 * Only flush all TLBs if we actually modified any entries. If no
+	 * pages are modified, then call batched_unmap_protection_update
+	 * if the context is an mprotect() syscall.
+	 */
+	if (pages) {
 		flush_tlb_range(vma, start, end);
+	} else {
+		if (!prot_numa)
+			batched_unmap_protection_update();
+	}
 	clear_tlb_flush_pending(mm);
 
 	return pages;
diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e0ee96..02cb035e4ce6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 
 	return should_defer;
 }
+
+/*
+ * This is called after an mprotect update that altered no pages. Batched
+ * unmap releases the PTL before a flush occurs leaving a window where
+ * an mprotect that reduces access rights can still access the page after
+ * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
+ * the local TLB if mprotect updates no pages so that the caller of
+ * mprotect always gets the expected behaviour. It's overkill and unnecessary
+ * to flush all TLBs on behalf of a separate thread accessing the data that
+ * raced with both reclaim and mprotect, as there is no risk of data
+ * corruption and the exact timing of a parallel thread seeing a protection
+ * update without any serialisation on the application side is always uncertain.
+ */
+void batched_unmap_protection_update(void)
+{
+	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+	local_flush_tlb();
+	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
+}
+
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {

--