Mel Gorman <mgorman@xxxxxxx> wrote:

> On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@xxxxxxx> wrote:
>>
>>>>> That is the same as a race whereby there is no batching mechanism and the
>>>>> racing operation happens between a pte clear and a flush as ptep_clear_flush
>>>>> is not atomic. All that differs is that the race window is a different size.
>>>>> The application on CPU1 is buggy in that it may or may not succeed the write
>>>>> but it is buggy regardless of whether a batching mechanism is used or not.
>>>>
>>>> Thanks for your quick and detailed response, but I fail to see how it can
>>>> happen without batching. Indeed, the PTE clear and flush are not "atomic",
>>>> but without batching they are both performed under the page table lock
>>>> (which is acquired in page_vma_mapped_walk and released in
>>>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>>>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>>>> and change_pte_range, acquire the lock before accessing the PTEs.
>>>
>>> I was primarily thinking in terms of memory corruption or data loss.
>>> However, we are still protected although it's not particularly obvious why.
>>>
>>> On the reclaim side, we are either reclaiming clean pages (which ignore
>>> the accessed bit) or normal reclaim. If it's clean pages then any parallel
>>> write must update the dirty bit at minimum. If it's normal reclaim then
>>> the accessed bit is checked and, if cleared in try_to_unmap_one, it uses
>>> ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
>>> the page in either case as part of page_referenced or try_to_unmap_one but
>>> clearing the accessed bit flushes the TLB.
>>
>> Wait. Are you looking at the x86 arch function? The TLB is not flushed when
>> the access bit is cleared:
>>
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>> 			   unsigned long address, pte_t *ptep)
>> {
>> 	/*
>> 	 * On x86 CPUs, clearing the accessed bit without a TLB flush
>> 	 * doesn't cause data corruption. [ It could cause incorrect
>> 	 * page aging and the (mistaken) reclaim of hot pages, but the
>> 	 * chance of that should be relatively low. ]
>> 	 *
>> 	 * So as a performance optimization don't flush the TLB when
>> 	 * clearing the accessed bit, it will eventually be flushed by
>> 	 * a context switch or a VM operation anyway. [ In the rare
>> 	 * event of it not getting flushed for a long time the delay
>> 	 * shouldn't really matter because there's no real memory
>> 	 * pressure for swapout to react to. ]
>> 	 */
>> 	return ptep_test_and_clear_young(vma, address, ptep);
>> }
>
> I forgot this detail, thanks for correcting me.
>
>>> On the mprotect side then, as the page was first accessed, clearing the
>>> accessed bit incurs a TLB flush on the reclaim side before the second write.
>>> That means any TLB entry that exists cannot have the accessed bit set so
>>> a second write needs to update it.
>>>
>>> While it's not clearly documented, I checked with hardware engineers
>>> at the time that an update of the accessed or dirty bit even with a TLB
>>> entry will check the underlying page tables and trap if it's not present,
>>> and the subsequent fault will then fail on SIGSEGV if the VMA protections
>>> no longer allow the write.
>>>
>>> So, on one side, if ignoring the accessed bit during reclaim, the pages
>>> are clean so any access will set the dirty bit and trap if unmapped in
>>> parallel.
>>> On the other side, if the accessed bit was set, clearing it flushed the
>>> TLB, and if it was not set, then the hardware needs to update it and again
>>> will trap if unmapped in parallel.
>>
>> Yet, even regardless of the TLB flush it seems there is still a possible
>> race:
>>
>> CPU0                                CPU1
>> ----                                ----
>> ptep_clear_flush_young_notify
>> ==> PTE.A==0
>>                                     access PTE
>>                                     ==> PTE.A=1
>> ptep_get_and_clear
>> change mapping (and PTE)
>>                                     Use stale TLB entry
>
> So I think you're right and this is a potential race. The first access can
> be a read or a write as it's a problem if the mprotect call restricts
> access.
>
>>> If this guarantee from hardware was ever shown to be wrong, or another
>>> architecture wanted to add batching without the same guarantee, then mprotect
>>> would need to do a local_flush_tlb if no pages were updated by the mprotect,
>>> but right now this should not be necessary.
>>>
>>>> Can you please explain why you consider the application to be buggy?
>>>
>>> I considered it a bit dumb to mprotect for READ/NONE and then try writing
>>> the same mapping. However, it will behave as expected.
>>
>> I don't think that this is the only scenario. For example, the application
>> may create a new memory mapping of a different file using mmap at the same
>> memory address that was used before, just as that memory is reclaimed.
>
> That requires the existing mapping to be unmapped which will flush the
> TLB, and parallel mmap/munmap serialises on mmap_sem. The race appears to
> be specific to mprotect, which avoids the TLB flush if no pages were updated.

Why? As far as I see, the chain of calls during munmap is somewhat like:

do_munmap
=>unmap_region
==>tlb_gather_mmu
===>unmap_vmas
====>unmap_page_range
     ...
=====>zap_pte_range - this one batches only present PTEs
===>free_pgtables - this one is only if page-tables are removed
===>pte_free_tlb
==>tlb_finish_mmu
===>tlb_flush_mmu
====>tlb_flush_mmu_tlbonly

zap_pte_range will check if the PTE is pte_none and can find that it is - if
a concurrent try_to_unmap_one already cleared the PTE. In this case it will
not update the range of the mmu_gather and would not indicate that a flush
of the PTE is needed. Then, tlb_flush_mmu_tlbonly will find that no PTE was
cleared (tlb->end == 0) and avoid the flush, or may just flush fewer PTEs
than actually needed.

Due to this behavior, it raises a concern that in other cases as well, when
mmu_gather is used, a PTE flush may be missed.

>> The
>> application can (inadvertently) cause such a scenario by using MAP_FIXED.
>> But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
>> virtual address.
>
> With flushes in between.
>
>>> Such applications are safe due to how the accessed bit is handled by the
>>> software (flushes TLB if clearing young) and hardware (traps if updating
>>> the accessed or dirty bit and the underlying PTE was unmapped even if
>>> there is a TLB entry).
>>
>> I don't think it is so. And I also think there are many additional
>> potentially problematic scenarios.
>
> I believe it's specific to mprotect but can be handled by flushing the
> local TLB when mprotect updates no pages. Something like this:
>
> ---8<---
> mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim
>
> Nadav Amit identified a theoretical race between page reclaim and mprotect
> due to TLB flushes being batched outside of the PTL being held.
> He described the race as follows:
>
> CPU0                                CPU1
> ----                                ----
>                                     user accesses memory using RW PTE
>                                     [PTE now cached in TLB]
> try_to_unmap_one()
> ==> ptep_get_and_clear()
> ==> set_tlb_ubc_flush_pending()
>                                     mprotect(addr, PROT_READ)
>                                     ==> change_pte_range()
>                                     ==> [ PTE non-present - no flush ]
>
>                                     user writes using cached RW PTE
> ...
>
> try_to_unmap_flush()
>
> The same type of race exists for reads when protecting for PROT_NONE.
> This is not a data integrity issue as the TLB is always flushed before any
> IO is queued or a page is freed, but it is a correctness issue as a process
> restricting access with mprotect() may still be able to access the data
> after the syscall returns due to a stale TLB entry. Handle this issue by
> flushing the local TLB if reclaim is potentially batching TLB flushes and
> mprotect altered no pages.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx # v4.4+
> ---
>  mm/internal.h |  5 ++++-
>  mm/mprotect.c | 12 ++++++++++--
>  mm/rmap.c     | 20 ++++++++++++++++++++
>  3 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..9b7d1a597816 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
>  #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  void try_to_unmap_flush(void);
>  void try_to_unmap_flush_dirty(void);
> +void batched_unmap_protection_update(void);
>  #else
>  static inline void try_to_unmap_flush(void)
>  {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
>  static inline void try_to_unmap_flush_dirty(void)
>  {
>  }
> -
> +static inline void batched_unmap_protection_update()
> +{
> +}
>  #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
>
>  extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..3de353d4b5fb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>  					 dirty_accountable, prot_numa);
>  	} while (pgd++, addr = next, addr != end);
>
> -	/* Only flush the TLB if we actually modified any entries: */
> -	if (pages)
> +	/*
> +	 * Only flush all TLBs if we actually modified any entries. If no
> +	 * pages are modified, then call batched_unmap_protection_update
> +	 * if the context is a mprotect() syscall.
> +	 */
> +	if (pages) {
>  		flush_tlb_range(vma, start, end);
> +	} else {
> +		if (!prot_numa)
> +			batched_unmap_protection_update();
> +	}
>  	clear_tlb_flush_pending(mm);
>
>  	return pages;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e0ee96..02cb035e4ce6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>
>  	return should_defer;
>  }
> +
> +/*
> + * This is called after an mprotect update that altered no pages. Batched
> + * unmap releases the PTL before a flush occurs leaving a window where
> + * an mprotect that reduces access rights can still access the page after
> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> + * the local TLB if mprotect updates no pages so that the caller of
> + * mprotect always gets expected behaviour.
> + * It's overkill and unnecessary to flush all TLBs as a separate thread
> + * accessing the data that raced with both reclaim and mprotect as there
> + * is no risk of data corruption and the exact timing of a parallel thread
> + * seeing a protection update without any serialisation on the application
> + * side is always uncertain.
> + */
> +void batched_unmap_protection_update(void)
> +{
> +	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +	local_flush_tlb();
> +	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> +}
> +
>  #else
>  static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>  {

I don’t think this solution is enough. I am sorry for not providing a
solution, but I don’t see an easy one.

Thanks,
Nadav
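
To make the munmap concern above more concrete, here is a small stand-alone
model of the mmu_gather behaviour Nadav describes. It is an illustrative
sketch only, not kernel code; the struct and function names are invented for
this example. Only present PTEs grow the recorded flush range, so if a
concurrent try_to_unmap_one() already cleared every PTE in the range, the end
field stays zero and no flush is issued. The model only shows why the flush
is skipped, not whether the skip is actually unsafe, which is the open
question raised in the thread.

/*
 * Stand-alone model (illustration only, invented names) of the behaviour
 * described above: zap_pte_range() skips PTEs that are already pte_none(),
 * so a PTE cleared by a concurrent try_to_unmap_one() never widens the
 * gather range, and the final flush is skipped when the range stayed empty.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

struct gather_model {
	unsigned long start;
	unsigned long end;	/* end == 0 means "nothing to flush" */
};

static void gather_track(struct gather_model *tlb, unsigned long addr)
{
	if (tlb->end == 0 || addr < tlb->start)
		tlb->start = addr;
	if (addr + PAGE_SIZE > tlb->end)
		tlb->end = addr + PAGE_SIZE;
}

/* Models zap_pte_range(): only present PTEs are cleared and recorded. */
static void zap_range_model(struct gather_model *tlb,
			    const bool *pte_present, unsigned long nr)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		if (!pte_present[i])
			continue;	/* pte_none(): nothing recorded */
		gather_track(tlb, i * PAGE_SIZE);
	}
}

/* Models tlb_flush_mmu_tlbonly(): no flush if the range is empty. */
static void flush_model(const struct gather_model *tlb)
{
	if (tlb->end == 0) {
		puts("range empty - no TLB flush issued");
		return;
	}
	printf("flush TLB range [0x%lx, 0x%lx)\n", tlb->start, tlb->end);
}

int main(void)
{
	/* Every PTE already cleared by a concurrent unmap: flush is skipped. */
	bool pte_present[4] = { false, false, false, false };
	struct gather_model tlb = { 0, 0 };

	zap_range_model(&tlb, pte_present, 4);
	flush_model(&tlb);
	return 0;
}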
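
For the mprotect() side, here is a minimal user-space sketch, again not part
of the original thread and not a reproducer for the race (which needs reclaim
batching flushes in parallel). It only encodes the guarantee the patch is
meant to preserve: once mprotect(PROT_READ) has returned, a write to the
mapping must fault rather than succeed through a stale writable TLB entry.
The mapping size and the use of sigsetjmp to catch the fault are arbitrary
choices for illustration.

/*
 * Minimal sketch of the expected semantics: a write performed after
 * mprotect(PROT_READ) has returned must raise SIGSEGV. With the batching
 * race described above, the write could instead succeed via a stale TLB
 * entry on another CPU; this single-threaded program cannot trigger that.
 */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

static void segv_handler(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa;
	char *p;

	memset(&sa, 0, sizeof(sa));
	sigemptyset(&sa.sa_mask);
	sa.sa_handler = segv_handler;
	sigaction(SIGSEGV, &sa, NULL);

	p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	p[0] = 1;		/* populate the page and warm the TLB */

	if (mprotect(p, page, PROT_READ)) {
		perror("mprotect");
		return 1;
	}

	if (sigsetjmp(fault_env, 1) == 0) {
		p[0] = 2;	/* must fault: the mapping is read-only now */
		printf("BUG: write succeeded after mprotect(PROT_READ)\n");
		return 1;
	}

	printf("OK: write after mprotect(PROT_READ) faulted as expected\n");
	return 0;
}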