On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote:
> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote:
> > I don't think we should be particularly clever about this and instead just
> > flush the full mm if there is a risk that a parallel batching of flushing is
> > in progress resulting in a stale TLB entry being used. I think tracking mms
> > that are currently batching would end up being costly in terms of memory,
> > fairly complex, or both. Something like this?
> >
>
> mremap and madvise(DONTNEED) would also need to flush. Memory policies are
> fine as a move_pages call that hits the race will simply fail to migrate
> a page that is being freed and once migration starts, it'll be flushed so
> a stale access has no further risk. copy_page_range should also be ok as
> the old mm is flushed and the new mm cannot have entries yet.
>

Adding those results in

---8<---
mm, mprotect: Flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries

Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held. He described
the race as follows

	CPU0				CPU1
	----				----
					user accesses memory using RW PTE
					[PTE now cached in TLB]
	try_to_unmap_one()
	==> ptep_get_and_clear()
	==> set_tlb_ubc_flush_pending()
					mprotect(addr, PROT_READ)
					==> change_pte_range()
					==> [ PTE non-present - no flush ]
					user writes using cached RW PTE
	...
	try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such as
munmap, mremap and madvise.

For some operations like mprotect, it's not a data integrity issue but it
is a correctness issue. For munmap, it's potentially a data integrity issue
although the race is improbable as an munmap, mmap and return to userspace
must all complete in the window between reclaim dropping the PTL and
flushing the TLB. However, it's theoretically possible so handle this issue
by flushing the mm if reclaim is potentially currently batching TLB flushes.

Other instances where a flush is required for a present pte should be ok as
either page reference counts are elevated preventing parallel reclaim or, in
the case of page_mkclean, there isn't an obvious path that userspace could
take advantage of without using the operations that are guarded by this
patch. Other users such as gup just look at PTEs so a race with reclaim
should be ok. Huge page variants should be ok as they don't race with
reclaim. mincore only looks at PTEs. userfaultfd should also be ok as, if a
parallel reclaim takes place, it will either fault the page back in or read
some of the data before the flush occurs, triggering a fault.
Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx # v4.4+
---
 include/linux/mm_types.h |  4 ++++
 mm/internal.h            |  5 ++++-
 mm/madvise.c             |  1 +
 mm/memory.c              |  1 +
 mm/mprotect.c            |  3 +++
 mm/mremap.c              |  1 +
 mm/rmap.c                | 24 +++++++++++++++++++++++-
 7 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..ab8f7e11c160 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -495,6 +495,10 @@ struct mm_struct {
 	 */
 	bool tlb_flush_pending;
 #endif
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	/* See flush_tlb_batched_pending() */
+	bool tlb_flush_batched;
+#endif
 	struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..bf835a5a9854 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee4fc2c..75d2cffbe61d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -320,6 +320,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 
 	tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
diff --git a/mm/memory.c b/mm/memory.c
index bb11c474857e..b0c3d1556a94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	init_rss_vec(rss);
 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	pte = start_pte;
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..27135b91a4b4 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	if (!pte)
 		return 0;
 
+	/* Guard against parallel reclaim batching a TLB flush without PTL */
+	flush_tlb_batched_pending(vma->vm_mm);
+
 	/* Get target node for single threaded private VMAs */
 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
 	    atomic_read(&vma->vm_mm->mm_users) == 1)
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..6e3d857458de 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e0ee96..5a3e4ff9c4a0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 		return false;
 
 	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
 		should_defer = true;
+		mm->tlb_flush_batched = true;
+	}
 	put_cpu();
 
 	return should_defer;
 }
+
+/*
+ * Reclaim unmaps pages under the PTL but does not flush the TLB prior to
+ * releasing the PTL if TLB flushes are batched. It's possible for a parallel
+ * operation such as mprotect or munmap to race between reclaim unmapping
+ * the page and flushing the page. If this race occurs, it potentially allows
+ * access to data via a stale TLB entry. Tracking all mm's that have TLB
+ * batching pending would be expensive during reclaim so instead track
+ * whether TLB batching occurred in the past and if so then do a full mm
+ * flush here. This will cost one additional flush per reclaim cycle paid
+ * by the first operation at risk such as mprotect and munmap. This assumes
+ * it's called under the PTL to synchronise access to mm->tlb_flush_batched.
+ */
+void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+	if (mm->tlb_flush_batched) {
+		flush_tlb_mm(mm);
+		mm->tlb_flush_batched = false;
+	}
+}
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
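
For readers following the flag handshake, here is a minimal userspace model of
what the patch does. It is illustrative only: the toy_* names are invented, a
pthread mutex stands in for the PTL and a printf stands in for flush_tlb_mm().
It is not kernel code, just a sketch of the "reclaim sets the flag under the
lock, the first at-risk operation flushes and clears it" ordering.

/* Illustrative userspace model of the batching flag; not kernel code. */
#include <stdbool.h>
#include <stdio.h>
#include <pthread.h>

struct toy_mm {
	pthread_mutex_t ptl;	/* stands in for the page table lock */
	bool tlb_flush_batched;	/* set when reclaim defers a TLB flush */
};

/* Stand-in for flush_tlb_mm(): a full flush of the address space. */
static void toy_flush_tlb_mm(struct toy_mm *mm)
{
	(void)mm;
	printf("full mm flush\n");
}

/* Reclaim side: unmap under the PTL and record that the flush was deferred. */
static void toy_reclaim_unmap(struct toy_mm *mm)
{
	pthread_mutex_lock(&mm->ptl);
	/* ptep_get_and_clear() would happen here */
	mm->tlb_flush_batched = true;	/* as in set_tlb_ubc_flush_pending() */
	pthread_mutex_unlock(&mm->ptl);
	/* the real TLB flush happens later, outside the PTL */
}

/*
 * mprotect/munmap/mremap/madvise side: called with the PTL held before the
 * PTEs are touched. Only the first such operation after a batched reclaim
 * pays for the full flush; the flag is then cleared.
 */
static void toy_flush_tlb_batched_pending(struct toy_mm *mm)
{
	if (mm->tlb_flush_batched) {
		toy_flush_tlb_mm(mm);
		mm->tlb_flush_batched = false;
	}
}

int main(void)
{
	struct toy_mm mm = { PTHREAD_MUTEX_INITIALIZER, false };

	toy_reclaim_unmap(&mm);			/* reclaim batches a flush */

	pthread_mutex_lock(&mm.ptl);		/* e.g. change_pte_range() */
	toy_flush_tlb_batched_pending(&mm);	/* flushes once */
	toy_flush_tlb_batched_pending(&mm);	/* no-op: already flushed */
	pthread_mutex_unlock(&mm.ptl);

	return 0;
}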