Mel Gorman <mgorman@xxxxxxx> wrote:

> On Tue, Jul 11, 2017 at 10:23:50AM -0700, Andrew Lutomirski wrote:
>> On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
>>> On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote:
>>>> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
>>>>> +
>>>>> +/*
>>>>> + * This is called after an mprotect update that altered no pages. Batched
>>>>> + * unmap releases the PTL before a flush occurs, leaving a window where
>>>>> + * an mprotect that reduces access rights can still access the page after
>>>>> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
>>>>> + * the local TLB if mprotect updates no pages so that the caller of
>>>>> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
>>>>> + * flush all TLBs for a separate thread accessing the data that raced with
>>>>> + * both reclaim and mprotect, as there is no risk of data corruption and
>>>>> + * the exact timing of a parallel thread seeing a protection update without
>>>>> + * any serialisation on the application side is always uncertain.
>>>>> + */
>>>>> +void batched_unmap_protection_update(void)
>>>>> +{
>>>>> +	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>>>>> +	local_flush_tlb();
>>>>> +	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>> +}
>>>>> +
>>>>
>>>> What about remote CPUs?  You could get migrated right after mprotect()
>>>> or the inconsistency could be observed on another CPU.
>>>
>>> If it's migrated then it has also context switched so the TLB entry will
>>> be read for the first time.
>>
>> I don't think this is true.  On current kernels, if the other CPU is
>> running a thread in the same process, then there won't be a flush if
>> we migrate there.
>
> True, although that would also be covered if a flush happened
> unconditionally on mprotect (and arguably munmap) whenever a batched TLB
> flush took place in the past. It's heavier than it needs to be but it
> would be trivial to track and only incur a cost if reclaim touched any
> pages belonging to the process in the past, so it would be a relatively
> rare operation in the normal case. It could be forced by continually
> keeping a system under memory pressure while looping around mprotect but
> the worst case would be similar costs to never batching the flushing at
> all.
>
>> In -tip, slated for 4.13, if the other CPU is lazy
>> and is using the current process's page tables, it won't flush if we
>> migrate there and it's not stale (as determined by the real flush
>> APIs, not local_tlb_flush()).  With PCID, the kernel will aggressively
>> try to avoid the flush no matter what.
>
> I agree that PCID means that flushing needs to be more aggressive and
> there is not much point working on two solutions, so assume PCID is
> merged.
>
>>> If the entry is inconsistent for another CPU
>>> accessing the data then it'll potentially successfully access a page that
>>> was just mprotected but this is similar to simply racing with the call
>>> to mprotect itself. The timing isn't exact, nor does it need to be.
>>
>> Thread A:
>> mprotect(..., PROT_READ);
>> pthread_mutex_unlock();
>>
>> Thread B:
>> pthread_mutex_lock();
>> write to the mprotected address;
>>
>> I think it's unlikely that this exact scenario will affect a
>> conventional C program, but I can see various GC systems and sandboxes
>> being very surprised.
>
> Maybe.
> The window is massively wide as the mprotect, unlock, remote wakeup
> and write all need to complete between the unmap releasing the PTL and
> the flush taking place. Still, it is theoretically possible.

Consider also virtual machines. A VCPU may be preempted by the hypervisor
right after a PTE change and before the flush - so the time between the
two can be rather large.

>>>> I also really
>>>> don't like bypassing arch code like this.  The implementation of
>>>> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
>>>> is *very* different from what's there now, and it is not written in
>>>> the expectation that some generic code might call local_tlb_flush()
>>>> and expect any kind of coherency at all.
>>>
>>> Assuming that gets merged first, then the most straightforward approach
>>> would be to set up an arch_tlbflush_unmap_batch with just the local CPU
>>> set in the mask or something similar.
>>
>> With what semantics?
>
> I'm dropping this idea because the more I think about it, the more I think
> that a more general flush is needed if TLB batching was used in the past.
> We could keep active track of mm's with flushes pending but it would be
> fairly complex, costly in terms of keeping track of mm's needing flushing,
> and ultimately might be more expensive than just flushing immediately.
>
> If it's actually unfixable then, even though it's theoretical given the
> massive amount of activity that has to happen in a very short window, there
> would be no choice but to remove the TLB batching entirely, which would be
> very unfortunate given that IPIs during reclaim would be very high once
> again.
>
>>>> Would a better fix perhaps be to find a way to figure out whether a
>>>> batched flush is pending on the mm in question and flush it out if you
>>>> do any optimizations based on assuming that the TLB is in any respect
>>>> consistent with the page tables?  With the changes in -tip, x86 could,
>>>> in principle, supply a function to sync up its TLB state.  That would
>>>> require cross-CPU poking at state or an unconditional IPI (that might
>>>> end up not flushing anything), but either is doable.
>>>
>>> It's potentially doable if a field like tlb_flush_pending was added
>>> to mm_struct that is set when batching starts. I don't think there is
>>> a logical place where it can be cleared as, when the TLB gets flushed by
>>> reclaim, it can't rmap again to clear the flag. What would happen is that
>>> the first mprotect after any batching happened at any point in the past
>>> would have to unconditionally flush the TLB and then clear the flag. That
>>> would be a relatively minor hit and cover all the possibilities and should
>>> work unmodified with or without your series applied.
>>>
>>> Would that be preferable to you?
>>
>> I'm not sure I understand it well enough to know whether I like it.
>> I'm imagining an API that says "I'm about to rely on TLBs being
>> coherent for this mm -- make it so".
>
> I don't think we should be particularly clever about this and instead just
> flush the full mm if there is a risk that a parallel batching of flushes is
> in progress, resulting in a stale TLB entry being used. I think tracking
> mms that are currently batching would end up being costly in terms of
> memory, fairly complex, or both. Something like this?
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 45cdb27791a3..ab8f7e11c160 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -495,6 +495,10 @@ struct mm_struct {
>  	 */
>  	bool tlb_flush_pending;
>  #endif
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +	/* See flush_tlb_batched_pending() */
> +	bool tlb_flush_batched;
> +#endif
>  	struct uprobes_state uprobes_state;
>  #ifdef CONFIG_HUGETLB_PAGE
>  	atomic_long_t hugetlb_usage;
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..bf835a5a9854 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
>  #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  void try_to_unmap_flush(void);
>  void try_to_unmap_flush_dirty(void);
> +void flush_tlb_batched_pending(struct mm_struct *mm);
>  #else
>  static inline void try_to_unmap_flush(void)
>  {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
>  static inline void try_to_unmap_flush_dirty(void)
>  {
>  }
> -
> +static inline void flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> +}
>  #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
>
>  extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/memory.c b/mm/memory.c
> index bb11c474857e..b0c3d1556a94 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  	init_rss_vec(rss);
>  	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	pte = start_pte;
> +	flush_tlb_batched_pending(mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
>  		pte_t ptent = *pte;
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..27135b91a4b4 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	if (!pte)
>  		return 0;
>
> +	/* Guard against parallel reclaim batching a TLB flush without PTL */
> +	flush_tlb_batched_pending(vma->vm_mm);
> +
>  	/* Get target node for single threaded private VMAs */
>  	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
>  	    atomic_read(&vma->vm_mm->mm_users) == 1)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e0ee96..52633a124a4e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>  		return false;
>
>  	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
>  		should_defer = true;
> +		mm->tlb_flush_batched = true;
> +	}
>  	put_cpu();
>
>  	return should_defer;
>  }
> +
> +/*
> + * Reclaim unmaps pages in batches under the PTL but does not flush the
> + * TLB prior to releasing the PTL. It's possible a parallel mprotect or
> + * munmap can race between reclaim unmapping the page and flushing the
> + * TLB. If this race occurs, it potentially allows access to data via
> + * a stale TLB entry. Tracking all mm's that have TLB batching pending
> + * would be expensive during reclaim so instead track whether TLB batching
> + * occurred in the past and if so then do a full mm flush here. This will
> + * cost one additional flush per reclaim cycle paid by the first munmap or
> + * mprotect. This assumes it's called under the PTL to synchronise access
> + * to mm->tlb_flush_batched.
> + */
> +void flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> +	if (mm->tlb_flush_batched) {
> +		flush_tlb_mm(mm);
> +		mm->tlb_flush_batched = false;
> +	}
> +}
>  #else
>  static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>  {

I don't know exactly what invariant is being kept, so it is hard for me to
figure out all sorts of questions:

Should pte_accessible() return true if mm->tlb_flush_batched == true?

Does madvise_free_pte_range() need to be modified as well?

How will future code avoid breaking anything?
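
To make the user-visible contract concrete, here is a minimal userspace
sketch (not from the original thread; the names in it are illustrative) of
the Thread A / Thread B ordering Andy describes above. It cannot trigger
the reclaim race itself - that needs memory pressure to unmap the page with
a deferred flush at exactly the wrong moment - but it shows what a GC or
sandbox expects: once the protection change has been published under a
mutex, a write from another thread should fault rather than silently
succeed through a stale TLB entry.

/*
 * Illustrative only (not part of the proposed patch): thread A downgrades
 * a page to read-only and publishes that fact under a mutex; thread B then
 * writes to the page and expects a SIGSEGV. A stale TLB entry left behind
 * by batched reclaim could, in theory, let the write succeed instead.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int downgraded;	/* set once the page has been made read-only */

static void *thread_a(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	if (mprotect(page, getpagesize(), PROT_READ))
		perror("mprotect");
	downgraded = 1;
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *thread_b(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		if (downgraded) {
			/* Expected to fault; a stale writable TLB entry
			 * would allow this write to complete instead. */
			page[0] = 1;
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		pthread_mutex_unlock(&lock);
	}
}

int main(void)
{
	pthread_t a, b;

	page = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	page[0] = 0;	/* touch the page so a TLB entry can exist */

	pthread_create(&a, NULL, thread_a, NULL);
	pthread_create(&b, NULL, thread_b, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* Only reached if the write after the downgrade did not fault. */
	fprintf(stderr, "write after mprotect(PROT_READ) did not fault\n");
	return 0;
}

With a flush_tlb_batched_pending()-style check in change_pte_range() as in
the diff above, the mprotect() in thread_a should flush any entries left
stale by a batched reclaim before it returns, preserving the expectation
this sketch encodes.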