On Fri, Jul 14, 2017 at 04:16:44PM -0700, Nadav Amit wrote: > Mel Gorman <mgorman@xxxxxxx> wrote: > > > On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote: > >>> If reclaim is first, it'll take the PTL, set batched while a racing > >>> mprotect/munmap/etc spins. On release, the racing mprotect/munmmap > >>> immediately calls flush_tlb_batched_pending() before proceeding as normal, > >>> finding pte_none with the TLB flushed. > >> > >> This is the scenario I regarded in my example. Notice that when the first > >> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table > >> locks - allowing them to run concurrently. As a result > >> flush_tlb_batched_pending is executed before the PTE was cleared and > >> mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear > >> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE. > > > > If they hold different PTL locks, it means that reclaim and and the parallel > > munmap/mprotect/madvise/mremap operation are operating on different regions > > of an mm or separate mm's and the race should not apply or at the very > > least is equivalent to not batching the flushes. For multiple parallel > > operations, munmap/mprotect/mremap are serialised by mmap_sem so there > > is only one risky operation at a time. For multiple madvise, there is a > > small window when a page is accessible after madvise returns but it is an > > advisory call so it's primarily a data integrity concern and the TLB is > > flushed before the page is either freed or IO starts on the reclaim side. > > I think there is some miscommunication. Perhaps one detail was missing: > > CPU0 CPU1 > ---- ---- > should_defer_flush > => mm->tlb_flush_batched=true > flush_tlb_batched_pending (another PT) > => flush TLB > => mm->tlb_flush_batched=false > > Access PTE (and cache in TLB) > ptep_get_and_clear(PTE) > ... > > flush_tlb_batched_pending (batched PT) > [ no flush since tlb_flush_batched=false ] > use the stale PTE > ... > try_to_unmap_flush > > There are only 2 CPUs and both regard the same address-space. CPU0 reclaim a > page from this address-space. Just between setting tlb_flush_batch and the > actual clearing of the PTE, the process on CPU1 runs munmap and calls > flush_tlb_batched_pending. This can happen if CPU1 regards a different > page-table. > If both regard the same address-space then they have the same page table so there is a disconnect between the first and last sentence in your paragraph above. On CPU 0, the setting of tlb_flush_batched and ptep_get_and_clear is also reversed as the sequence is pteval = ptep_get_and_clear(mm, address, pvmw.pte); set_tlb_ubc_flush_pending(mm, pte_dirty(pteval)); Additional barriers should not be needed as within the critical section that can race, it's protected by the lock and with Andy's code, there is a full barrier before the setting of tlb_flush_batched. With Andy's code, there may be a need for a compiler barrier but I can rethink about that and add it during the backport to -stable if necessary. So the setting happens after the clear and if they share the same address space and collide then they both share the same PTL so are protected from each other. If there are separate address spaces using a shared mapping then the same race does not occur. > > +/* > > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when > > + * this returns. Using the current mm tlb_gen means the TLB will be up to date > > + * with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a flush has > > + * happened since then the IPIs will still be sent but the actual flush is > > + * avoided. Unfortunately the IPIs are necessary as the per-cpu context > > + * tlb_gens cannot be safely accessed. > > + */ > > +void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) > > +{ > > + int cpu; > > + struct flush_tlb_info info = { > > + .mm = mm, > > + .new_tlb_gen = atomic64_read(&mm->context.tlb_gen), > > + .start = 0, > > + .end = TLB_FLUSH_ALL, > > + }; > > + > > + cpu = get_cpu(); > > + > > + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { > > + VM_WARN_ON(irqs_disabled()); > > + local_irq_disable(); > > + flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); > > + local_irq_enable(); > > + } > > + > > + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) > > + flush_tlb_others(mm_cpumask(mm), &info); > > + > > + put_cpu(); > > +} > > + > > It is a shame that after Andy collapsed all the different flushing flows, > you create a new one. How about squashing this untested one to yours? > The patch looks fine to be but when writing the patch, I wondered why the original code disabled preemption before inc_mm_tlb_gen. I didn't spot the reason for it but given the importance of properly synchronising with switch_mm, I played it safe. However, this should be ok on top and maintain the existing sequences diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 248063dc5be8..cbd8621a0bee 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -404,6 +404,21 @@ void native_flush_tlb_others(const struct cpumask *cpumask, */ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33; +static void flush_tlb_mm_common(struct flush_tlb_info *info, int cpu) +{ + struct mm_struct *mm = info->mm; + + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { + VM_WARN_ON(irqs_disabled()); + local_irq_disable(); + flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); + local_irq_enable(); + } + + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) + flush_tlb_others(mm_cpumask(mm), info); +} + void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { @@ -429,15 +444,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, info.end = TLB_FLUSH_ALL; } - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { - VM_WARN_ON(irqs_disabled()); - local_irq_disable(); - flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); - local_irq_enable(); - } - - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), &info); + flush_tlb_mm_common(&info, cpu); put_cpu(); } @@ -515,7 +522,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) */ void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) { - int cpu; struct flush_tlb_info info = { .mm = mm, .new_tlb_gen = atomic64_read(&mm->context.tlb_gen), @@ -523,17 +529,7 @@ void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) .end = TLB_FLUSH_ALL, }; - cpu = get_cpu(); - - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { - VM_WARN_ON(irqs_disabled()); - local_irq_disable(); - flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); - local_irq_enable(); - } - - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), &info); + flush_tlb_mm_common(&info, get_cpu()); put_cpu(); } -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>