On Mon, Jan 11, 2016 at 10:25 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Jan 11, 2016 at 03:42:40AM -0800, tip-bot for Andy Lutomirski wrote:
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>  #endif
>>  	cpumask_set_cpu(cpu, mm_cpumask(next));
>>
>> -	/* Re-load page tables */
>> +	/*
>> +	 * Re-load page tables.
>> +	 *
>> +	 * This logic has an ordering constraint:
>> +	 *
>> +	 *  CPU 0: Write to a PTE for 'next'
>> +	 *  CPU 0: load bit 1 in mm_cpumask.  if nonzero, send IPI.
>> +	 *  CPU 1: set bit 1 in next's mm_cpumask
>> +	 *  CPU 1: load from the PTE that CPU 0 writes (implicit)
>> +	 *
>> +	 * We need to prevent an outcome in which CPU 1 observes
>> +	 * the new PTE value and CPU 0 observes bit 1 clear in
>> +	 * mm_cpumask.  (If that occurs, then the IPI will never
>> +	 * be sent, and CPU 0's TLB will contain a stale entry.)
>> +	 *
>> +	 * The bad outcome can occur if either CPU's load is
>> +	 * reordered before that CPU's store, so both CPUs much
>
> s/much/must/ ?

Indeed.  Is this worth a follow-up patch?

>
>> +	 * execute full barriers to prevent this from happening.
>> +	 *
>> +	 * Thus, switch_mm needs a full barrier between the
>> +	 * store to mm_cpumask and any operation that could load
>> +	 * from next->pgd.  This barrier synchronizes with
>> +	 * remote TLB flushers.  Fortunately, load_cr3 is
>> +	 * serializing and thus acts as a full barrier.
>> +	 *
>> +	 */
>>  	load_cr3(next->pgd);
>> +
>>  	trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>>
>>  	/* Stop flush ipis for the previous mm */
>> @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>  		 * schedule, protecting us from simultaneous changes.
>>  		 */
>>  		cpumask_set_cpu(cpu, mm_cpumask(next));
>> +
>>  		/*
>>  		 * We were in lazy tlb mode and leave_mm disabled
>>  		 * tlb flush IPI delivery. We must reload CR3
>>  		 * to make sure to use no freed page tables.
>> +		 *
>> +		 * As above, this is a barrier that forces
>> +		 * TLB repopulation to be ordered after the
>> +		 * store to mm_cpumask.
>
> somewhat confused by this comment, cpumask_set_cpu() is a LOCK BTS, that
> is already fully ordered.  There are more than enough barriers here.

v1 had cpumask_set_cpu; smp_mb__after_atomic, which is more portable
and generates identical code.  I don't have a real preference for which
barrier we should consider to be the important one.

>
>>  		 */
>>  		load_cr3(next->pgd);
>>  		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 8ddb5d0..8f4cc3d 100644
>
>
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>
>> @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>
>>  	if (!current->mm) {
>>  		leave_mm(smp_processor_id());
>> +
>> +		/* Synchronize with switch_mm. */
>> +		smp_mb();
>> +
>>  		goto out;
>>  	}
>
>>  	} else {
>>  		leave_mm(smp_processor_id());
>> +
>> +		/* Synchronize with switch_mm. */
>> +		smp_mb();
>> +	}
>>  	}
>
> The alternative is making leave_mm() unconditionally imply a full
> barrier.  I've not looked at other sites using it though.

For a quick fix, I preferred the more self-contained change.

--Andy
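
[The interleaving in the commit's comment is the classic store-buffering
litmus test.  Below is a minimal user-space C sketch of that pattern, not
the kernel code: the names pte and cpumask_bit are illustrative stand-ins
for the PTE write and the mm_cpumask bit, and the seq_cst fences model the
roles played in the kernel by smp_mb(), LOCK BTS, and the serializing
load_cr3.]

#include <pthread.h>
#include <stdio.h>

static int pte;         /* stand-in for the PTE that CPU 0 writes    */
static int cpumask_bit; /* stand-in for this CPU's bit in mm_cpumask */
static int r0, r1;      /* what each side observed                   */

static void *flusher(void *arg)  /* models CPU 0 in the comment */
{
	__atomic_store_n(&pte, 1, __ATOMIC_RELAXED);          /* write PTE */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);              /* smp_mb()  */
	r0 = __atomic_load_n(&cpumask_bit, __ATOMIC_RELAXED); /* IPI test  */
	return NULL;
}

static void *switcher(void *arg) /* models CPU 1, i.e. switch_mm */
{
	__atomic_store_n(&cpumask_bit, 1, __ATOMIC_RELAXED);  /* set bit    */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);              /* load_cr3   */
	r1 = __atomic_load_n(&pte, __ATOMIC_RELAXED);         /* TLB fill   */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, flusher, NULL);
	pthread_create(&b, NULL, switcher, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/*
	 * With both fences present, r0 == 0 && r1 == 0 is forbidden:
	 * at least one side must observe the other's store, so either
	 * the IPI gets sent or the switching CPU loads the new PTE.
	 */
	printf("r0=%d r1=%d\n", r0, r1);
	return 0;
}

Dropping either fence makes the r0 == 0 && r1 == 0 outcome (no IPI sent,
stale TLB entry survives) possible even on x86, since TSO permits a store
to be reordered after a later load.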