> I came up with a kernel patch that I *think* may reproduce the problem
> with enough iterations. Userspace only needs to enable LAM, so I think
> the selftest can be enough to trigger it.
>
> However, there is no hardware with LAM at my disposal, and IIUC I cannot
> use QEMU without KVM to run a kernel with LAM. I was planning to do more
> testing before sending a non-RFC version, but apparently I cannot do
> any testing beyond building at this point (including reproducing) :/
>
> Let me know how you want to proceed. I can send a non-RFC v1 based on
> the feedback I got on the RFC, but it will only be build tested.
>
> For the record, here is the diff that I *think* may reproduce the bug:

Okay, I was actually able to run _some_ testing with the diff below on
_a kernel_, and I hit the BUG_ON pretty quickly. If I did things
correctly, hitting this BUG_ON means that even though we have an
outdated LAM mask in our CR3, we will not update CR3 because the TLB is
up-to-date.

I can work on a v1 now with the IPI approach that Andy suggested. A
small kink is that we may still hit the BUG_ON with that fix, but in
that case it should be fine not to write CR3, because once we re-enable
interrupts we will receive the IPI and fix it up. IOW, the diff below
will still BUG with the proposed fix, but that should be okay.

One thing I am not clear about with the IPI approach: if we use
mm_cpumask() to limit the IPI scope, we need to make sure that we read
mm_lam_cr3_mask() *after* we update the cpumask in switch_mm_irqs_off(),
which makes me think we'll need a barrier (and Andy said we want to
avoid those in this path). But looking at the code I see:

		/*
		 * Start remote flushes and then read tlb_gen.
		 */
		if (next != &init_mm)
			cpumask_set_cpu(cpu, mm_cpumask(next));
		next_tlb_gen = atomic64_read(&next->context.tlb_gen);

This code doesn't have a barrier. How do we make sure the read actually
happens after the write?
If no barrier is needed there, then I think we can similarly just read
the LAM mask after cpumask_set_cpu().

> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 33b268747bb7b..c37a8c26a3c21 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -750,8 +750,25 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
> 
>  #define LAM_U57_BITS 6
> 
> +static int kthread_fn(void *_mm)
> +{
> +	struct mm_struct *mm = _mm;
> +
> +	/*
> +	 * Wait for LAM to be enabled then schedule. Hopefully we will context
> +	 * switch directly into the task that enabled LAM due to CPU pinning.
> +	 */
> +	kthread_use_mm(mm);
> +	while (!test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags));
> +	schedule();
> +	return 0;
> +}
> +
>  static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
>  {
> +	struct task_struct *kthread_task;
> +	int kthread_cpu;
> +
>  	if (!cpu_feature_enabled(X86_FEATURE_LAM))
>  		return -ENODEV;
> 
> @@ -782,10 +799,22 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
>  		return -EINVAL;
>  	}
> 
> +	/* Pin the task to the current CPU */
> +	set_cpus_allowed_ptr(current, cpumask_of(smp_processor_id()));
> +
> +	/* Run a kthread on another CPU and wait for it to start */
> +	kthread_cpu = cpumask_next_wrap(smp_processor_id(), cpu_online_mask, 0, false);
> +	kthread_task = kthread_run_on_cpu(kthread_fn, mm, kthread_cpu, "lam_repro_kthread");
> +	while (!task_is_running(kthread_task));
> +
>  	write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
>  	set_tlbstate_lam_mode(mm);
>  	set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);
> 
> +	/* Move the task to the kthread CPU */
> +	set_cpus_allowed_ptr(current, cpumask_of(kthread_cpu));
> +
>  	mmap_write_unlock(mm);
> 
>  	return 0;
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 51f9f56941058..3afb53f1a1901 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -593,7 +593,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>  		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>  		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
>  				next_tlb_gen)
> -			return;
> +			BUG_ON(new_lam != tlbstate_lam_cr3_mask());
> 
>  		/*
>  		 * TLB contents went out of date while we were in lazy