On Thu, Oct 21, 2021 at 09:25:43PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 21, 2021 at 03:39:35PM -0300, Marcelo Tosatti wrote:
> > Peter,
> >
> > static __always_inline void arch_exit_to_user_mode(void)
> > {
> >         mds_user_clear_cpu_buffers();
> > }
> >
> > /**
> >  * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
> >  *
> >  * Clear CPU buffers if the corresponding static key is enabled
> >  */
> > static __always_inline void mds_user_clear_cpu_buffers(void)
> > {
> >         if (static_branch_likely(&mds_user_clear))
> >                 mds_clear_cpu_buffers();
> > }
> >
> > We were discussing how to perform objtool style validation
> > that no code after the check for
>
> I'm not sure what the point of the above is... Were you trying to ask
> for validation that nothing runs after the mds_user_clear_cpu_buffer()?
>
> That isn't strictly true today, there's lockdep code after it. I can't
> recall why that order is as it is though.
>
> Pretty much everything in noinstr is magical, we just have to think
> harder there (and possibly start writing more comments there).

mds_user_clear_cpu_buffers() happens after sync_core() in your patchset,
if I am not mistaken.

> > > +	/* NMI happens here and must still do/finish CT_WORK_n */
> > > +	sync_core();
> >
> > But after the discussion with you, it seems doing the TLB checking
> > and (also sync_core) checking very late/very early on exit/entry
> > makes things easier to review.
>
> I don't know about late, it must happen *very* early in entry. The
> sync_core() must happen before any self-modifying code gets called
> (static_branch, static_call, etc..) with possible exception of the
> context_tracking static_branch.
>
> The TLBi must also happen super early, possibly while still on the
> entry stack (since the task stack is vmap'ed).

But will it ever be freed/remapped from other CPUs while the task
is running?

> We currently don't run C
> code on the entry stack, that needs quite a bit of careful work to make
> happen.

I was thinking of coding this in ASM, right after (as early as possible)
the write that switches to the kernel CR3:

Kernel entry:
-------------

	cpu = smp_processor_id();

	if (isolation_enabled(cpu)) {
		reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
		if (reqs & CPU_REQ_FLUSH_TLB)
			flush_tlb_all();
		if (reqs & CPU_REQ_SYNC_CORE)
			sync_core();
	}

Exit to userspace (as close as possible to the CR3 write with the user pagetable pointer):
-----------------

	cpu = smp_processor_id();

	if (isolation_enabled(cpu)) {
		atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);
	}

Do you think that is a bad idea (in ASM, not C)?

And the request side can be in C:

Request side:
-------------

	int targetcpu;
	struct percpudata *pcpudata = per_cpu_ptr(&percpudata, targetcpu);
	int old_state, new_state;

	do {
		old_state = atomic_read(&pcpudata->user_kernel_state);

		/* in kernel mode ? */
		if (!(old_state & IN_USER_MODE)) {
			smp_call_function_single(targetcpu, request_fn, NULL, 1);
			break;
		}

		new_state = old_state | CPU_REQ_FLUSH_TLB;	// (or CPU_REQ_INV_ICACHE)
	} while (atomic_cmpxchg(&pcpudata->user_kernel_state, old_state, new_state) != old_state);

(This needs logic to protect against the atomic_cmpxchg always failing,
but that shouldn't be difficult; see the fuller sketch of the request
side below.)

> > Can then use a single atomic variable with USER/KERNEL state and cmpxchg
> > loops.
>
> We're not going to add an atomic to context tracking. There is one, we
> just got to extract/share it with RCU.
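Whichever atomic word ends up being shared with RCU, here is a fuller
sketch of the request side in C. This is a rough sketch only: all the
names (ct_isol_state, CPU_REQ_*, ct_work_fn, request_remote_flush) are
made up, x86 is assumed for __flush_tlb_all(), and the bounded retry is
just one way to keep the cmpxchg loop from spinning forever:

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <asm/tlbflush.h>

#define IN_KERNEL_MODE		0x0
#define IN_USER_MODE		0x1
#define CPU_REQ_FLUSH_TLB	0x2
#define CPU_REQ_SYNC_CORE	0x4

static DEFINE_PER_CPU(atomic_t, ct_isol_state);

/* IPI fallback: do the deferred work directly on the target CPU. */
static void ct_work_fn(void *info)
{
	__flush_tlb_all();
}

/*
 * Either queue the request in the remote CPU's state word (it gets
 * processed on that CPU's next kernel entry), or fall back to an IPI
 * if the CPU is currently in the kernel.  The retry bound makes a
 * constantly failing cmpxchg degrade to the IPI path.
 */
static void request_remote_flush(int cpu)
{
	atomic_t *state = per_cpu_ptr(&ct_isol_state, cpu);
	int old, new, tries;

	for (tries = 0; tries < 16; tries++) {
		old = atomic_read(state);

		/* In kernel mode: deferring is pointless, send the IPI. */
		if (!(old & IN_USER_MODE))
			break;

		new = old | CPU_REQ_FLUSH_TLB;
		if (atomic_cmpxchg(state, old, new) == old)
			return;		/* queued for next kernel entry */
	}

	smp_call_function_single(cpu, ct_work_fn, NULL, 1);
}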
Again, to avoid kernel TLB flushes you'd have to ensure:

kernel entry:

	instrA	addr1,addr2,addr3
	instrB	addr2,addr3,addr4	<--- that no address here has its TLB
	instrC	addr5,addr6,addr7	     entries modified and flushed

	reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE);
	if (reqs & CPU_REQ_FLUSH_TLB)
		flush_tlb_all();

kernel exit:

	atomic_or(IN_USER_MODE, &percpudata->user_kernel_state);

	instrA	addr1,addr2,addr3
	instrB	addr2,addr3,addr4	<--- that no address here has its TLB
					     entries modified and flushed

This could be made conditional on "task isolated mode" being enabled
(though it would be better if it didn't have to be).
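In C-like form (reusing the made-up names from the sketch above, plus
<asm/sync_core.h> for sync_core(); the real thing would almost certainly
have to be ASM, or at least be guaranteed to touch nothing but the entry
text, the entry stack and this per-CPU word before the xchg), the
entry/exit hooks would be roughly:

/* Illustration only, not meant to actually run this early as C. */
static __always_inline void isol_kernel_enter(void)
{
	int reqs;

	/*
	 * Everything accessed before this xchg must have stable TLB
	 * entries: only the entry text, the entry stack and this
	 * per-CPU word.
	 */
	reqs = atomic_xchg(this_cpu_ptr(&ct_isol_state), IN_KERNEL_MODE);

	if (reqs & CPU_REQ_FLUSH_TLB)
		flush_tlb_all();
	if (reqs & CPU_REQ_SYNC_CORE)
		sync_core();
}

static __always_inline void isol_exit_to_user(void)
{
	/*
	 * From here on remote CPUs queue requests instead of sending
	 * IPIs; nothing below this store may depend on TLB entries
	 * that a queued flush is meant to kill.
	 */
	atomic_or(IN_USER_MODE, this_cpu_ptr(&ct_isol_state));
}

The windows in the comments are exactly the "no address here has its TLB
entries modified and flushed" windows above.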