On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
> On 17/01/25 17:11, Uladzislau Rezki wrote:
> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> >> On 14/01/25 19:16, Jann Horn wrote:
> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@xxxxxxxxxx> wrote:
> >> >> vunmap()s issued from housekeeping CPUs are a relatively common source of
> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> >> flush_tlb_kernel_range() IPIs.
> >> >>
> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> >> range, these IPIs could be deferred until their next kernel entry.
> >> >>
> >> >> Deferral vs early entry danger zone
> >> >> ===================================
> >> >>
> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> >> and then accessed in early entry code.
> >> >
> >> > In other words, it needs a guarantee that no vmalloc allocations that
> >> > have been created in the vmalloc region while the CPU was idle can
> >> > then be accessed during early entry, right?
> >>
> >> I'm not sure if that would be a problem (not an mm expert, please do
> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> >> deferred anyway.
> >>
> >> So after vmapping something, I wouldn't expect isolated CPUs to have
> >> invalid TLB entries for the newly vmapped page.
> >>
> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> >> stale TLB entries can and will remain on isolated CPUs, up until they
> >> execute the deferred flush themselves (IOW for the entire duration of the
> >> "danger zone").
> >>
> >> Does that make sense?
> >>
> > Probably I am missing something and need to have a look at your patches,
> > but how do you guarantee that no one maps the same area that you defer
> > the TLB flush for?
>
> That's the cool part: I don't :')
>
Indeed, that sounds unsafe :) Then we simply should not free such areas.

> For deferring instruction patching IPIs, I (well, Josh really) managed to
> get instrumentation to back me up and catch any problematic area.
>
> I looked into getting something similar for vmalloc region access in
> .noinstr code, but I didn't get anywhere. I even tried using emulated
> watchpoints on QEMU to watch the whole vmalloc range, but that went about
> as well as you could expect.
>
> That left me with staring at code. AFAICT the only vmap'd thing that is
> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
> itself cannot be freed until the task exits - thus can't be subject to
> invalidation when a task is entering kernelspace.
>
> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
>
As noted before, we already defer TLB flushing for vmalloc. We have a lazy
threshold which can be exposed over sysfs for tuning (if you need it), so
we can add that.

--
Uladzislau Rezki
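
P.S. To make the "lazy threshold" point more concrete, below is a much
simplified sketch of how the existing deferral in mm/vmalloc.c behaves
(purge-list handling, locking and the drain work itself are omitted, so
do not read it as verbatim kernel code): unmapped areas are only
accumulated, and the flush_tlb_kernel_range() plus the actual freeing
happen once the amount of lazily freed memory crosses lazy_max_pages().
That threshold is what could be made tunable over sysfs; such a knob
does not exist today.

/* Amount of lazily "freed" memory allowed to pile up before draining. */
static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);

static void drain_vmap_area_work(struct work_struct *work);
static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);

/*
 * Threshold: roughly 32MB worth of pages, scaled by log2 of the number
 * of online CPUs. This is the value a sysfs knob could override.
 */
static unsigned long lazy_max_pages(void)
{
	unsigned int log = fls(num_online_cpus());

	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}

/*
 * vunmap path: the area is queued for a later purge instead of being
 * flushed and freed right away. Only once the accumulated amount
 * crosses the threshold is the drain work kicked, which performs one
 * flush_tlb_kernel_range() over the purged range and frees the areas.
 */
static void free_vmap_area_noflush(struct vmap_area *va)
{
	unsigned long nr_lazy;

	nr_lazy = atomic_long_add_return(va_size(va) >> PAGE_SHIFT,
					 &vmap_lazy_nr);

	/* ... queue @va on the purge list ... */

	if (nr_lazy > lazy_max_pages())
		schedule_work(&drain_vmap_work);
}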