On 20/01/25 12:15, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
>> On 17/01/25 17:11, Uladzislau Rezki wrote:
>> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@xxxxxxxxxx> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >>
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >>
>> > Probably I am missing something and need to have a look at your patches,
>> > but how do you guarantee that no-one maps the same area that you defer
>> > for TLB flushing?
>> >
>> That's the cool part: I don't :')
>>
> Indeed, sounds unsafe :) Then we just do not need to free areas.
>
>> For deferring instruction patching IPIs, I (well Josh really) managed to
>> get instrumentation to back me up and catch any problematic area.
>>
>> I looked into getting something similar for vmalloc region access in
>> .noinstr code, but I didn't get anywhere. I even tried using emulated
>> watchpoints on QEMU to watch the whole vmalloc range, but that went about
>> as well as you could expect.
>>
>> That left me with staring at code. AFAICT the only vmap'd thing that is
>> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
>> itself cannot be freed until the task exits - thus can't be subject to
>> invalidation when a task is entering kernelspace.
>>
>> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
>>
> As noted before, we defer flushing for vmalloc. We have a lazy-threshold
> which can be exposed (if you need it) over sysfs for tuning. So, we can
> add it.
>
In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
single userspace application that will never enter the kernel, unless
forced to by some interference (e.g. an IPI sent from a housekeeping CPU).

Increasing the lazy threshold would unfortunately only delay the
interference - housekeeping CPUs are free to run whatever, so they will
eventually cause the lazy threshold to be hit and IPI all the CPUs,
including the isolated/NOHZ_FULL ones.

I was thinking maybe we could subdivide the vmap space into two regions
with their own thresholds, but a task may allocate/vmap stuff while on a
HK CPU and be moved to an isolated CPU afterwards, and also I still don't
have any strong guarantee about what accesses an isolated CPU can do in
its early entry code :(

> --
> Uladzislau Rezki