On Wed, May 17, 2023 at 06:32:25PM +0200, Thomas Gleixner wrote:
> On Wed, May 17 2023 at 14:15, Uladzislau Rezki wrote:
> > On Wed, May 17, 2023 at 01:58:44PM +0200, Thomas Gleixner wrote:
> >> Keeping executable mappings around until some other flush happens is
> >> obviously neither a brilliant idea nor correct.
> >>
> > It avoids blocking a caller in vfree() by deferring the freeing into
> > a workqueue context. At least I got the feeling that "your task" that
> > does vfree() blocks for an unacceptable time. That can happen only if
> > it performs a VM_FLUSH_RESET_PERMS freeing (other freeing is deferred):
> >
> > <snip>
> >     if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> >         vm_reset_perms(vm);
> > <snip>
> >
> > in which case vfree() can take some time instead of returning to the
> > user ASAP. Is that your issue? I am not saying that TLB flushing takes
> > time; in this case holding the mutex can also take time.
>
> This is absolutely not the problem at all. This comes via do_exit() and
> I explained already here:
>
>   https://lore.kernel.org/all/871qjg8wqe.ffs@tglx
>
> what made us look into this, and I'm happy to quote myself for your
> convenience:
>
>   "The scenario which made us look is that CPU0 is housekeeping and CPU1
>   is isolated for RT.
>
>   Now CPU0 does that flush nonsense and the RT workload on CPU1 suffers
>   because the compute time is suddenly factor 2-3 larger, IOW, it misses
>   the deadline. That means a one off event is already a problem."
>
> So it does not matter at all how long the operations on CPU0 take. The
> only thing which matters is how much these operations affect the
> workload on CPU1.
>
Thanks. I focused on your first email, which did not mention that second
part, i.e. that you have a housekeeping CPU and another one for the RT
activity.

>
> That made me look into this coalescing code. I understand why you want
> to batch and coalesce and rather do a rare full tlb flush than sending
> gazillions of IPIs.
>
Your issue has no connection with the merging, but the place you looked
at was the correct one :)

>
> But that creates a policy at the core code which does not leave any
> decision to make for the architecture, whether it's worth to do full or
> single flushes. That's what I worried about and not about the question
> whether that free takes 1ms or 10us. That's a completely different
> debate.
>
> Whether that list based flush turns out to be the better solution or
> not, has still to be decided by deeper analysis.
>
I had a look at how per-VA TLB flushing behaves on x86_64 under heavy
load:

<snip>
commit 776a33ed63f0f15b5b3f6254bcb927a45e37298d (HEAD -> master)
Author: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
Date:   Fri May 19 11:35:35 2023 +0200

    mm: vmalloc: Flush TLB per-va

    Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9683573f1225..6ff95f3d1fa1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1739,15 +1739,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	if (unlikely(list_empty(&local_purge_list)))
 		goto out;
 
-	start = min(start,
-		list_first_entry(&local_purge_list,
-			struct vmap_area, list)->va_start);
+	/* OK. A per-cpu wants to flush an exact range. */
+	if (start != ULONG_MAX)
+		flush_tlb_kernel_range(start, end);
 
-	end = max(end,
-		list_last_entry(&local_purge_list,
-			struct vmap_area, list)->va_end);
+	/* Flush per-VA. */
+	list_for_each_entry(va, &local_purge_list, list)
+		flush_tlb_kernel_range(va->va_start, va->va_end);
 
-	flush_tlb_kernel_range(start, end);
 	resched_threshold = lazy_max_pages() << 1;
 
 	spin_lock(&free_vmap_area_lock);
<snip>
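For reference, here is a simplified sketch of what flush_tlb_kernel_range()
ends up doing on x86 (paraphrased from arch/x86/mm/tlb.c; the details vary
between kernel versions, so take it as an approximation rather than the
exact code):

<snip>
/*
 * Approximation of the x86 implementation: every call is one IPI
 * round to all online CPUs via on_each_cpu(), whether the whole TLB
 * or only the requested range gets flushed.
 */
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	/* Wide ranges: a full flush on every CPU is cheaper. */
	if (end == TLB_FLUSH_ALL ||
	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
		on_each_cpu(do_flush_tlb_all, NULL, 1);
	} else {
		struct flush_tlb_info info = {
			.start	= start,
			.end	= end,
		};

		/* Range-based flush, still one IPI round per call. */
		on_each_cpu(do_kernel_range_flush, &info, 1);
	}
}
<snip>

With the per-VA loop above, that IPI round happens once per vmap_area
instead of once per purge batch, which is what asm_sysvec_call_function
reflects in the profile below.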
There are at least two observations:

1. asm_sysvec_call_function adds an extra 12% in terms of cycles:

# per-VA TLB flush
-   12.00%  native_queued_spin_lock_slowpath
   - 11.90% asm_sysvec_call_function
      - sysvec_call_function
           __sysvec_call_function
         - __flush_smp_call_function_queue
            - 1.64% __flush_tlb_all
                 native_flush_tlb_global
                 native_write_cr4

# default
     0.18%  0.16%  [kernel]  [k] asm_sysvec_call_function

2. The memory footprint grows (under heavy load), because the TLB flush
plus the extra lazy-list scan takes longer.

Hope this is somehow useful for you.

--
Uladzislau Rezki
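As an illustration of the middle ground Thomas hints at (leaving room to
decide between single and full flushes), here is a purely hypothetical
sketch; the helper name, the nr_vas parameter and the cut-off of 32 are
made up for the example and are not part of the posted patch:

<snip>
/*
 * Hypothetical variant: flush per-VA only while the purge list is
 * short, otherwise fall back to one coalesced flush of the whole
 * span. The cut-off (or the whole decision) could also be left to
 * the architecture.
 */
static void flush_purge_list(struct list_head *purge_list,
			     unsigned int nr_vas)
{
	struct vmap_area *va;
	unsigned long start = ULONG_MAX, end = 0;

	if (nr_vas > 32) {	/* arbitrary example cut-off */
		list_for_each_entry(va, purge_list, list) {
			start = min(start, va->va_start);
			end = max(end, va->va_end);
		}
		flush_tlb_kernel_range(start, end);
		return;
	}

	list_for_each_entry(va, purge_list, list)
		flush_tlb_kernel_range(va->va_start, va->va_end);
}
<snip>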