On Wed, May 17, 2023 at 01:58:44PM +0200, Thomas Gleixner wrote:
> On Wed, May 17 2023 at 13:26, Uladzislau Rezki wrote:
> > On Tue, May 16, 2023 at 07:04:34PM +0200, Thomas Gleixner wrote:
> >> The proposed flush_tlb_kernel_vas(list, num_pages) mechanism achieves:
> >>
> >>  1) It batches multiple ranges to _one_ invocation
> >>
> >>  2) It lets the architecture decide based on the number of pages
> >>     whether it does a tlb_flush_all() or a flush of individual ranges.
> >>
> >> Whether the architecture uses IPIs or flushes only locally and the
> >> hardware propagates that is completely irrelevant.
> >>
> >> Right now any coalesced range, which is huge due to massive holes,
> >> takes decision #2 away.
> >>
> >> If you want to flush individual VAs from the core vmalloc code then
> >> you lose #1, as the aggregated number of pages might justify a
> >> tlb_flush_all().
> >>
> >> That's a pure architecture decision and all the core code needs to do
> >> is to provide appropriate information and not some completely bogus
> >> request to flush 17312759359 pages, i.e. a ~64.5 TB range, while in
> >> reality there are exactly _three_ distinct pages to flush.
> >>
> > 1.
> >
> > I think both cases (the logic) should be moved into arch code, so that
> > the decision how to flush - either fully, if that is supported, or page
> > by page, which requires chasing the list - is _not_ made by the vmalloc
> > code.
>
> Which is exactly what my patch does, no?
>
> > 2.
> >
> > void vfree(const void *addr)
> > {
> > 	...
> > 	if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> > 		vm_reset_perms(vm);  <----
> > 	...
> > }
> >
> > so all purged areas are drained in the caller's context, i.e. the
> > caller is blocked until the drain, including the flushing, is done.
> > I am not sure why it is done from the caller's context.
> >
> > IMHO, it should be deferred the same way as we do in:
> >
> > static void free_vmap_area_noflush(struct vmap_area *va)
>
> How is that avoiding the problem? It just defers it to some point in
> the future. There is no guarantee that batching will be large enough to
> justify a full flush.
>
> > if I do not miss the point why vfree() has to do it directly.
>
> Keeping executable mappings around until some other flush happens is
> obviously neither a brilliant idea nor correct.
>
It avoids blocking the caller of vfree() by deferring the freeing into a
workqueue context. At least I got the feeling that "your task", the one
doing the vfree(), blocks for an unacceptable time. That can only happen
if it frees a VM_FLUSH_RESET_PERMS area (all other freeing is deferred):

<snip>
	if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
		vm_reset_perms(vm);
<snip>

In that case vfree() can take some time instead of returning to the user
asap. Is that your issue? I am not saying it is only the TLB flushing that
takes time; in this case waiting on the mutex can also take time.

--
Uladzislau Rezki
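
To make point 1. above concrete, here is a rough, untested sketch of what
such an arch-side flush_tlb_kernel_vas() could look like. The exact
signature, the list layout and the threshold value are assumptions made
for illustration only; this is not the actual patch under discussion:

<snip>
#include <linux/list.h>
#include <linux/vmalloc.h>
#include <asm/tlbflush.h>

/* Hypothetical cut-off, in pages, above which a full flush is cheaper. */
#define TLB_FLUSH_ALL_THRESHOLD		512

void flush_tlb_kernel_vas(struct list_head *va_list, unsigned long num_pages)
{
	struct vmap_area *va;

	/* Case 2: the aggregate is large enough to justify a full flush. */
	if (num_pages > TLB_FLUSH_ALL_THRESHOLD) {
		flush_tlb_all();
		return;
	}

	/* Case 1: walk the batched ranges and flush each distinct VA range. */
	list_for_each_entry(va, va_list, list)
		flush_tlb_kernel_range(va->va_start, va->va_end);
}
<snip>

The core vmalloc code would then only hand over the list of distinct
vmap_areas plus the aggregated page count, and the full-flush vs.
per-range decision stays entirely with the architecture.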