On Fri, May 19, 2023 at 06:32:42PM +0200, Thomas Gleixner wrote:
> On Fri, May 19 2023 at 17:14, Uladzislau Rezki wrote:
> > On Fri, May 19, 2023 at 04:56:53PM +0200, Thomas Gleixner wrote:
> >> > +	/* Flush per-VA. */
> >> > +	list_for_each_entry(va, &local_purge_list, list)
> >> > +		flush_tlb_kernel_range(va->va_start, va->va_end);
> >> >
> >> > -	flush_tlb_kernel_range(start, end);
> >> >  	resched_threshold = lazy_max_pages() << 1;
> >>
> >> That's completely wrong, really.
> >>
> > Absolutely. That is why we do not flush a range per-VA ;-) I provided
> > the data just to show what happens if we do it!
>
> Seriously, you think you need to demonstrate that to me? Did you
> actually read what I wrote?
>
>   "I understand why you want to batch and coalesce and rather do a rare
>    full tlb flush than sending gazillions of IPIs."
>
Yes, I read it. Since I also mentioned IPIs without providing any data,
I provided it later, just in case. I shared my observation, that is all.

> > A per-VA flushing works when a system is not capable of doing a full
> > flush, so it has to do it page by page. In this scenario we should
> > bypass the ranges (not mapped) which lie between the VAs in a
> > purge-list.
>
> ARM32 has a full flush as does x86. Just ARM32 does not have a cutoff
> for a full flush in flush_tlb_kernel_range(). That's easily fixable, but
> the underlying problem remains.
>
> The point is that coalescing the VA ranges blindly is also fundamentally
> wrong:
>
>   start1 = 0x95c8d000  end1 = 0x95c8e000
>   start2 = 0xf08a1000  end2 = 0xf08a5000
>
>   --> start = 0x95c8d000  end = 0xf08a5000
>
> So this ends up with:
>
>   if (end - start > flush_all_threshold)
>   	ipi_flush_all();
>   else
>   	ipi_flush_range();
>
> With the above example this ends up in flush_all(), but a flush_vas()
> as I demonstrated with the list approach (ignore the storage problem,
> which is fixable) results in:
>
>   if (total_nr_pages > flush_all_threshold)
>   	ipi_flush_all();
>   else
>   	ipi_flush_vas();
>
> and that IPI flushes 3 pages instead of taking out the whole TLB, which
> results in a 1% gain on that machine. Not massive, but still.
>
> The blind coalescing is also wrong if the resulting range is not
> gigantic but below the flush_all_threshold. Let's assume a threshold of
> 32 pages:
>
>   start1 = 0xf0800000  end1 = 0xf0802000   2 pages
>   start2 = 0xf081e000  end2 = 0xf0820000   2 pages
>
>   --> start = 0xf0800000  end = 0xf0820000
>
> Because this does not qualify for a full flush, and it should not, it
> ends up flushing 32 pages one by one instead of flushing exactly four.
>
> IOW, the existing code is fully biased towards full flushes, which is
> wrong.
>
> Just because this does not show up in your performance numbers on some
> enterprise workload does not make it more correct.
>
Usually we flush the lazy areas once the lazy_max_pages() threshold is
reached. There are exceptions: when an allocation fails we drain the
areas (if there are any); the second is the per-cpu allocator; and the
last one is vm_reset_perms(), when a "vm" is marked as
VM_FLUSH_RESET_PERMS.

As for your description, I totally see the problem.

--
Uladzislau Rezki