Re: Excessive TLB flush ranges

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Fri, 19 May 2023 18:32:42 +0200

On Fri, May 19 2023 at 17:14, Uladzislau Rezki wrote:
> On Fri, May 19, 2023 at 04:56:53PM +0200, Thomas Gleixner wrote:
>> > +       /* Flush per-VA. */
>> > +       list_for_each_entry(va, &local_purge_list, list)
>> > +               flush_tlb_kernel_range(va->va_start, va->va_end);
>> >
>> > -       flush_tlb_kernel_range(start, end);
>> >         resched_threshold = lazy_max_pages() << 1;
>> 
>> That's completely wrong, really.
>> 
> Absolutely. That is why we do not flush a range per-VA ;-) I provided the
> data just to show what happens if we do it!

Seriously, you think you need to demonstrate that to me? Did you
actually read what I wrote?

   "I understand why you want to batch and coalesce and rather do a rare
    full tlb flush than sending gazillions of IPIs."

> A per-VA flushing works when a system is not capable of doing a full
> flush, so it has to do it page by page. In this scenario we should
> bypass ranges(not mapped) which are between VAs in a purge-list.

ARM32 has a full flush as does x86. Just ARM32 does not have a cutoff
for a full flush in flush_tlb_kernel_range(). That's easily fixable, but
the underlying problem remains.

The point is that coalescing the VA ranges blindly is also fundamentally
wrong:

       start1 = 0x95c8d000 end1 = 0x95c8e000
       start2 = 0xf08a1000 end2 = 0xf08a5000

-->    start  = 0x95c8d000 end  = 0xf08a5000

So this ends up with:

   if (end - start > flush_all_threshold)
        ipi_flush_all();
   else
        ipi_flush_range();

So with the above example this ends up with flush_all(), whereas a
flush_vas() as I demonstrated with the list approach (ignore the storage
problem, which is fixable) results in

   if (total_nr_pages > flush_all_threshold)
        ipi_flush_all();
   else
        ipi_flush_vas();

and that IPI flushes 3 pages instead of taking out the whole TLB, which
results in a 1% gain on that machine. Not massive, but still.
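
A minimal user-space sketch of the two decisions compared here, not
kernel code. It assumes 4 KiB pages and a threshold of 32 pages (both
assumptions), and struct va_range, decide_coalesced() and
decide_per_va() are hypothetical names that only stand in for the purge
list handling:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for one entry of the purge list. */
struct va_range {
	unsigned long start;
	unsigned long end;
};

enum flush_kind { FLUSH_ALL, FLUSH_RANGE, FLUSH_VAS };

/* Blind coalescing: decide on the span from lowest start to highest end. */
static enum flush_kind decide_coalesced(const struct va_range *v, size_t n,
					unsigned long threshold_pages)
{
	unsigned long start, end;
	size_t i;

	assert(n > 0);
	start = v[0].start;
	end = v[0].end;
	for (i = 1; i < n; i++) {
		if (v[i].start < start)
			start = v[i].start;
		if (v[i].end > end)
			end = v[i].end;
	}
	return ((end - start) >> PAGE_SHIFT) > threshold_pages ?
		FLUSH_ALL : FLUSH_RANGE;
}

/* List approach: decide on the number of pages actually covered by VAs. */
static enum flush_kind decide_per_va(const struct va_range *v, size_t n,
				     unsigned long threshold_pages)
{
	unsigned long total_nr_pages = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total_nr_pages += (v[i].end - v[i].start) >> PAGE_SHIFT;

	return total_nr_pages > threshold_pages ? FLUSH_ALL : FLUSH_VAS;
}
```

Fed the two ranges from the example and the assumed threshold,
decide_coalesced() picks FLUSH_ALL while decide_per_va() picks
FLUSH_VAS.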

The blind coalescing is also wrong if the resulting range is not gigantic
but below the flush_all_threshold. Let's assume a threshold of 32 pages.

       start1 = 0xf0800000 end1 = 0xf0802000           2 pages
       start2 = 0xf081e000 end2 = 0xf0820000           2 pages

-->    start  = 0xf0800000 end  = 0xf0820000

Because this does not qualify for a full flush, and it should not,
this ends up flushing 32 pages one by one instead of flushing exactly
four.
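
The arithmetic can be checked with a small user-space sketch, assuming
4 KiB pages. struct va_range, pages_in_span() and pages_in_vas() are
hypothetical names for illustration, not existing kernel helpers:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for one entry of the purge list. */
struct va_range {
	unsigned long start;
	unsigned long end;
};

/* Pages touched when flushing the coalesced [min start, max end) span. */
static unsigned long pages_in_span(const struct va_range *v, size_t n)
{
	unsigned long start, end;
	size_t i;

	assert(n > 0);
	start = v[0].start;
	end = v[0].end;
	for (i = 1; i < n; i++) {
		if (v[i].start < start)
			start = v[i].start;
		if (v[i].end > end)
			end = v[i].end;
	}
	return (end - start) >> PAGE_SHIFT;
}

/* Pages touched when flushing each VA on its own. */
static unsigned long pages_in_vas(const struct va_range *v, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += (v[i].end - v[i].start) >> PAGE_SHIFT;
	return total;
}
```

For the two ranges above, pages_in_span() yields 32 and pages_in_vas()
yields 4, so the coalesced range stays at the assumed threshold and gets
flushed page by page.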

IOW, the existing code is fully biased towards full flushes, which is
wrong.

Just because this does not show up in your performance numbers on some
enterprise workload does not make it more correct.

Thanks,

        tglx