On Fri, May 19 2023 at 12:01, Uladzislau Rezki wrote:
> On Wed, May 17, 2023 at 06:32:25PM +0200, Thomas Gleixner wrote:
>> That made me look into this coalescing code. I understand why you want
>> to batch and coalesce and rather do a rare full tlb flush than sending
>> gazillions of IPIs.
>>
> Your issues has no connections with merging. But the place you looked
> was correct :)

I'm not talking about merging. I'm talking about coalescing ranges:

        start = 0x95c8d000
        end   = 0x95c8e000

plus the VA from the list which has

        start = 0xf08a1000
        end   = 0xf08a5000

which results in a flush range of:

        start = 0x95c8d000
        end   = 0xf08a5000

No?

> @@ -1739,15 +1739,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>         if (unlikely(list_empty(&local_purge_list)))
>                 goto out;
>
> -       start = min(start,
> -               list_first_entry(&local_purge_list,
> -                       struct vmap_area, list)->va_start);
> +       /* OK. A per-cpu wants to flush an exact range. */
> +       if (start != ULONG_MAX)
> +               flush_tlb_kernel_range(start, end);
>
> -       end = max(end,
> -               list_last_entry(&local_purge_list,
> -                       struct vmap_area, list)->va_end);
> +       /* Flush per-VA. */
> +       list_for_each_entry(va, &local_purge_list, list)
> +               flush_tlb_kernel_range(va->va_start, va->va_end);
>
> -       flush_tlb_kernel_range(start, end);
>         resched_threshold = lazy_max_pages() << 1;

That's completely wrong, really.

For the above case, which is easy enough to reproduce, this ends up
doing TWO IPIs on x86, which is worse than ONE IPI that ends up with a
flush all.

Aside from that, if there are two VAs in the purge list and both are
over the threshold for doing a full flush, then you end up with TWO
flush-all IPIs in a row, which completely defeats the purpose of this
whole exercise.

As I demonstrated with the list approach, for the above scenario this
avoids a full flush and needs only one IPI. Nadav's observation vs. the
list aside, this is clearly better than what you are proposing here.
The IPI cost on x86 is just as bad as the full barriers on arm[64].

Thanks,

        tglx