On Fri, May 19, 2023 at 06:32:42PM +0200, Thomas Gleixner wrote:
> On Fri, May 19 2023 at 17:14, Uladzislau Rezki wrote:
> > On Fri, May 19, 2023 at 04:56:53PM +0200, Thomas Gleixner wrote:
> >> > +	/* Flush per-VA. */
> >> > +	list_for_each_entry(va, &local_purge_list, list)
> >> > +		flush_tlb_kernel_range(va->va_start, va->va_end);
> >> >
> >> > -	flush_tlb_kernel_range(start, end);
> >> >  	resched_threshold = lazy_max_pages() << 1;
> >>
> >> That's completely wrong, really.
> >>
> > Absolutely. That is why we do not flush a range per-VA ;-) I provided
> > the data just to show what happens if we do it!
>
> Seriously, you think you need to demonstrate that to me? Did you
> actually read what I wrote?
>
>   "I understand why you want to batch and coalesce and rather do a rare
>    full tlb flush than sending gazillions of IPIs."
>
Yes, I read it. Since I also mentioned IPIs without providing any data,
I provided it later, just in case. I shared my observation, that is all.

> > A per-VA flushing works when a system is not capable of doing a full
> > flush, so it has to do it page by page. In this scenario we should
> > bypass the ranges (not mapped) which lie between the VAs in a
> > purge-list.
>
> ARM32 has a full flush as does x86. Just ARM32 does not have a cutoff
> for a full flush in flush_tlb_kernel_range(). That's easily fixable, but
> the underlying problem remains.
>
> The point is that coalescing the VA ranges blindly is also fundamentally
> wrong:
>
>   start1 = 0x95c8d000  end1 = 0x95c8e000
>   start2 = 0xf08a1000  end2 = 0xf08a5000
>
>   --> start = 0x95c8d000  end = 0xf08a5000
>
> So this ends up with:
>
>   if (end - start > flush_all_threshold)
>   	ipi_flush_all();
>   else
>   	ipi_flush_range();
>
> With the above example this ends up in flush_all(), but a flush_vas()
> as I demonstrated with the list approach (ignore the storage problem,
> which is fixable) results in:
>
>   if (total_nr_pages > flush_all_threshold)
>   	ipi_flush_all();
>   else
>   	ipi_flush_vas();
>
> and that IPI flushes 3 pages instead of taking out the whole TLB, which
> results in a 1% gain on that machine. Not massive, but still.
>
> The blind coalescing is also wrong if the resulting range is not
> gigantic but below the flush_all_threshold. Let's assume a threshold of
> 32 pages:
>
>   start1 = 0xf0800000  end1 = 0xf0802000   2 pages
>   start2 = 0xf081e000  end2 = 0xf0820000   2 pages
>
>   --> start = 0xf0800000  end = 0xf0820000
>
> Because this does not qualify for a full flush, and it should not, it
> ends up flushing 32 pages one by one instead of flushing exactly four.
>
> IOW, the existing code is fully biased towards full flushes, which is
> wrong.
>
> Just because this does not show up in your performance numbers on some
> enterprise workload does not make it more correct.
>
Usually we flush the lazy areas once the lazy_max_pages() threshold is
reached. There are exceptions: when an allocation fails we drain the
areas (if there are any); the second is the per-cpu allocator; and the
last one is vm_reset_perms(), when a "vm" is marked as
VM_FLUSH_RESET_PERMS.

As for your description, I totally see the problem.

--
Uladzislau Rezki