Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()

Nikita Yushchenko <nikita.yushchenko@xxxxxxxxxxxxx> · Sat, 18 Dec 2021 17:31:43 +0300

This allows archs to optimize it, by
freeing multiple tables in a single release_pages() call. This is
faster than individual put_page() calls, especially with memcg
accounting enabled.

Could we quantify "faster"?  There's a non-trivial amount of code being
added here and it would be nice to back it up with some cold-hard numbers.

I currently don't have numbers for this patch taken alone. This patch originates from work done some 
years ago to reduce cost of memory accounting, and x86-only version of this patch was in 
virtuozzo/openvz kernel since then. Other patches from that work have been upstreamed, but this one was 
missed.

Still it's obvious that release_pages() shall be faster that a loop calling put_page() - isn't that 
exactly the reason why release_pages() exists and is different from a loop calling put_page()?

  static void __tlb_remove_table_free(struct mmu_table_batch *batch)
  {
-	int i;
-
-	for (i = 0; i < batch->nr; i++)
-		__tlb_remove_table(batch->tables[i]);
-
+	__tlb_remove_tables(batch->tables, batch->nr);
  	free_page((unsigned long)batch);
  }

This leaves a single call-site for __tlb_remove_table():

static void tlb_remove_table_one(void *table)
{
         tlb_remove_table_sync_one();
         __tlb_remove_table(table);
}

Is that worth it, or could it just be:

	__tlb_remove_tables(&table, 1);

I was considering that while preparing the patch, however that resulted into even larger change in 
archs, due to removal of non-batched call, and I decided not to follow this way.

And, Peter's suggestion to integrate free_page_and_swap()-based implementation of __tlb_remove_table() 
into mm/mmu_gather.c under ifdef, and then do the optimization locally in mm/mmu_gather.c, looks better.

+void free_pages_and_swap_cache_nolru(struct page **pages, int nr)
+{
+	__free_pages_and_swap_cache(pages, nr, false);
  }

This went unmentioned in the changelog.  But, it seems like there's a
specific optimization here.  In the exiting code,
free_pages_and_swap_cache() is wasteful if no page in pages[] is on the
LRU.  It doesn't need the lru_add_drain().

This is a somewhat different topic.

In scope of this patch, the _nolru version was added because there was no lru draining in the looped 
call to __tlb_remove_table(). Having it added to the batched version, although won't break things, does 
add overhead that was not there before, which is in direct conflict with the original goal.

If the version with draining lru is indeed not needed, it can be cleaned out in scope of a different 
patchset.

		if (!do_lru)
			VM_WARN_ON_ONCE_PAGE(PageLRU(pagep[i]),
					     pagep[i]);
		free_swap_cache(...);

This looks like a good safety measure, will add it.

But, even more than that, do all the architectures even need the
free_swap_cache()?

I was under impression that process page tables are a valid target for swapping out. Although I can be 
wrong here.

Nikita