On 12/17/21 12:19 AM, Nikita Yushchenko wrote:
> When batched page table freeing via struct mmu_table_batch is used, the
> final freeing in __tlb_remove_table_free() executes a loop, calling
> arch hook __tlb_remove_table() to free each table individually.
>
> Shift that loop down to archs. This allows archs to optimize it, by
> freeing multiple tables in a single release_pages() call. This is
> faster than individual put_page() calls, especially with memcg
> accounting enabled.

Could we quantify "faster"?  There's a non-trivial amount of code being
added here and it would be nice to back it up with some cold-hard numbers.

> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -95,11 +95,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
>
>  static void __tlb_remove_table_free(struct mmu_table_batch *batch)
>  {
> -	int i;
> -
> -	for (i = 0; i < batch->nr; i++)
> -		__tlb_remove_table(batch->tables[i]);
> -
> +	__tlb_remove_tables(batch->tables, batch->nr);
>  	free_page((unsigned long)batch);
>  }

This leaves a single call-site for __tlb_remove_table():

> static void tlb_remove_table_one(void *table)
> {
>	tlb_remove_table_sync_one();
>	__tlb_remove_table(table);
> }

Is that worth it, or could it just be:

	__tlb_remove_tables(&table, 1);

?

> -void free_pages_and_swap_cache(struct page **pages, int nr)
> +static void __free_pages_and_swap_cache(struct page **pages, int nr,
> +		bool do_lru)
>  {
> -	struct page **pagep = pages;
>  	int i;
>
> -	lru_add_drain();
> +	if (do_lru)
> +		lru_add_drain();
>  	for (i = 0; i < nr; i++)
> -		free_swap_cache(pagep[i]);
> -	release_pages(pagep, nr);
> +		free_swap_cache(pages[i]);
> +	release_pages(pages, nr);
> +}
> +
> +void free_pages_and_swap_cache(struct page **pages, int nr)
> +{
> +	__free_pages_and_swap_cache(pages, nr, true);
> +}
> +
> +void free_pages_and_swap_cache_nolru(struct page **pages, int nr)
> +{
> +	__free_pages_and_swap_cache(pages, nr, false);
>  }

This went unmentioned in the changelog.
But, it seems like there's a specific optimization here.  In the
existing code, free_pages_and_swap_cache() is wasteful if no page in
pages[] is on the LRU.  It doesn't need the lru_add_drain().

Any code that knows it is freeing all non-LRU pages can call
free_pages_and_swap_cache_nolru() which should perform better than
free_pages_and_swap_cache().

Should we add this to the for loop in __free_pages_and_swap_cache()?

	for (i = 0; i < nr; i++) {
		if (!do_lru)
			VM_WARN_ON_ONCE_PAGE(PageLRU(pages[i]),
					     pages[i]);
		free_swap_cache(...);
	}

But, even more than that, do all the architectures even need the
free_swap_cache()?  PageSwapCache() will always be false on x86, which
makes the loop kinda silly.  x86 could, for instance, just do:

static inline void __tlb_remove_tables(void **tables, int nr)
{
	release_pages((struct page **)tables, nr);
}

I _think_ this will work everywhere that has whole pages as page tables.

Taking that one step further, what if we only had one generic:

static inline void tlb_remove_tables(void **tables, int nr)
{
#ifdef ARCH_PAGE_TABLES_ARE_FULL_PAGE
	release_pages((struct page **)tables, nr);
#else
	arch_tlb_remove_tables(tables, nr);
#endif
}

Architectures that set ARCH_PAGE_TABLES_ARE_FULL_PAGE (or whatever)
don't need to implement __tlb_remove_table() at all *and* can do
release_pages() directly.

This avoids all the confusion with the swap cache and LRU naming.
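
To make the shape of that generic helper concrete, here is a rough
user-space sketch, not kernel code.  struct page and release_pages() are
stubs, ARCH_PAGE_TABLES_ARE_FULL_PAGE and arch_tlb_remove_tables() are
the made-up names from above, and tlb_remove_table_one_sketch() is a
hypothetical name for the "batch of one" idea.  It leaves out
tlb_remove_table_sync_one() entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opt-in from the mail above; not an existing kernel symbol. */
#define ARCH_PAGE_TABLES_ARE_FULL_PAGE 1

/* Minimal stand-in for the kernel's struct page. */
struct page {
	int refcount;
};

/* Counts batched calls, to show that one call covers many tables. */
static int release_pages_calls;

/* Stub for release_pages(): drop one reference per page in a single call. */
static void release_pages(struct page **pages, int nr)
{
	int i;

	assert(nr == 0 || pages != NULL);
	release_pages_calls++;
	for (i = 0; i < nr; i++)
		pages[i]->refcount--;
}

#ifndef ARCH_PAGE_TABLES_ARE_FULL_PAGE
/* Hypothetical per-arch hook for archs whose tables are not whole pages. */
static void arch_tlb_remove_tables(void **tables, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		((struct page *)tables[i])->refcount--;
}
#endif

/* The one generic entry point: batch directly when tables are full pages. */
static inline void tlb_remove_tables(void **tables, int nr)
{
#ifdef ARCH_PAGE_TABLES_ARE_FULL_PAGE
	release_pages((struct page **)tables, nr);
#else
	arch_tlb_remove_tables(tables, nr);
#endif
}

/* Single-table path expressed as a batch of one (sync step omitted). */
static inline void tlb_remove_table_one_sketch(struct page *table)
{
	tlb_remove_tables((void **)&table, 1);
}
```

With the define set, freeing a two-entry batch goes through the stubbed
release_pages() once, and the single-table path reuses the same helper,
so no separate __tlb_remove_table() is needed in this sketch.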