Other than the obvious "remove calls to compound_head()" changes, the
fundamental belief here is that iterating a linked list is much slower
than iterating an array (5-15x slower in my testing).  There's also an
associated belief that, since we iterate the batch of folios three
times, we do better with a small array (i.e. 15 entries) than with a
batch that is hundreds of entries long, which only gives the first
pages in the batch time to fall out of cache before we come back to
them.  (A rough user-space sketch of the list-vs-array effect is
appended below the diffstat.)

The one place where that probably falls down is "mm: Free folios in a
batch in shrink_folio_list()", where we'll flush the TLB once per
batch instead of once at the end.  That's going to take some
benchmarking.

Matthew Wilcox (Oracle) (14):
  mm: Make folios_put() the basis of release_pages()
  mm: Convert free_unref_page_list() to use folios
  mm: Add free_unref_folios()
  mm: Use folios_put() in __folio_batch_release()
  memcg: Add mem_cgroup_uncharge_folios()
  mm: Remove use of folio list from folios_put()
  mm: Use free_unref_folios() in put_pages_list()
  mm: use __page_cache_release() in folios_put()
  mm: Handle large folios in free_unref_folios()
  mm: Allow non-hugetlb large folios to be batch processed
  mm: Free folios in a batch in shrink_folio_list()
  mm: Free folios directly in move_folios_to_lru()
  memcg: Remove mem_cgroup_uncharge_list()
  mm: Remove free_unref_page_list()

 include/linux/memcontrol.h |  24 ++---
 include/linux/mm.h         |  19 +---
 mm/internal.h              |   4 +-
 mm/memcontrol.c            |  16 ++--
 mm/mlock.c                 |   3 +-
 mm/page_alloc.c            |  74 ++++++++-------
 mm/swap.c                  | 180 ++++++++++++++++++++-----------
 mm/vmscan.c                |  51 +++++------
 8 files changed, 181 insertions(+), 190 deletions(-)

-- 
2.40.1
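
Appendix: a minimal user-space sketch of the list-vs-array point made
above.  This is not the benchmark behind the 5-15x number; struct obj,
walk_array() and walk_list() are invented purely for illustration.  It
walks the same heap-allocated objects once through an array of
pointers (independent loads the CPU can overlap) and once through a
linked list (a chain of dependent loads), which is the effect the
series is built around.

/*
 * Hypothetical illustration only; not the benchmark used for the
 * numbers quoted in the cover letter.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NR_OBJS (1UL << 20)

struct obj {
	struct obj *next;	/* list linkage, analogous to folio->lru */
	unsigned long payload;
};

/*
 * Array walk: the address of each object is known up front, so the
 * loads are independent and can be issued in parallel.
 */
static unsigned long walk_array(struct obj **array, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += array[i]->payload;
	return sum;
}

/*
 * List walk: each object's address comes from the previous object,
 * so every load waits for the one before it.
 */
static unsigned long walk_list(struct obj *head)
{
	unsigned long sum = 0;

	for (struct obj *o = head; o; o = o->next)
		sum += o->payload;
	return sum;
}

static double elapsed_ms(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
	struct obj **array = malloc(NR_OBJS * sizeof(*array));
	struct obj *head = NULL, *tail = NULL;
	struct timespec t0, t1, t2;
	unsigned long s1, s2;

	/*
	 * Allocate the objects individually and thread them onto both
	 * the array and the list in the same order.
	 */
	for (size_t i = 0; i < NR_OBJS; i++) {
		struct obj *o = malloc(sizeof(*o));

		o->payload = i;
		o->next = NULL;
		if (tail)
			tail->next = o;
		else
			head = o;
		tail = o;
		array[i] = o;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	s1 = walk_array(array, NR_OBJS);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	s2 = walk_list(head);
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("array walk: %.2f ms (sum %lu)\n", elapsed_ms(t0, t1), s1);
	printf("list walk:  %.2f ms (sum %lu)\n", elapsed_ms(t1, t2), s2);
	free(array);
	return 0;
}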