On 11/10/2023 09:25, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@xxxxxxx> writes: > >> The upcoming anonymous small-sized THP feature enables performance >> improvements by allocating large folios for anonymous memory. However >> I've observed that on an arm64 system running a parallel workload (e.g. >> kernel compilation) across many cores, under high memory pressure, the >> speed regresses. This is due to bottlenecking on the increased number of >> TLBIs added due to all the extra folio splitting. >> >> Therefore, solve this regression by adding support for swapping out >> small-sized THP without needing to split the folio, just like is already >> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is >> enabled, and when the swap backing store is a non-rotating block device >> - these are the same constraints as for the existing PMD-sized THP >> swap-out support. >> >> Note that no attempt is made to swap-in THP here - this is still done >> page-by-page, like for PMD-sized THP. >> >> The main change here is to improve the swap entry allocator so that it >> can allocate any power-of-2 number of contiguous entries between [4, (1 >> << PMD_ORDER)]. This is done by allocating a cluster for each distinct >> order and allocating sequentially from it until the cluster is full. >> This ensures that we don't need to search the map and we get no >> fragmentation due to alignment padding for different orders in the >> cluster. If there is no current cluster for a given order, we attempt to >> allocate a free cluster from the list. If there are no free clusters, we >> fail the allocation and the caller falls back to splitting the folio and >> allocates individual entries (as per existing PMD-sized THP fallback). >> >> As far as I can tell, this should not cause any extra fragmentation >> concerns, given how similar it is to the existing PMD-sized THP >> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in >> concurrent use though. In practice, the number of orders in use will be >> small though. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx> >> --- >> include/linux/swap.h | 7 ++++++ >> mm/swapfile.c | 60 +++++++++++++++++++++++++++++++++----------- >> mm/vmscan.c | 10 +++++--- >> 3 files changed, 59 insertions(+), 18 deletions(-) >> >> diff --git a/include/linux/swap.h b/include/linux/swap.h >> index a073366a227c..fc55b760aeff 100644 >> --- a/include/linux/swap.h >> +++ b/include/linux/swap.h >> @@ -320,6 +320,13 @@ struct swap_info_struct { >> */ >> struct work_struct discard_work; /* discard worker */ >> struct swap_cluster_list discard_clusters; /* discard clusters list */ >> + unsigned int large_next[PMD_ORDER]; /* >> + * next free offset within current >> + * allocation cluster for large >> + * folios, or UINT_MAX if no current >> + * cluster. Index is (order - 1). >> + * Only when cluster_info is used. >> + */ > > I think that it is better to make this per-CPU. That is, extend the > percpu_cluster mechanism. Otherwise, we may have scalability issue. Is your concern that the swap_info spinlock will get too contended as its currently written? From briefly looking at percpu_cluster, it looks like that spinlock is always held when accessing the per-cpu structures - presumably that's what's disabling preemption and making sure the thread is not migrated? So I'm not sure what the benefit is currently? Surely you want to just disable preemption but not hold the lock? I'm sure I've missed something crucial... > > And this should be enclosed in CONFIG_THP_SWAP. Yes, I'll fix this in the next version. Thanks for the review! > >> struct plist_node avail_lists[]; /* >> * entries in swap_avail_heads, one >> * entry per node. > > -- > Best Regards, > Huang, Ying