Ryan Roberts <ryan.roberts@xxxxxxx> writes:

> On 11/10/2023 11:36, Ryan Roberts wrote:
>> On 11/10/2023 09:25, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>
>>>> The upcoming anonymous small-sized THP feature enables performance
>>>> improvements by allocating large folios for anonymous memory. However
>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>> kernel compilation) across many cores, under high memory pressure, the
>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>> TLBIs added due to all the extra folio splitting.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out
>>>> small-sized THP without needing to split the folio, just like is already
>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>> enabled, and when the swap backing store is a non-rotating block device
>>>> - these are the same constraints as for the existing PMD-sized THP
>>>> swap-out support.
>>>>
>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>> page-by-page, like for PMD-sized THP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that it
>>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>> order and allocating sequentially from it until the cluster is full.
>>>> This ensures that we don't need to search the map and we get no
>>>> fragmentation due to alignment padding for different orders in the
>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>> fail the allocation and the caller falls back to splitting the folio and
>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>
>>>> As far as I can tell, this should not cause any extra fragmentation
>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>> concurrent use though. In practice, the number of orders in use will be
>>>> small.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>> ---
>>>>  include/linux/swap.h |  7 ++++++
>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>  mm/vmscan.c          | 10 +++++---
>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index a073366a227c..fc55b760aeff 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>  	 */
>>>>  	struct work_struct discard_work; /* discard worker */
>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>> +					     * next free offset within current
>>>> +					     * allocation cluster for large
>>>> +					     * folios, or UINT_MAX if no current
>>>> +					     * cluster. Index is (order - 1).
>>>> +					     * Only when cluster_info is used.
>>>> +					     */
>>>
>>> I think that it is better to make this per-CPU.  That is, extend the
>>> percpu_cluster mechanism.  Otherwise, we may have a scalability issue.
>>
>> Is your concern that the swap_info spinlock will get too contended as it's
>> currently written? From briefly looking at percpu_cluster, it looks like that
>> spinlock is always held when accessing the per-cpu structures - presumably
>> that's what's disabling preemption and making sure the thread is not migrated?
>> So I'm not sure what the benefit is currently? Surely you want to just disable
>> preemption but not hold the lock? I'm sure I've missed something crucial...
>
> I looked a bit further at how to implement what you are suggesting.
> get_swap_pages() currently takes the swap_info lock, which it needs in order
> to check and update some other parts of the swap_info - I'm not sure that part
> can be removed. swap_alloc_large() (my new function) is not doing an awful lot
> of work, so I'm not convinced that you would save much by releasing the lock
> for that part. In contrast, there is a lot more going on in
> scan_swap_map_slots(), so there is more benefit to releasing the lock and
> using the percpu stuff - correct me if I've misunderstood.
>
> As an alternative approach, perhaps it makes more sense to beef up the caching
> layer in swap_slots.c to handle large folios too? Then you avoid taking the
> swap_info lock at all most of the time, like you currently do for single-entry
> allocations.
>
> What do you think?

Sorry for the late reply.

percpu_cluster was introduced in commit ebc2a1a69111 ("swap: make cluster
allocation per-cpu").  Please check its changelog for why it was introduced.
Sorry about my incorrect memory about scalability: percpu_cluster was
introduced mainly for disk performance, not for scalability.
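Concretely, extending the percpu_cluster mechanism per order could look
roughly like the below (just a sketch to illustrate the idea, not tested;
the two existing fields are paraphrased from memory, and the new array
mirrors the large_next[] field your patch adds to swap_info_struct):

	struct percpu_cluster {
		struct swap_cluster_info index;	/* current cluster index */
		unsigned int next;		/* likely next allocation offset */
		unsigned int large_next[PMD_ORDER];
						/* sketch: likely next offset, per
						 * order, within this CPU's current
						 * cluster for that order, or
						 * UINT_MAX if no current cluster.
						 * Index is (order - 1). */
	};

Allocation for a given order would then consume entries sequentially from the
CPU-local cluster for that order, taking a new free cluster when it runs out
(and falling back to splitting the folio when none is available), i.e. the
same flow as the swap_info_struct-level version, just tracked per CPU so that
each CPU keeps writing sequentially within its own clusters.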
--
Best Regards,
Huang, Ying

>>
>>>
>>> And this should be enclosed in CONFIG_THP_SWAP.
>>
>> Yes, I'll fix this in the next version.
>>
>> Thanks for the review!
>>
>>>
>>>>  	struct plist_node avail_lists[]; /*
>>>>  					  * entries in swap_avail_heads, one
>>>>  					  * entry per node.