On 16/10/2023 07:17, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>
>> On 11/10/2023 11:36, Ryan Roberts wrote:
>>> On 11/10/2023 09:25, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>>
>>>>> The upcoming anonymous small-sized THP feature enables performance
>>>>> improvements by allocating large folios for anonymous memory. However
>>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>>> kernel compilation) across many cores, under high memory pressure, the
>>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>>> TLBIs added due to all the extra folio splitting.
>>>>>
>>>>> Therefore, solve this regression by adding support for swapping out
>>>>> small-sized THP without needing to split the folio, just like is already
>>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>>> enabled, and when the swap backing store is a non-rotating block device
>>>>> - these are the same constraints as for the existing PMD-sized THP
>>>>> swap-out support.
>>>>>
>>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>>> page-by-page, like for PMD-sized THP.
>>>>>
>>>>> The main change here is to improve the swap entry allocator so that it
>>>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>>> order and allocating sequentially from it until the cluster is full.
>>>>> This ensures that we don't need to search the map and we get no
>>>>> fragmentation due to alignment padding for different orders in the
>>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>>> fail the allocation and the caller falls back to splitting the folio and
>>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>>
>>>>> As far as I can tell, this should not cause any extra fragmentation
>>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>>> concurrent use though. In practice, the number of orders in use will be
>>>>> small though.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>> ---
>>>>>  include/linux/swap.h |  7 ++++++
>>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>>  mm/vmscan.c          | 10 +++++---
>>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index a073366a227c..fc55b760aeff 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>>  					 */
>>>>>  	struct work_struct discard_work; /* discard worker */
>>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>>> +					     * next free offset within current
>>>>> +					     * allocation cluster for large
>>>>> +					     * folios, or UINT_MAX if no current
>>>>> +					     * cluster. Index is (order - 1).
>>>>> +					     * Only when cluster_info is used.
>>>>> +					     */
>>>>
>>>> I think that it is better to make this per-CPU. That is, extend the
>>>> percpu_cluster mechanism. Otherwise, we may have scalability issue.
>>>
>>> Is your concern that the swap_info spinlock will get too contended as its
>>> currently written?
>>> From briefly looking at percpu_cluster, it looks like that
>>> spinlock is always held when accessing the per-cpu structures - presumably
>>> that's what's disabling preemption and making sure the thread is not migrated?
>>> So I'm not sure what the benefit is currently? Surely you want to just disable
>>> preemption but not hold the lock? I'm sure I've missed something crucial...
>>
>> I looked a bit further at how to implement what you are suggesting.
>> get_swap_pages() is currently taking the swap_info lock which it needs to check
>> and update some other parts of the swap_info - I'm not sure that part can be
>> removed. swap_alloc_large() (my new function) is not doing an awful lot of work,
>> so I'm not convinced that you would save too much by releasing the lock for that
>> part. In contrast there is a lot more going on in scan_swap_map_slots() so there
>> is more benefit to releasing the lock and using the percpu stuff - correct me if
>> I've missunderstood.
>>
>> As an alternative approach, perhaps it makes more sense to beef up the caching
>> layer in swap_slots.c to handle large folios too? Then you avoid taking the
>> swap_info lock at all most of the time, like you currently do for single entry
>> allocations.
>>
>> What do you think?
>
> Sorry for late reply.
>
> percpu_cluster is introduced in commit ebc2a1a69111 ("swap: make cluster
> allocation per-cpu"). Please check the changelog for why it's
> introduced. Sorry about my incorrect memory about scalability.
> percpu_cluster is introduced for disk performance mainly instead of
> scalability.

Thanks for the pointer. I'm not sure if you are still suggesting that I make my
small-sized THP allocation mechanism per-cpu though? I anticipate that by
virtue of allocating multiple contiguous swap entries for a small-sized THP we
already get a lot of the benefits that percpu_cluster gives order-0
allocations. (Although obviously it will only give contiguity matching the
size of the THP rather than a full cluster). The downside of making this
percpu would be keeping more free clusters tied up in the percpu caches,
potentially causing a need to scan for free entries more often.

What's your view?

Thanks,
Ryan

>
> --
> Best Regards,
> Huang, Ying
>
>>>
>>>>
>>>> And this should be enclosed in CONFIG_THP_SWAP.
>>>
>>> Yes, I'll fix this in the next version.
>>>
>>> Thanks for the review!
>>>
>>>>
>>>>>  	struct plist_node avail_lists[]; /*
>>>>> 					   * entries in swap_avail_heads, one
>>>>> 					   * entry per node.
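
P.S. For the CONFIG_THP_SWAP point in the quoted review above, I'd expect the
new field to simply end up wrapped along these lines in the next version (a
sketch only, not the final patch):

#ifdef CONFIG_THP_SWAP
	unsigned int large_next[PMD_ORDER]; /*
					     * next free offset within current
					     * allocation cluster for large
					     * folios, or UINT_MAX if no current
					     * cluster. Index is (order - 1).
					     * Only when cluster_info is used.
					     */
#endif /* CONFIG_THP_SWAP */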
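
And in case it helps the per-cpu discussion, here is a rough, userspace-only
model of the allocation scheme the commit message describes: one "current
cluster" per order, sequential allocation within it, and failure (i.e. fall
back to splitting the folio) when no free cluster is left. It is deliberately
not the kernel code - CLUSTER_SIZE, NR_CLUSTERS, MAX_ORDER, alloc_free_cluster()
and alloc_large() are illustrative stand-ins rather than the real swapfile.c
API:

/*
 * Toy model of the per-order allocation scheme (not the actual patch).
 * Each order has its own current cluster; entries are handed out
 * sequentially from it; when it fills up a new free cluster is taken;
 * if none is free the allocation fails and the caller would fall back
 * to splitting the folio.
 */
#include <stdio.h>
#include <stdint.h>

#define CLUSTER_SIZE	256U	/* stand-in for SWAPFILE_CLUSTER */
#define NR_CLUSTERS	8U	/* tiny pretend swap device */
#define MAX_ORDER	4U	/* stand-in for PMD_ORDER */

static int cluster_used[NR_CLUSTERS];
/* next free offset for each order, or UINT32_MAX if no current cluster */
static uint32_t large_next[MAX_ORDER];

static uint32_t alloc_free_cluster(void)
{
	for (uint32_t i = 0; i < NR_CLUSTERS; i++) {
		if (!cluster_used[i]) {
			cluster_used[i] = 1;
			return i;
		}
	}
	return UINT32_MAX;
}

/* Allocate 1 << order contiguous entries; -1 means "split the folio". */
static long alloc_large(unsigned int order)
{
	uint32_t *next = &large_next[order - 1];	/* index is (order - 1) */
	uint32_t nr = 1U << order;
	long offset;

	if (*next == UINT32_MAX) {
		uint32_t cluster = alloc_free_cluster();

		if (cluster == UINT32_MAX)
			return -1;		/* no free clusters left */
		*next = cluster * CLUSTER_SIZE;
	}

	/*
	 * The cluster only ever serves this one order and nr divides
	 * CLUSTER_SIZE, so entries stay naturally aligned and there is
	 * no padding between allocations of different orders.
	 */
	offset = *next;
	*next += nr;
	if (*next % CLUSTER_SIZE == 0)
		*next = UINT32_MAX;		/* cluster is now full */

	return offset;
}

int main(void)
{
	for (uint32_t i = 0; i < MAX_ORDER; i++)
		large_next[i] = UINT32_MAX;

	printf("order 2 -> %ld\n", alloc_large(2));	/* 0 */
	printf("order 2 -> %ld\n", alloc_large(2));	/* 4 */
	printf("order 4 -> %ld\n", alloc_large(4));	/* 256 (separate cluster) */
	printf("order 2 -> %ld\n", alloc_large(2));	/* 8 */
	return 0;
}

So the two order-2 allocations come back at offsets 0 and 4 from one cluster,
while the first order-4 allocation comes from a separate cluster at offset 256;
that separation per order is why there is no alignment padding, at the cost of
up to one partially-used cluster per order in flight.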