Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list

Ryan Roberts <ryan.roberts@xxxxxxx> · Fri, 7 Jun 2024 11:35:38 +0100

On 24/05/2024 18:17, Chris Li wrote:
> Track the nonfull cluster as well as the empty cluster
> on lists. Each order has one nonfull cluster list.
> 
> The cluster will remember which order it was used during
> new cluster allocation.
> 
> When the cluster has free entry, add to the nonfull[order]
> list.  When the free cluster list is empty, also allocate
> from the nonempty list of that order.
> 
> This improves the mTHP swap allocation success rate.

If I've understood correctly, the aim here is to link all the current per-cpu
clusters for a given order together so that if a cpu can't allocate a new
cluster for a given order, then it can steal another CPU's current cluster for
that order?

If that's the intent, couldn't that be done just by iterating over the per-cpu,
per-order cluster pointers? Then you don't need all the linked list churn
(althogh I like the linked list changes as a nice cleanup, I'm not sure the
churn is neccessary for this change?). There would likely need to be some
locking considerations, but it would also allow you to get access to the next
entry within the cluster for allocation.

However, fundamentally, I don't think this change solves the problem; it just
takes a bit longer before the allocation fails. The real problem is
fragmentation due to freeing individual pages from swap entries at different times.

Wouldn't it be better to just extend scanning to support high order allocations?
Then we can steal a high order block from any cluster, even clusters that were
previously full, just like we currently do for order-0. Given we are already
falling back to this path for order-0, I don't think it would be any more
expensive; infact its less expensive because we only scan once for the high
order block, rather than scan for every split order-0 page.

Of course that still doesn't solve the proplem entirely; if swap is so
fragmented that there is no contiguous block of the required order then you
still have to fall back to splitting. As an extra optimization, you could store
the largest contiguous free space available in each cluster to avoid scanning in
case its too small?

> 
> There are limitations if the distribution of numbers of
> different orders of mTHP changes a lot. e.g. there are a lot
> of nonfull cluster assign to order A while later time there
> are a lot of order B allocation while very little allocation
> in order A. Currently the cluster used by order A will not
> reused by order B unless the cluster is 100% empty.
> 
> This situation is best addressed by the longer term "swap
> buddy allocator", in future patches.
> ---
>  include/linux/swap.h |  4 ++++
>  mm/swapfile.c        | 25 +++++++++++++++++++++++--
>  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0d3906eff3c9..1b7f0794b9bf 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -255,10 +255,12 @@ struct swap_cluster_info {
>  				 * cluster
>  				 */
>  	unsigned int count:16;
> +	unsigned int order:8;
>  	unsigned int flags:8;
>  	struct list_head next;
>  };
>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>  
>  
>  /*
> @@ -297,6 +299,8 @@ struct swap_info_struct {
>  	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>  	struct list_head free_clusters; /* free clusters list */
> +	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> +					/* list of cluster that contains at least one free slot */
>  	unsigned int lowest_bit;	/* index of first free in swap_map */
>  	unsigned int highest_bit;	/* index of last free in swap_map */
>  	unsigned int pages;		/* total of usable pages of swap */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 205a60c5f9cb..51923aba500e 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  
>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> +	if (ci->flags & CLUSTER_FLAG_NONFULL)
> +		list_move_tail(&ci->next, &si->free_clusters);
> +	else
> +		list_add_tail(&ci->next, &si->free_clusters);
>  	ci->flags = CLUSTER_FLAG_FREE;
> -	list_add_tail(&ci->next, &si->free_clusters);
>  }
>  
>  /*
> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>  	ci->count--;
>  
>  	if (!ci->count)
> -		free_cluster(p, ci);
> +		return free_cluster(p, ci);
> +
> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> +		list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
> +		ci->flags |= CLUSTER_FLAG_NONFULL;
> +	}
>  }
>  
>  /*
> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
>  			list_del(&ci->next);
>  			spin_lock(&ci->lock);
> +			ci->order = order;
> +			ci->flags = 0;
> +			spin_unlock(&ci->lock);
> +			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> +		} else if (!list_empty(&si->nonfull_clusters[order])) {
> +			ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
> +			list_del(&ci->next);
> +			spin_lock(&ci->lock);
>  			ci->flags = 0;
>  			spin_unlock(&ci->lock);
>  			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;

This looks wrong to me; if the cluster is on the nonfull list then it will have
had some entries already allocated (by another cpu). So pointing tmp to the
first block in the cluster will never yield a free block. The cpu from which you
are stealing the cluster stores the next free block location in its per-cpu
structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is
a better approach than the nonfull list?

Additionally, this cluster will be stored back to this cpu's current cluster at
the bottom of the function. That may or may not be what you intended.

> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  				break;
>  			tmp += nr_pages;
>  		}
> +		WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
>  		unlock_cluster(ci);
>  	}
>  	if (tmp >= max) {
> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>  	ci = lock_cluster(si, offset);
>  	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>  	ci->count = 0;
> +	ci->order = 0;
>  	ci->flags = 0;
>  	free_cluster(si, ci);
>  	unlock_cluster(ci);
> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>  	INIT_LIST_HEAD(&p->free_clusters);
>  	INIT_LIST_HEAD(&p->discard_clusters);
>  
> +	for (i = 0; i < SWAP_NR_ORDERS; i++)
> +		INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> +
>  	for (i = 0; i < swap_header->info.nr_badpages; i++) {
>  		unsigned int page_nr = swap_header->info.badpages[i];
>  		if (page_nr == 0 || page_nr > swap_header->info.last_page)
>