On Fri, Jun 7, 2024 at 3:57 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 07/06/2024 11:35, Ryan Roberts wrote:
> > On 24/05/2024 18:17, Chris Li wrote:
> >> Track the nonfull clusters as well as the empty clusters
> >> on lists. Each order has one nonfull cluster list.
> >>
> >> The cluster will remember which order it was used for
> >> during new cluster allocation.
> >>
> >> When a cluster has a free entry, add it to the
> >> nonfull[order] list. When the free cluster list is empty,
> >> also allocate from the nonfull list of that order.
> >>
> >> This improves the mTHP swap allocation success rate.
> >
> > If I've understood correctly, the aim here is to link all the current per-cpu
> > clusters for a given order together so that if a cpu can't allocate a new
> > cluster for a given order, then it can steal another CPU's current cluster for
> > that order?
> >
> > If that's the intent, couldn't that be done just by iterating over the per-cpu,
> > per-order cluster pointers? Then you don't need all the linked list churn
> > (although I like the linked list changes as a nice cleanup, I'm not sure the
> > churn is necessary for this change?). There would likely need to be some
> > locking considerations, but it would also allow you to get access to the next
> > entry within the cluster for allocation.
> >
> > However, fundamentally, I don't think this change solves the problem; it just
> > takes a bit longer before the allocation fails. The real problem is
> > fragmentation due to freeing individual pages from swap entries at different
> > times.
> >
> > Wouldn't it be better to just extend scanning to support high order
> > allocations? Then we can steal a high order block from any cluster, even
> > clusters that were previously full, just like we currently do for order-0.
> > Given we are already falling back to this path for order-0, I don't think it
> > would be any more expensive; in fact it's less expensive because we only scan
> > once for the high order block, rather than scanning for every split order-0
> > page.
> >
> > Of course that still doesn't solve the problem entirely; if swap is so
> > fragmented that there is no contiguous block of the required order then you
> > still have to fall back to splitting. As an extra optimization, you could
> > store the largest contiguous free space available in each cluster to avoid
> > scanning in case it's too small?
> >
> >
> >>
> >> There are limitations if the distribution of the different
> >> mTHP orders changes a lot, e.g. a lot of nonfull clusters get
> >> assigned to order A, while later there are a lot of order B
> >> allocations and very few order A allocations. Currently a
> >> cluster used by order A will not be reused by order B unless
> >> the cluster is 100% empty.
> >>
> >> This situation is best addressed by the longer term "swap
> >> buddy allocator" in future patches.
> >> ---
> >>  include/linux/swap.h |  4 ++++
> >>  mm/swapfile.c        | 25 +++++++++++++++++++++++--
> >>  2 files changed, 27 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> index 0d3906eff3c9..1b7f0794b9bf 100644
> >> --- a/include/linux/swap.h
> >> +++ b/include/linux/swap.h
> >> @@ -255,10 +255,12 @@ struct swap_cluster_info {
> >>       * cluster
> >>       */
> >>      unsigned int count:16;
> >> +    unsigned int order:8;
> >>      unsigned int flags:8;
> >>      struct list_head next;
> >>  };
> >>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> >> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >>
> >>
> >>  /*
> >> @@ -297,6 +299,8 @@ struct swap_info_struct {
> >>      unsigned char *swap_map;    /* vmalloc'ed array of usage counts */
> >>      struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >>      struct list_head free_clusters; /* free clusters list */
> >> +    struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> >> +                                /* list of cluster that contains at least one free slot */
> >>      unsigned int lowest_bit;    /* index of first free in swap_map */
> >>      unsigned int highest_bit;   /* index of last free in swap_map */
> >>      unsigned int pages;         /* total of usable pages of swap */
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index 205a60c5f9cb..51923aba500e 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >>
> >>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >>  {
> >> +    if (ci->flags & CLUSTER_FLAG_NONFULL)
> >> +        list_move_tail(&ci->next, &si->free_clusters);
> >> +    else
> >> +        list_add_tail(&ci->next, &si->free_clusters);
> >>      ci->flags = CLUSTER_FLAG_FREE;
> >> -    list_add_tail(&ci->next, &si->free_clusters);
> >>  }
> >>
> >>  /*
> >> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> >>      ci->count--;
> >>
> >>      if (!ci->count)
> >> -        free_cluster(p, ci);
> >> +        return free_cluster(p, ci);
> >> +
> >> +    if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >> +        list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
> >> +        ci->flags |= CLUSTER_FLAG_NONFULL;
> >> +    }
> >>  }
> >>
> >>  /*
> >> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >>          ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> >>          list_del(&ci->next);
> >>          spin_lock(&ci->lock);
> >> +        ci->order = order;
> >> +        ci->flags = 0;
> >> +        spin_unlock(&ci->lock);
> >> +        tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> >> +    } else if (!list_empty(&si->nonfull_clusters[order])) {
> >> +        ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
> >> +        list_del(&ci->next);
> >> +        spin_lock(&ci->lock);
> >>          ci->flags = 0;
> >>          spin_unlock(&ci->lock);
> >>          tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> >
> > This looks wrong to me; if the cluster is on the nonfull list then it will have
> > had some entries already allocated (by another cpu). So pointing tmp to the
> > first block in the cluster will never yield a free block. The cpu from which you
> > are stealing the cluster stores the next free block location in its per-cpu
> > structure. So perhaps iterating over the other cpus' `struct percpu_cluster`s is
> > a better approach than the nonfull list?
>
> Ahh; of course the cluster scan below will move this along to a free block.

You mean the (offset < max) loop, right? Agree.
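For reference, roughly what that loop does, reusing the tmp/max names from
the hunk above. This is a sketch, not the literal kernel code;
range_is_free() is a made-up stand-in for checking that all nr_pages
entries in si->swap_map are free:

    /*
     * Walk the stolen cluster in nr_pages strides, starting from its
     * first slot, until a fully free range is found or we run off
     * the end of the cluster.
     */
    while (tmp < max) {
        if (range_is_free(si->swap_map, tmp, nr_pages))
            break;          /* nr_pages free slots found at tmp */
        tmp += nr_pages;
    }

So even though tmp initially points at entries that may already be in use,
the scan advances it to the first free block of the requested order before
the allocation proceeds.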
Chris

>
> >
> > Additionally, this cluster will be stored back to this cpu's current cluster at
> > the bottom of the function. That may or may not be what you intended.
> >
> >> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >>                  break;
> >>              tmp += nr_pages;
> >>          }
> >> +        WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
> >>          unlock_cluster(ci);
> >>      }
> >>      if (tmp >= max) {
> >> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >>      ci = lock_cluster(si, offset);
> >>      memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >>      ci->count = 0;
> >> +    ci->order = 0;
> >>      ci->flags = 0;
> >>      free_cluster(si, ci);
> >>      unlock_cluster(ci);
> >> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >>      INIT_LIST_HEAD(&p->free_clusters);
> >>      INIT_LIST_HEAD(&p->discard_clusters);
> >>
> >> +    for (i = 0; i < SWAP_NR_ORDERS; i++)
> >> +        INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> >> +
> >>      for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >>          unsigned int page_nr = swap_header->info.badpages[i];
> >>          if (page_nr == 0 || page_nr > swap_header->info.last_page)
> >>
> >
> >
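P.S. Regarding the suggestion above to store the largest contiguous free
space available in each cluster so the scan can be skipped when it's too
small: my reading is something like the sketch below. The free_max field
and its update rules are made up for illustration; nothing like this is in
the posted patch.

    struct swap_cluster_info {
        unsigned int count:16;
        unsigned int order:8;
        unsigned int flags:8;
        unsigned int free_max;  /* hypothetical: upper bound on the
                                 * longest free run, in slots */
        struct list_head next;
    };

    /* Before scanning a nonfull cluster for a 1 << order allocation: */
    if (ci->free_max < nr_pages)
        continue;               /* cannot possibly fit, skip the scan */

To stay cheap it would have to be a conservative upper bound: allocations
can leave it stale-high (costing at most one wasted scan), frees bump it
up, and an exhaustive scan can re-tighten it to the exact value.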