On 19/06/2024 00:26, Ryan Roberts wrote: > Hi All, > > Chris has been doing great work at [1] to clean up my mess in the mTHP swap > entry allocator. But Barry posted a test program and results at [2] showing that > even with Chris's changes, there are still some fallbacks (around 5% - 25% in > some cases). I was interested in why that might be and ended up putting this PoC > patch set together to try to get a better understanding. This series ends up > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done > much testing beyond that (yet) but thought it was worth posting on the strength > of that result alone. > > At a high level this works in a similar way to Chris's series; it marks a > cluster as being for a particular order and if a new cluster cannot be allocated > then it scans through the existing non-full clusters. But it does it by scanning > through the clusters rather than assembling them into a list. Cluster flags are > used to mark clusters that have been scanned and are known not to have enough > contiguous space, so the efficiency should be similar in practice. > > Because its not based around a linked list, there is less churn and I'm > wondering if this is perhaps easier to review and potentially even get into > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11 > for Chris's series? I know Chris has a larger roadmap of improvements, so at > best I see this as a tactical fix that will ultimately be superseeded by Chris's > work. > > There are a few differences to note vs Chris's series: > > - order-0 fallback scanning is still allowed in any cluster; the argument in the > past was that swap should always use all the swap space, so I've left this > mechanism in. It is only a fallback though; first the the new per-order > scanner is invoked, even for order-0, so if there are free slots in clusters > already assigned for order-0, then the allocation will go there. > > - CPUs can steal slots from other CPU's current clusters; those clusters remain > scannable while they are current for a CPU and are only made unscannable when > no more CPUs are scanning that particular cluster. > > - I'm preferring to allocate a free cluster ahead of per-order scanning, since, > as I understand it, the original intent of a per-cpu current cluster was to > get pages for an application adjacent in the swap to speed up IO. [...] I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allcoation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense). However, I'm running into an unexpected issue; It was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing oder-0 scanner uses. But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ? Here is my code, for reference. Note that "evict" is always set true to be very agressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed). With logging added, every single call to __try_to_reclaim_swap() returns 0. (well it also live locks due to that with code as shown, but I hacked around that in experiments). Any ideas, before I dig deeper? diff --git a/mm/swapfile.c b/mm/swapfile.c index f6d78723f0fd..928d61e1d9d5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, } static inline bool swap_range_empty(char *swap_map, unsigned int start, - unsigned int nr_pages) + unsigned int nr_pages, bool *inc_cached) { unsigned int i; - - for (i = 0; i < nr_pages; i++) { - if (swap_map[start + i]) - return false; + bool hit_cache = false; + + if (*inc_cached) { + for (i = 0; i < nr_pages; i++) { + if (swap_map[start + i]) { + if (swap_map[start + i] != SWAP_HAS_CACHE) + return false; + hit_cache = true; + } + } + *inc_cached = hit_cache; + } else { + for (i = 0; i < nr_pages; i++) { + if (swap_map[start + i]) + return false; + } } return true; @@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci; unsigned int tmp, max; unsigned int stop = SWAP_NEXT_INVALID; + bool evict = true; new_cluster: cluster = this_cpu_ptr(si->percpu_cluster); @@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, * natural alignment. */ max = ALIGN(tmp + 1, SWAPFILE_CLUSTER); - ci = lock_cluster(si, tmp); - while (tmp < max) { - if (swap_range_empty(si->swap_map, tmp, nr_pages)) - break; - tmp += nr_pages; +scan_cluster: + if (tmp < max) { + ci = lock_cluster(si, tmp); + while (tmp < max) { + if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict)) + break; + tmp += nr_pages; + } + unlock_cluster(ci); } - unlock_cluster(ci); if (tmp >= max) { cluster_dec_scanners(ci); cluster->next[order] = SWAP_NEXT_INVALID; goto new_cluster; } + if (evict) { + unsigned int off; + int nr; + + spin_unlock(&si->lock); + + for (off = tmp; off < tmp + nr_pages; off += nr) { + nr = 1; + if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) { + nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY); + if (nr < 0) + break; + else if (nr == 0) + nr = 1; + nr = ALIGN(off + 1, nr) - off; + } + } + + spin_lock(&si->lock); + *scan_base = this_cpu_read(*si->cluster_next_cpu); + *offset = *scan_base; + + if (nr < 0) + tmp += nr_pages; + + goto scan_cluster; + } *offset = tmp; *scan_base = tmp; tmp += nr_pages;