[RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements

Ryan Roberts <ryan.roberts@xxxxxxx> · Wed, 19 Jun 2024 00:26:40 +0100

Hi All,

Chris has been doing great work at [1] to clean up my mess in the mTHP swap
entry allocator. But Barry posted a test program and results at [2] showing that
even with Chris's changes, there are still some fallbacks (around 5% - 25% in
some cases). I was interested in why that might be and ended up putting this PoC
patch set together to try to get a better understanding. This series ends up
achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
much testing beyond that (yet) but thought it was worth posting on the strength
of that result alone.

At a high level this works in a similar way to Chris's series; it marks a
cluster as being for a particular order and if a new cluster cannot be allocated
then it scans through the existing non-full clusters. But it does it by scanning
through the clusters rather than assembling them into a list. Cluster flags are
used to mark clusters that have been scanned and are known not to have enough
contiguous space, so the efficiency should be similar in practice.
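
As a rough illustration of the scanning idea described above, here is a
userspace toy model. It is not code from the series: every name in it
(toy_cluster, no_space, toy_alloc, toy_free, the 16-slot cluster size) is
hypothetical, and clearing the flag on free is an assumption about how a
"known not to have space" mark could be invalidated.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CLUSTERS       4
#define SLOTS_PER_CLUSTER 16
#define ORDER_NONE        (-1)

/* Hypothetical stand-in for per-cluster metadata. */
struct toy_cluster {
	int order;      /* order the cluster is assigned to, or ORDER_NONE */
	bool no_space;  /* a scan found no aligned free run for that order */
	bool used[SLOTS_PER_CLUSTER];
};

static void toy_init(struct toy_cluster *cl)
{
	for (int i = 0; i < NR_CLUSTERS; i++) {
		cl[i].order = ORDER_NONE;
		cl[i].no_space = false;
		for (int j = 0; j < SLOTS_PER_CLUSTER; j++)
			cl[i].used[j] = false;
	}
}

/* Return the first naturally aligned free run of 1 << order slots, or -1. */
static int toy_scan_cluster(const struct toy_cluster *c, int order)
{
	int nr = 1 << order;

	for (int base = 0; base + nr <= SLOTS_PER_CLUSTER; base += nr) {
		bool all_free = true;

		for (int i = 0; i < nr; i++) {
			if (c->used[base + i]) {
				all_free = false;
				break;
			}
		}
		if (all_free)
			return base;
	}
	return -1;
}

static void toy_claim(struct toy_cluster *c, int base, int order)
{
	for (int i = 0; i < (1 << order); i++)
		c->used[base + i] = true;
}

/* Return a global slot offset, or -1 if the caller would have to fall back. */
static int toy_alloc(struct toy_cluster *cl, int order)
{
	assert(order >= 0 && (1 << order) <= SLOTS_PER_CLUSTER);

	/* Prefer a completely free cluster and assign it to this order. */
	for (int i = 0; i < NR_CLUSTERS; i++) {
		if (cl[i].order == ORDER_NONE) {
			cl[i].order = order;
			toy_claim(&cl[i], 0, order);
			return i * SLOTS_PER_CLUSTER;
		}
	}

	/* Otherwise scan clusters of the same order, skipping flagged ones. */
	for (int i = 0; i < NR_CLUSTERS; i++) {
		int base;

		if (cl[i].order != order || cl[i].no_space)
			continue;
		base = toy_scan_cluster(&cl[i], order);
		if (base >= 0) {
			toy_claim(&cl[i], base, order);
			return i * SLOTS_PER_CLUSTER + base;
		}
		cl[i].no_space = true; /* do not rescan until something is freed */
	}
	return -1;
}

/* Assumption: freeing an entry makes its cluster scannable again. */
static void toy_free(struct toy_cluster *cl, int offset, int order)
{
	struct toy_cluster *c = &cl[offset / SLOTS_PER_CLUSTER];

	for (int i = 0; i < (1 << order); i++)
		c->used[offset % SLOTS_PER_CLUSTER + i] = false;
	c->no_space = false;
}
```

In this model a full cluster is scanned at most once between frees, which is
the sense in which flagging is meant to stand in for list membership.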

Because it's not based around a linked list, there is less churn and I'm
wondering if this is perhaps easier to review and potentially even get into
v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
for Chris's series? I know Chris has a larger roadmap of improvements, so at
best I see this as a tactical fix that will ultimately be superseded by Chris's
work.

There are a few differences to note vs Chris's series:

- order-0 fallback scanning is still allowed in any cluster; the argument in the
  past was that swap should always use all the swap space, so I've left this
  mechanism in. It is only a fallback though; first the new per-order
  scanner is invoked, even for order-0, so if there are free slots in clusters
  already assigned for order-0, then the allocation will go there.

- CPUs can steal slots from other CPUs' current clusters; those clusters remain
  scannable while they are current for a CPU and are only made unscannable when
  no more CPUs are scanning that particular cluster.

- I'm preferring to allocate a free cluster ahead of per-order scanning, since,
  as I understand it, the original intent of a per-cpu current cluster was to
  get pages for an application adjacent in the swap to speed up IO.
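
To make the ordering in the points above concrete, here is a minimal sketch of
the preference order as I have described it. It is only a toy: the enum,
struct and function names are hypothetical, the three booleans stand in for
real allocator state, and the per-cpu current cluster (which would be tried
before any of this) is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the allocation paths discussed above. */
enum toy_path {
	TOY_FREE_CLUSTER,     /* take a brand new free cluster */
	TOY_PER_ORDER_SCAN,   /* scan clusters already assigned to this order */
	TOY_ORDER0_FALLBACK,  /* order-0 only: scan any cluster */
	TOY_FAIL,             /* caller has to fall back to a smaller order */
};

/* Hypothetical summary of allocator state at the time of the request. */
struct toy_state {
	bool have_free_cluster;    /* at least one cluster is entirely free */
	bool per_order_scan_hits;  /* a same-order cluster has an aligned run */
	bool any_free_slot;        /* some slot is free somewhere */
};

static enum toy_path toy_choose_path(const struct toy_state *s, int order)
{
	assert(order >= 0);

	/* A free cluster is preferred ahead of per-order scanning. */
	if (s->have_free_cluster)
		return TOY_FREE_CLUSTER;

	/* The per-order scanner runs for every order, including order-0. */
	if (s->per_order_scan_hits)
		return TOY_PER_ORDER_SCAN;

	/* Only order-0 may then scan clusters assigned to other orders. */
	if (order == 0 && s->any_free_slot)
		return TOY_ORDER0_FALLBACK;

	return TOY_FAIL;
}
```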

I'd be keen to hear if you think we could get something like this into v6.10 to
fix the mess - I'm willing to work quickly to address comments and do more
testing. If not, then this is probably just a distraction and we should
concentrate on Chris's series.

This applies on top of v6.10-rc4.

[1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@xxxxxxxxx/

Thanks,
Ryan

Ryan Roberts (5):
  mm: swap: Simplify end-of-cluster calculation
  mm: swap: Change SWAP_NEXT_INVALID to highest value
  mm: swap: Track allocation order for clusters
  mm: swap: Scan for free swap entries in allocated clusters
  mm: swap: Optimize per-order cluster scanning

 include/linux/swap.h |  18 +++--
 mm/swapfile.c        | 164 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 157 insertions(+), 25 deletions(-)

--
2.43.0