Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements

Barry Song <baohua@xxxxxxxxxx> · Wed, 19 Jun 2024 21:11:39 +1200

On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> Hi All,
>
> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
> entry allocator. But Barry posted a test program and results at [2] showing that
> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
> some cases). I was interested in why that might be and ended up putting this PoC
> patch set together to try to get a better understanding. This series ends up
> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
> much testing beyond that (yet) but thought it was worth posting on the strength
> of that result alone.
>
> At a high level this works in a similar way to Chris's series; it marks a
> cluster as being for a particular order and if a new cluster cannot be allocated
> then it scans through the existing non-full clusters. But it does it by scanning
> through the clusters rather than assembling them into a list. Cluster flags are
> used to mark clusters that have been scanned and are known not to have enough
> contiguous space, so the efficiency should be similar in practice.
>
> Because it's not based around a linked list, there is less churn and I'm
> wondering if this is perhaps easier to review and potentially even get into
> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
> for Chris's series? I know Chris has a larger roadmap of improvements, so at
> best I see this as a tactical fix that will ultimately be superseded by Chris's
> work.
>
> There are a few differences to note vs Chris's series:
>
> - order-0 fallback scanning is still allowed in any cluster; the argument in the
>   past was that swap should always use all the swap space, so I've left this
>   mechanism in. It is only a fallback though; first the new per-order
>   scanner is invoked, even for order-0, so if there are free slots in clusters
>   already assigned for order-0, then the allocation will go there.
>
> - CPUs can steal slots from other CPUs' current clusters; those clusters remain
>   scannable while they are current for a CPU and are only made unscannable when
>   no more CPUs are scanning that particular cluster.
>
> - I'm preferring to allocate a free cluster ahead of per-order scanning, since,
>   as I understand it, the original intent of a per-cpu current cluster was to
>   get pages for an application adjacent in the swap to speed up IO.
>
> I'd be keen to hear if you think we could get something like this into v6.10 to
> fix the mess - I'm willing to work quickly to address comments and do more
> testing. If not, then this is probably just a distraction and we should
> concentrate on Chris's series.
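
For anyone following along, the scheme described above (try the current
cluster, then a free cluster, then scan non-full clusters of the same order,
flagging the ones known to have no room) can be sketched as a toy userspace
model. This is not code from the series: the names (struct swap_sim,
sim_alloc(), sim_free(), no_space) and the sizes are all hypothetical, there is
a single "CPU", and the order-0 fallback into any cluster is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CLUSTER_SLOTS 16   /* hypothetical; chosen small for illustration */
#define NR_CLUSTERS   4
#define NR_ORDERS     5    /* orders 0..4, i.e. 1..16 slots */
#define ORDER_NONE    (-1)

struct cluster {
	int order;                 /* order served, or ORDER_NONE if free */
	bool no_space;             /* scanned, no aligned free run found */
	bool used[CLUSTER_SLOTS];
};

struct swap_sim {
	struct cluster clusters[NR_CLUSTERS];
	int current[NR_ORDERS];    /* current cluster per order, -1 if none */
};

static void sim_init(struct swap_sim *s)
{
	memset(s, 0, sizeof(*s));
	for (int i = 0; i < NR_CLUSTERS; i++)
		s->clusters[i].order = ORDER_NONE;
	for (int o = 0; o < NR_ORDERS; o++)
		s->current[o] = -1;
}

/* Take the first naturally aligned free run; flag the cluster on failure. */
static int cluster_take(struct cluster *c, int order)
{
	int nr = 1 << order;

	for (int off = 0; off + nr <= CLUSTER_SLOTS; off += nr) {
		int i = 0;

		while (i < nr && !c->used[off + i])
			i++;
		if (i == nr) {
			for (i = 0; i < nr; i++)
				c->used[off + i] = true;
			return off;
		}
	}
	c->no_space = true;
	return -1;
}

/* Returns a global slot index, or -1 if the caller must fall back. */
static int sim_alloc(struct swap_sim *s, int order)
{
	assert(order >= 0 && order < NR_ORDERS);
	int cur = s->current[order];

	if (cur >= 0) {
		struct cluster *c = &s->clusters[cur];

		if (c->order == order && !c->no_space) {
			int off = cluster_take(c, order);

			if (off >= 0)
				return cur * CLUSTER_SLOTS + off;
		}
	}
	/* Pass 0: prefer a free cluster. Pass 1: scan same-order clusters. */
	for (int pass = 0; pass < 2; pass++) {
		for (int i = 0; i < NR_CLUSTERS; i++) {
			struct cluster *c = &s->clusters[i];
			bool candidate = pass == 0 ? c->order == ORDER_NONE :
				(c->order == order && !c->no_space);

			if (!candidate)
				continue;
			c->order = order;
			int off = cluster_take(c, order);

			if (off >= 0) {
				s->current[order] = i;
				return i * CLUSTER_SLOTS + off;
			}
		}
	}
	return -1;
}

static void sim_free(struct swap_sim *s, int slot, int order)
{
	struct cluster *c = &s->clusters[slot / CLUSTER_SLOTS];
	int off = slot % CLUSTER_SLOTS;

	for (int i = 0; i < (1 << order); i++)
		c->used[off + i] = false;
	c->no_space = false;       /* space appeared, make it scannable again */
	for (int i = 0; i < CLUSTER_SLOTS; i++)
		if (c->used[i])
			return;
	c->order = ORDER_NONE;     /* fully empty, back to the free pool */
}
```

In this model a freed run clears the no_space flag, so the next scan of that
order finds it again without walking clusters already known to be full.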

Ryan, thank you very much for accomplishing this.

I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of
add_to_swap() for both your approach and Chris's. I will update you with
the results ASAP.

I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has
caused a couple of crashes.

>
> This applies on top of v6.10-rc4.
>
> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@xxxxxxxxx/
>
> Thanks,
> Ryan
>
> Ryan Roberts (5):
>   mm: swap: Simplify end-of-cluster calculation
>   mm: swap: Change SWAP_NEXT_INVALID to highest value
>   mm: swap: Track allocation order for clusters
>   mm: swap: Scan for free swap entries in allocated clusters
>   mm: swap: Optimize per-order cluster scanning
>
>  include/linux/swap.h |  18 +++--
>  mm/swapfile.c        | 164 ++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 157 insertions(+), 25 deletions(-)
>
> --
> 2.43.0
>

Thanks
Barry