Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements

Barry Song <baohua@xxxxxxxxxx> · Fri, 21 Jun 2024 20:48:07 +1200

On Wed, Jun 19, 2024 at 9:18 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 19/06/2024 10:11, Barry Song wrote:
> > On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>
> >> Hi All,
> >>
> >> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
> >> entry allocator. But Barry posted a test program and results at [2] showing that
> >> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
> >> some cases). I was interested in why that might be and ended up putting this PoC
> >> patch set together to try to get a better understanding. This series ends up
> >> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
> >> much testing beyond that (yet) but thought it was worth posting on the strength
> >> of that result alone.
> >>
> >> At a high level this works in a similar way to Chris's series; it marks a
> >> cluster as being for a particular order and if a new cluster cannot be allocated
> >> then it scans through the existing non-full clusters. But it does it by scanning
> >> through the clusters rather than assembling them into a list. Cluster flags are
> >> used to mark clusters that have been scanned and are known not to have enough
> >> contiguous space, so the efficiency should be similar in practice.
> >>
> > >> Because it's not based around a linked list, there is less churn and I'm
> >> wondering if this is perhaps easier to review and potentially even get into
> >> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
> >> for Chris's series? I know Chris has a larger roadmap of improvements, so at
> > >> best I see this as a tactical fix that will ultimately be superseded by Chris's
> >> work.
> >>
> >> There are a few differences to note vs Chris's series:
> >>
> >> - order-0 fallback scanning is still allowed in any cluster; the argument in the
> >>   past was that swap should always use all the swap space, so I've left this
> > >>   mechanism in. It is only a fallback though; first the new per-order
> >>   scanner is invoked, even for order-0, so if there are free slots in clusters
> >>   already assigned for order-0, then the allocation will go there.
> >>
> > >> - CPUs can steal slots from other CPUs' current clusters; those clusters remain
> >>   scannable while they are current for a CPU and are only made unscannable when
> >>   no more CPUs are scanning that particular cluster.
> >>
> >> - I'm preferring to allocate a free cluster ahead of per-order scanning, since,
> >>   as I understand it, the original intent of a per-cpu current cluster was to
> >>   get pages for an application adjacent in the swap to speed up IO.
> >>
> >> I'd be keen to hear if you think we could get something like this into v6.10 to
> >> fix the mess - I'm willing to work quickly to address comments and do more
> >> testing. If not, then this is probably just a distraction and we should
> >> concentrate on Chris's series.
> >
> > Ryan, thank you very much for accomplishing this.
> >
> > I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of
> > add_to_swap() for both your approach and Chris's. I will update you with
> > the results ASAP.
>
> Ahh great - look forward to the results!

Essentially, we are measuring two types of latency:
* Small folio swap allocation
* Large folio swap allocation

The concept code looks like this:

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 994723cef821..a608b916ed2f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,10 +185,18 @@ bool add_to_swap(struct folio *folio)
        VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
        VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);

+       start_time = ktime_get();
+
        entry = folio_alloc_swap(folio);
        if (!entry.val)
                return false;

+       end_time = ktime_get();
+       if (folio_test_large(folio))
+               trace_large_swap_allocation_latency(ktime_sub(end_time, start_time));
+       else
+               trace_small_swap_allocation_latency(ktime_sub(end_time, start_time));
+
        /*
         * XArray node allocations from PF_MEMALLOC contexts could
         * completely exhaust the page allocator. __GFP_NOMEMALLOC


Then, we'll generate histograms for both large and small allocation
latency. We're currently encountering some setup issues. Once we have
the data, I'll provide updates to you and Chris. Additionally, I noticed
some comments suggesting that Chris's patch might negatively impact the
swap allocation latency of small folios. Perhaps the data can help
clarify this.
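
For illustration only, the histogram step could be sketched in user space
roughly as below. This is not part of any patch: it assumes power-of-two
nanosecond buckets, and the names lat_hist, lat_bucket(), lat_hist_record()
and lat_hist_print() are hypothetical helpers made up for this sketch. One
histogram would be kept for large folios and one for small folios, fed from
the latencies reported by the two tracepoints.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: bucket i counts latencies in [2^i, 2^(i+1)) ns. */
#define LAT_BUCKETS 32

struct lat_hist {
	uint64_t count[LAT_BUCKETS];
};

/*
 * Map a latency in ns to its log2 bucket. 0 ns and 1 ns both land in
 * bucket 0; anything too large is clamped into the last bucket.
 */
static unsigned int lat_bucket(uint64_t ns)
{
	unsigned int b = 0;

	while (ns > 1 && b < LAT_BUCKETS - 1) {
		ns >>= 1;
		b++;
	}
	return b;
}

/* Account one sample, e.g. one traced folio_alloc_swap() latency. */
static void lat_hist_record(struct lat_hist *h, uint64_t ns)
{
	assert(h);
	h->count[lat_bucket(ns)]++;
}

/* Print the non-empty buckets, one line per bucket. */
static void lat_hist_print(const struct lat_hist *h, const char *name)
{
	unsigned int i;

	assert(h);
	printf("%s allocation latency (ns):\n", name);
	for (i = 0; i < LAT_BUCKETS; i++) {
		if (!h->count[i])
			continue;
		printf("  >= %10llu : %llu\n",
		       (unsigned long long)(1ULL << i),
		       (unsigned long long)h->count[i]);
	}
}
```

Log2 buckets are just one choice; they keep the table small while still
separating a fast per-cpu cluster hit from a slow scan, which is the
difference we care about here.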

>
> >
> > I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has
> > caused a couple of crashes.
> >
> >>
> >> This applies on top of v6.10-rc4.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@xxxxxxxxxx/
> >> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@xxxxxxxxx/
> >>
> >> Thanks,
> >> Ryan
> >>
> >> Ryan Roberts (5):
> >>   mm: swap: Simplify end-of-cluster calculation
> >>   mm: swap: Change SWAP_NEXT_INVALID to highest value
> >>   mm: swap: Track allocation order for clusters
> >>   mm: swap: Scan for free swap entries in allocated clusters
> >>   mm: swap: Optimize per-order cluster scanning
> >>
> >>  include/linux/swap.h |  18 +++--
> >>  mm/swapfile.c        | 164 ++++++++++++++++++++++++++++++++++++++-----
> >>  2 files changed, 157 insertions(+), 25 deletions(-)
> >>
> >> --
> >> 2.43.0
> >>
>

Thanks
Barry