Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order

On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Hi, Chris,
>
> Chris Li <chrisl@xxxxxxxxxx> writes:
>
> > This is the short term solution "swap cluster order" listed
> > in slide 8 of my "Swap Abstraction" discussion at the recent
> > LSF/MM conference.
> >
> > Since commit 845982eb264bc ("mm: swap: allow storage of all mTHP
> > orders") was introduced, mTHP swap entries are only allocated from
> > clusters on the new empty cluster list.  That has a fragmentation
> > issue reported by Barry.
> >
> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
> >
> > The reason is that all the empty clusters have been exhausted while
> > plenty of free swap entries remain in clusters that are not
> > 100% free.
> >
> > The fix: remember the swap allocation order in the cluster and
> > keep a per-order non-full cluster list for later allocation.
> >
> > This series gives the SSD swap allocation a new code path, separate
> > from the HDD allocation. The new allocator uses the cluster lists
> > only and no longer scans swap_map[] globally without a lock.
>
> This sounds good.  Can we use the SSD allocation method for HDD too?
> We may not need a swap entry allocator optimized for HDD.

Yes, that is the plan as well. That way we can completely get rid of
the old scan_swap_map_slots() code.
However, considering the size of the series, let's focus on the
cluster allocation path first and get it tested and reviewed.

For the HDD optimization, mostly just the new block allocation portion
needs a separate code path from the new cluster allocator, so that it
skips the per-CPU cluster allocation.  Allocating from the non-full
list doesn't need to change much.
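
For reference, here is a rough userspace sketch of the per-order
non-full cluster list idea from the cover letter. All identifiers
(swap_cluster, swap_info, nonfull_clusters, SWAP_NR_ORDERS,
cluster_alloc) are illustrative, not the actual names used in the
series:

/*
 * Sketch only: one non-full list per allocation order, plus a free
 * list.  A cluster remembers which order it serves once assigned.
 */
#include <stddef.h>

#define SWAP_NR_ORDERS	10	/* assumed: orders 0..9, up to PMD size */

struct swap_cluster {
	struct swap_cluster *next;	/* link in a free or non-full list */
	unsigned int count;		/* entries allocated in this cluster */
	unsigned int order;		/* order this cluster was assigned */
};

struct cluster_list {
	struct swap_cluster *head;
};

struct swap_info {
	struct cluster_list free_clusters;
	/* one non-full cluster list per allocation order */
	struct cluster_list nonfull_clusters[SWAP_NR_ORDERS];
};

static struct swap_cluster *cluster_pop(struct cluster_list *list)
{
	struct swap_cluster *ci = list->head;

	if (ci)
		list->head = ci->next;
	return ci;
}

/*
 * Preference order when allocating a cluster for @order:
 * 1) a non-full cluster already serving this order,
 * 2) an empty cluster, which then remembers the order.
 * NULL means the caller must fall back to swap cache reclaim.
 */
static struct swap_cluster *cluster_alloc(struct swap_info *si,
					  unsigned int order)
{
	struct swap_cluster *ci = cluster_pop(&si->nonfull_clusters[order]);

	if (ci)
		return ci;
	ci = cluster_pop(&si->free_clusters);
	if (ci)
		ci->order = order;
	return ci;
}

The point is that a cluster that is no longer full goes back on the
list for its order, so a later same-order allocation finds it in O(1)
instead of rescanning swap_map[].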
>
> Hi, Hugh,
>
> What do you think about this?
>
> > This streamlines the swap allocation for SSD. The code matches the
> > execution flow much better.
> >
> > User impact: for users that allocate and free mixed-order mTHP
> > during swapping, it greatly improves the success rate of mTHP swap
> > allocation after the initial phase.
> >
> > It also performs faster when the swapfile is close to full, because
> > the allocator can take a non-full cluster from a list rather than
> > scanning a lot of swap_map entries.
>
> Do you have some test results to prove this?  Or which test below can
> prove this?

The two zram tests are already proving this. The system time
improvement is about 2% on my low CPU count machine.
Kairui has a higher core count machine and the difference is larger
there. The theory is that a higher CPU count means more lock
contention.

The 2% system time number does not sound like much. But consider these
two factors:
1) The swap allocator only takes a small percentage of the overall
   workload.
2) The new allocator does more work.
The old allocator has a time tick budget. It will abort and fail to
find an entry when it runs out of that budget, even though there are
still some free entries in the swapfile.
The new allocator can get to the last few free swap entries if any are
available. If not, it will work harder on swap cache reclaim.
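
To make the budget point concrete, here is a toy model (not the
actual scan_swap_map_slots() code; the budget value and names are
made up):

/*
 * Toy model: the old-style scan walks swap_map[] linearly and gives
 * up after a fixed budget, so it can fail while free slots still
 * exist.  swap_map[i] == 0 means slot i is free.
 */
#define SCAN_BUDGET	256	/* assumed budget, for illustration */

static long scan_allocate(unsigned char *swap_map, long nr_slots)
{
	long budget = SCAN_BUDGET;
	long i;

	for (i = 0; i < nr_slots; i++) {
		if (swap_map[i] == 0)
			return i;	/* found a free slot */
		if (--budget == 0)
			return -1;	/* gave up; free slots may remain */
	}
	return -1;			/* swapfile is truly full */
}

A list-based allocator does not pay that scan cost at all: either a
non-full cluster is on the list and is found immediately, or nothing
suitable exists and it moves straight to swap cache reclaim.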

