Re: [PATCH v3 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <chrisl@xxxxxxxxxx> · Fri, 21 Jun 2024 01:43:49 -0700

On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Chris Li <chrisl@xxxxxxxxxx> writes:
>
> > This is the short term solutiolns "swap cluster order" listed
> > in my "Swap Abstraction" discussion slice 8 in the recent
> > LSF/MM conference.
> >
> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > orders" is introduced, it only allocates the mTHP swap entries
> > from new empty cluster list.  It has a fragmentation issue
> > reported by Barry.
> >
> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
> >
> > The reason is that all the empty cluster has been exhausted while
> > there are planty of free swap entries to in the cluster that is
> > not 100% free.
> >
> > Remember the swap allocation order in the cluster.
> > Keep track of the per order non full cluster list for later allocation.
> >
> > User impact: For users that allocate and free mix order mTHP swapping,
> > It greatly improves the success rate of the mTHP swap allocation after the
> > initial phase.
> >
> > Barry provides a test program to show the effect:
> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@xxxxxxxxx/
> >
> > Without:
> > $ mthp-swapout
> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54%
> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00%
> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00%
> >
> > $ mthp-swapout -s
> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65%
> > Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
> > Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
> > Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00%
> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
> >
> > With:
> > $ mthp-swapout
> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
> > ...
> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> >
> > $ mthp-swapout -s
> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> > ...
> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
>
> Unfortunately, the data is gotten using a special designed test program
> which always swap-in pages with swapped-out size.  I don't know whether
> such workloads exist in reality.  Otherwise, you need to wait for mTHP

The test program is designed to simulate mTHP swap behavior using
zsmalloc and 64KB buffer.
If we insist on only designing for existing workloads, then zsmalloc
using 64KB buffer usage will never be able to run, exactly due the
kernel has high failure rate allocating swap entries for 64KB. There
is a bit of a chick and egg problem there, such a usage can not exist
because the kernel can't support it yet. Kernel can't add patches to
support it because such simulation tests are not "real".

We need to break this cycle to support something new.

> swap-in to be merged firstly, and people reach consensus that we should
> always swap-in pages with swapped-out size.

We don't have to be always. We can identify the situation that makes
sense. For the zram/zsmalloc 64K buffer usage case, swap out as the
same swap in size makes sense.
I think we have agreement on such zsmalloc 64K usage cases we do want
to support.

>
> Alternately, we can make some design adjustment to make the patchset
> work in current situation (mTHP swap-out, normal page swap-in).
>
> - One non-full cluster list for each order (same as current design)
>
> - When one swap entry is freed, check whether one "order+1" swap entry
>   becomes free, if so, move the cluster to "order+1" non-full cluster
>   list.

In the intended zsmalloc usage case, there is no order+1 swap entry request.
Moving the cluster to "order+1" will make less cluster available for "order".
For that usage case it is negative gain.

> - When allocate swap entry with "order", get cluster from free, "order",
>   "order+1", ... non-full cluster list.  If all are empty, fallback to

I don't see that it is useful for the zsmalloc 64K buffer usage case.
There will be order 0 and order 4 and nothing else.

How about let's keep it simple for now. If we identify some workload
this algorithm can help. We can do that as a follow up step.

>   order 0.
>
> Do you think that this works?
>
> > Reported-by: Barry Song <21cnbao@xxxxxxxxx>
> > Signed-off-by: Chris Li <chrisl@xxxxxxxxxx>
> > ---
> > Changes in v3:
> > - Using V1 as base.
> > - Rename "next" to "list" for the list field, suggested by Ying.
> > - Update comment for the locking rules for cluster fields and list,
> >   suggested by Ying.
> > - Allocate from the nonfull list before attempting free list, suggested
> >   by Kairui.
>
> Haven't looked into this.  It appears that this breaks the original
> discard behavior which helps performance of some SSD, please refer to

Can you clarify by "discard" you mean SSD discard command or just the
way swap allocator recycles free clusters?

> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD").

I did read that change log. Help me understand in more detail which
discard behavior you have in mind. A lot of low end micro SD cards
have proper FTL wear leveling now, ssd even better on that.

> And as pointed out by Ryan, this may reduce the opportunity of the
> sequential block device writing during swap-out, which may hurt
> performance of SSD too.

Only at the initial phase. If the swap IO continues, after the first
pass fills up the swap file, the write will be random on the swapfile
anyway. Because the swapfile only issues 2M discards commands when all
512 4K pages are free. The discarded area will be much smaller than
the free area on swapfile. That combined with the random write page on
the whole swap file. It might produce a worse internal write
amplification for SSD, compared to only writing a subset of the
swapfile area. I would love to hear from someone who understands SSD
internals to confirm or deny my theory.

Even let's assume the SSD wants a free block over a nonfull cluster
first. Zswap and zram swap are not subject to SSD property. We might
want to have a kernel option to select using  nonfree clusters over
the free one for zram and zswap (ghost swapfile). That will help
contain the fragmented swap area.

Chris