Chris Li <chrisl@xxxxxxxxxx> writes:

> On Mon, Jun 24, 2024 at 7:36 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Chris Li <chrisl@xxxxxxxxxx> writes:
>>
>> > On Wed, Jun 19, 2024 at 7:32 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >>
>> >> Chris Li <chrisl@xxxxxxxxxx> writes:
>> >>
>> >> > This is the short-term solution "swap cluster order" listed
>> >> > in my "Swap Abstraction" discussion, slide 8, at the recent
>> >> > LSF/MM conference.
>> >> >
>> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> >> > orders" was introduced, it only allocated the mTHP swap entries
>> >> > from the new empty cluster list. This has a fragmentation issue
>> >> > reported by Barry.
>> >> >
>> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
>> >> >
>> >> > The reason is that all the empty clusters have been exhausted while
>> >> > there are plenty of free swap entries in the clusters that are
>> >> > not 100% free.
>> >> >
>> >> > Remember the swap allocation order in the cluster.
>> >> > Keep track of the per-order non-full cluster list for later allocation.
>> >> >
>> >> > User impact: for users that allocate and free mixed-order mTHP swap
>> >> > entries, it greatly improves the success rate of the mTHP swap
>> >> > allocation after the initial phase.
>> >> >
>> >> > Barry provides a test program to show the effect:
>> >> > https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@xxxxxxxxx/
>> >> >
>> >> > Without:
>> >> > $ mthp-swapout
>> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 5: swpout inc: 110, swpout fallback inc: 117, Fallback percentage: 51.54%
>> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
>> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
>> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
>> >> > Iteration 10: swpout inc: 0, swpout fallback inc: 216, Fallback percentage: 100.00%
>> >> > Iteration 11: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
>> >> > Iteration 12: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
>> >> > Iteration 13: swpout inc: 0, swpout fallback inc: 214, Fallback percentage: 100.00%
>> >> >
>> >> > $ mthp-swapout -s
>> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 5: swpout inc: 33, swpout fallback inc: 197, Fallback percentage: 85.65%
>> >> > Iteration 6: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>> >> > Iteration 7: swpout inc: 0, swpout fallback inc: 223, Fallback percentage: 100.00%
>> >> > Iteration 8: swpout inc: 0, swpout fallback inc: 219, Fallback percentage: 100.00%
>> >> > Iteration 9: swpout inc: 0, swpout fallback inc: 212, Fallback percentage: 100.00%
>>
>> >
>> >> > With:
>> >> > $ mthp-swapout
>> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 2: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 4: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 6: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > ...
>> >> > Iteration 94: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 95: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 96: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 97: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 98: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 100: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> >
>> >> > $ mthp-swapout -s
>> >> > Iteration 1: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 2: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 4: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 5: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 6: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 7: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 8: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > ...
>> >> > Iteration 94: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 95: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 96: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 97: swpout inc: 220, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 98: swpout inc: 216, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >> > Iteration 100: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
>> >>
>> >> Unfortunately, the data was obtained using a specially designed test
>> >> program which always swaps in pages at the swapped-out size. I don't
>> >> know whether such workloads exist in reality. Otherwise, you need to
>> >> wait for mTHP
>> >
>> > The test program is designed to simulate mTHP swap behavior using
>> > zsmalloc and 64KB buffers.
>> > If we insist on only designing for existing workloads, then the
>> > zsmalloc 64KB buffer usage will never be able to run, exactly because
>> > the kernel has a high failure rate allocating swap entries for 64KB.
>> > There is a bit of a chicken-and-egg problem there: such a usage cannot
>> > exist because the kernel can't support it yet, and the kernel can't
>> > add patches to support it because such simulation tests are not
>> > "real".
>> >
>> > We need to break this cycle to support something new.
>> >
>> >> swap-in to be merged first, and people reach consensus that we
>> >> should always swap in pages at the swapped-out size.
>> >
>> > It doesn't have to be "always". We can identify the situations that
>> > make sense. For the zram/zsmalloc 64K buffer usage case, swapping out
>> > at the same size as the swap-in makes sense.
>> > I think we have agreement that such zsmalloc 64K usage cases are
>> > something we do want to support.
>> >
>> >>
>> >> Alternatively, we can make some design adjustment to make the
>> >> patchset work in the current situation (mTHP swap-out, normal page
>> >> swap-in).
>> >>
>> >> - One non-full cluster list for each order (same as current design)
>> >>
>> >> - When one swap entry is freed, check whether one "order+1" swap
>> >>   entry becomes free, if so, move the cluster to the "order+1"
>> >>   non-full cluster list.
>> >
>> > In the intended zsmalloc usage case, there is no order+1 swap entry
>> > request.
>>
>> This is my main concern about this series. Only the Android use cases
>> are considered. The general use cases are just ignored. Is it hard to
>> consider or test a normal swap partition on your development machine?
>
> Please see the V4 cover letter. The V4 already has the SSD, zram and
> HDD stress testing.
> Of course I want to make sure the allocator works well with Barry's
> mTHP test case as well.
>
>> > Moving the cluster to "order+1" will make fewer clusters available
>> > for "order". For that usage case it is a negative gain.
>>
>> The "order+1" clusters can be used for "order" allocations when the
>> existing "order" clusters are used up.
>>
>> And in this way, we can protect clusters with more free space so that
>> they may become free.
>>
>> >> - When allocating a swap entry with "order", get a cluster from the
>> >>   free, "order", "order+1", ... non-full cluster lists. If all are
>> >>   empty, fall back to
>> >
>> > I don't see that it is useful for the zsmalloc 64K buffer usage case.
>> > There will be order 0 and order 4 and nothing else.
>> >
>> > How about we keep it simple for now? If we identify some workload
>> > this algorithm can help, we can do that as a follow-up step.
>>
>> The simple design isn't flexible enough for your workloads either. For
>> example,
>>
>> - Initially, almost only order-0 pages are swapped out, so most
>>   non-full clusters are order-0.
>>
>> - Later, quite some order-0 swap entries are freed so that quite some
>>   order-4 swap entries become available.
>>
>> - Order-4 pages need to be swapped out, but not enough order-4
>>   non-full clusters are available.
>>
>> So, we need a way to migrate non-full clusters among orders to adjust
>> to the situation automatically.
>
> It depends on how lucky we are to form the order-4 clusters naturally.
> The odds of forming an order-4 cluster naturally in the random swap
> allocation/free case are very low. I have the numbers in my other
> email thread.
> Anyway, if we are convinced this pays off for the complexity it
> introduces, we can do that as a follow-up step. Try to keep things
> simple at first for the benefit of review.
>
>> >>   order 0.
>> >>
>> >> Do you think that this works?
>> >>
>> >> > Reported-by: Barry Song <21cnbao@xxxxxxxxx>
>> >> > Signed-off-by: Chris Li <chrisl@xxxxxxxxxx>
>> >> > ---
>> >> > Changes in v3:
>> >> > - Using V1 as base.
>> >> > - Rename "next" to "list" for the list field, suggested by Ying.
>> >> > - Update comment for the locking rules for cluster fields and list,
>> >> >   suggested by Ying.
>> >> > - Allocate from the nonfull list before attempting free list,
>> >> >   suggested by Kairui.
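To make the design being debated here concrete, below is a minimal
userspace model of per-order non-full cluster lists combined with the
"order+1" promotion and fallback sketched above. This is an
illustration only: SWAPFILE_CLUSTER and SWAP_NR_ORDERS mirror the
kernel constants, but the flat cluster array, the advertised "order"
field standing in for the real per-order lists, and every function name
below are simplifications invented for the example, not code from the
patch.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SWAPFILE_CLUSTER 512	/* 4K slots per 2M cluster */
#define SWAP_NR_ORDERS   5	/* orders 0..4, i.e. 4K .. 64K */
#define NR_CLUSTERS      4

struct cluster {
	unsigned char used[SWAPFILE_CLUSTER];
	int order;		/* order whose "non-full list" it sits on */
};

static struct cluster clusters[NR_CLUSTERS];

/* Find a naturally aligned free range of 1 << order slots, or -1. */
static int find_free_range(struct cluster *ci, int order)
{
	int n = 1 << order;

	for (int off = 0; off + n <= SWAPFILE_CLUSTER; off += n) {
		bool vacant = true;

		for (int i = 0; i < n && vacant; i++)
			vacant = !ci->used[off + i];
		if (vacant)
			return off;
	}
	return -1;
}

/* Raise the advertised order while ever larger aligned ranges are free. */
static void maybe_promote(struct cluster *ci)
{
	while (ci->order + 1 < SWAP_NR_ORDERS &&
	       find_free_range(ci, ci->order + 1) >= 0)
		ci->order++;
}

/*
 * Allocate an order-sized range.  Clusters advertised at "order" are
 * tried first, then "order + 1" and so on, i.e. the fallback described
 * above.
 */
static struct cluster *alloc_range(int order, int *off_out)
{
	for (int want = order; want < SWAP_NR_ORDERS; want++) {
		for (int i = 0; i < NR_CLUSTERS; i++) {
			struct cluster *ci = &clusters[i];
			int off;

			if (ci->order != want)
				continue;
			off = find_free_range(ci, order);
			if (off < 0)
				continue;
			memset(ci->used + off, 1, 1 << order);
			ci->order = order;	/* now serving this order */
			*off_out = off;
			return ci;
		}
	}
	return NULL;	/* caller would fall back to a free cluster / order 0 */
}

static void free_range(struct cluster *ci, int off, int order)
{
	memset(ci->used + off, 0, 1 << order);
	maybe_promote(ci);	/* freed space may re-enable higher orders */
}

int main(void)
{
	int off0, off4;

	for (int i = 0; i < NR_CLUSTERS; i++)
		clusters[i].order = SWAP_NR_ORDERS - 1;	/* empty, order 4 */

	struct cluster *c0 = alloc_range(0, &off0);	/* order-0 request */
	alloc_range(4, &off4);				/* order-4 still works */
	printf("order-0 at %d (cluster now order %d), order-4 at %d\n",
	       off0, c0->order, off4);

	free_range(c0, off0, 0);	/* freeing promotes the cluster */
	printf("after free, cluster advertises order %d\n", c0->order);
	return 0;
}

Running it, the order-0 request is satisfied from a cluster still
advertised at order 4, and freeing the entry promotes the cluster back
up, which is the migration among orders discussed above.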
>> >>
>> >> Haven't looked into this. It appears that this breaks the original
>> >> discard behavior which helps the performance of some SSDs, please
>> >> refer to
>> >
>> > Can you clarify whether by "discard" you mean the SSD discard command
>> > or just the way the swap allocator recycles free clusters?
>>
>> The SSD discard command, like in the following URL,
>>
>> https://en.wikipedia.org/wiki/Trim_(computing)
>
> Thanks. I know what an SSD discard command is. I want to understand
> why that behavior is preferred.
>
> So the reasoning to prefer a new free block rather than a recently
> partially-freed cluster is to let the previously written cluster have
> a higher chance to issue the discard command?
>
> This preferred-new-block behavior is actually not friendly to the SSD
> from a wear point of view.
> Take this example:
> Let's say data needs to be allocated and freed from swap. At any given
> time the swap usage is 1G. The swap SSD drive is 16G.
> Let's say the allocations and frees are at random 4K page locations.
> There is 64G of swap data in total to write to swap, but at any given
> time only 1G of data occupies the swapfile.
>
> a) If you always prefer new free blocks, then the swap data will
> eventually be written across all 16G of the drive, followed by random
> writes to the full 16G. The chance of forming a free cluster, so that
> a discard command can be issued, is very low: (15/16)**512 = 4.4E-15.
> From the SSD's point of view, it does not know that most of the data
> written to the 16G drive is no longer used. When a page is freed in
> the swapfile, the SSD doesn't know about it. It sees 4K random writes
> to all 16G of the drive, 64G of data written in total.
>
> b) If you always prefer a non-full cluster over a new cluster, the 64G
> of data will concentrate the random writes in the first 1G of the
> drive. Total 64G of data written.
>
> I consider b) more friendly to the SSD than a), because it
> concentrates the writes in the first 1G. The SSD can know that the
> data overwritten in that 1G is internally obsolete, so it can GC the
> overwritten data internally without a discard command. Whereas in a),
> random 4K writes hit the whole drive without much discard at all. A
> full SSD doing random writes is a bad combination from a wear point of
> view.
>
> Just my 2 cents. Anyway, I reverted V4 to use free clusters before
> non-full clusters, just to behave the same as before.
>
>> >> commit 2a8f94493432 ("swap: change block allocation algorithm for SSD").
>> >
>> > I did read that change log. Help me understand in more detail which
>> > discard behavior you have in mind. A lot of low-end micro SD cards
>> > have proper FTL wear leveling now; SSDs are even better at that.
>>
>> It's not FTL, it's discard/trim for SSD as above.
>
> Thanks for the clarification.
>
>> >>
>> >> And as pointed out by Ryan, this may reduce the opportunity for
>> >> sequential block device writing during swap-out, which may hurt the
>> >> performance of SSDs too.
>> >
>> > Only in the initial phase. If the swap IO continues, after the first
>> > pass fills up the swapfile, the writes will be random on the
>> > swapfile anyway, because the swapfile only issues 2M discard
>> > commands when all 512 4K pages are free. The discarded area will be
>> > much smaller than the free area on the swapfile. That, combined with
>> > random page writes across the whole swapfile, might produce worse
>> > internal write amplification for the SSD, compared to only writing a
>> > subset of the swapfile area. I would love to hear from someone who
>> > understands SSD internals to confirm or deny my theory.
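As an aside on the (15/16)**512 figure quoted above: with 1G of live
data spread uniformly over a 16G device, each 4K slot is occupied with
probability 1/16, so a 512-slot cluster is entirely free with
probability (15/16)^512, assuming slots are independent. The few lines
of C below are purely illustrative arithmetic, not kernel code, and
reproduce the number (link with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* 1G live data on a 16G device: each 4K slot free with p = 15/16 */
	double p_cluster_free = exp(512.0 * log(15.0 / 16.0));

	/* prints about 4.46e-15, matching the ~4.4E-15 quoted above */
	printf("P(512-slot cluster fully free) = %.2e\n", p_cluster_free);
	return 0;
}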
>>
>> It depends on the workload. Some workloads will have more severe
>> fragmentation than others. For example, on quite some machines, the
>> swap devices will be far from being full, to avoid possible OOM.
>
> I suspect most of the SSD swap on client devices nowadays is only
> there as a backup, just in case something needs to be swapped.
> There is not much SSD swap IO during normal use. zram and zswap are
> more actively used in the data center and Android phone cases, from a
> swap IO ops point of view.

I use a Linux laptop with 16GB DRAM for work, and I found that the swap
space is almost always used.

>>
>> > Even if we assume the SSD wants a free block over a non-full cluster
>> > first, zswap and zram swap are not subject to SSD properties. We
>> > might want to have a kernel option to select using non-full clusters
>> > over free ones for zram and zswap (ghost swapfile). That will help
>> > contain the fragmented swap area.
>>
>> I doubt that it will help fragmentation avoidance much. Please prove
>> its effectiveness with data first. It can be a further optimization
>> patch in the series.
>
> Take the above example of 1GB of data written to a 16GB drive: a) will
> fragment the whole 16GB drive,
> b) will concentrate on the first 1GB of locations that were used.
>
>>
>> Even if we really need it, we can try to do it without a kernel
>> option. For example, detect whether we are using zram and enable it
>> for zram automatically (through a general flag).
>
> For zswap you need an option to choose, because it can write to the
> real swapfile as well.
> Do you optimize the swap allocator for zswap or for the physical
> swapfile?

--
Best Regards,
Huang, Ying