Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

"Huang, Ying" <ying.huang@xxxxxxxxx> · Fri, 26 Jul 2024 13:52:02 +0800

Chris Li <chrisl@xxxxxxxxxx> writes:

> On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > If swap entries are freed at random, you need 16 contiguous swap
>> > entries free at the same time at a 16-entry-aligned base location.
>> > The total order-4 free swap space is then much lower than the
>> > order-0 allocatable swap space.
>> > If one entry is free with 50% probability (swapfile half full), then
>> > the probability that 16 contiguous entries are all free is
>> > 0.5^16, about 1.5e-5. If the swapfile is 80% full, that number
>> > drops to 0.2^16, about 6.5e-12.
>>
>> This depends on the workload.  Quite a few workloads show some degree
>> of spatial locality.  For a workload with no spatial locality at all,
>> as above, mTHP may not be a good choice in the first place.
>
> The fragmentation comes from the order-0 entries, not from mTHP. mTHP
> has its own valid use cases, and those should be separate from how you
> use the order-0 entries. That is why I consider that this kind of
> strategy only works in the lucky case. I would much prefer a strategy
> that is guaranteed to work and does not depend on luck.

It seems that you have some perfect solution.  I will study it when you
post it.
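
The arithmetic quoted above can be checked with a short sketch.  It
assumes, as the quoted estimate does, that each swap entry is free
independently with probability (1 - fill ratio); workloads with spatial
locality will not match this model.  The function name is hypothetical.

```python
def prob_aligned_run_free(fill_ratio: float, order: int) -> float:
    """Probability that one aligned run of 2**order entries is entirely free.

    Assumption: every entry is free independently with probability
    (1 - fill_ratio).  This is the random-free model from the quoted
    estimate, not a measurement of any real workload.
    """
    nr_entries = 1 << order
    return (1.0 - fill_ratio) ** nr_entries


half_full = prob_aligned_run_free(0.5, 4)    # 0.5 ** 16
mostly_full = prob_aligned_run_free(0.8, 4)  # 0.2 ** 16

print(f"order-4 run free, 50% full: {half_full:.2e}")
print(f"order-4 run free, 80% full: {mostly_full:.2e}")
```

Both values agree with the figures in the quoted text (about 1.5e-5 and
about 6.5e-12).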

>> >> - Order-4 pages need to be swapped out, but not enough order-4
>> >>   non-full clusters are available.
>> >
>> > Exactly.
>> >
>> >>
>> >> So, we need a way to migrate non-full clusters among orders to adjust to
>> >> the various situations automatically.
>> >
>> > There is no easy way to migrate swap entries to different locations.
>> > That is why I would like to have discontiguous swap entry allocation
>> > for mTHP.
>>
>> We suggest migrating non-full swap clusters among different lists,
>> not swap entries.
>
> Then you have the downside of reducing the total number of high-order
> clusters. By chance, it is much easier to fragment a cluster than to
> defragment one.  The orders of clusters have a natural tendency to
> move down rather than up, given a long enough period of random
> access. We will likely run out of high-order clusters in the long run
> if we don't have any separation of orders.

As in my example above, you may have almost 0 high-order clusters
forever.  So, your solution only works for very specific use cases.
It's not a general solution.
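
To make the two positions above concrete, here is a toy model of
per-order non-full cluster lists.  It is only a sketch, not the kernel's
swap allocator: Cluster, ToySwap, CLUSTER_SLOTS and the migrate flag are
hypothetical names, and the fallback order is an assumption made for
illustration.  Entries are never freed in this model.

```python
from collections import defaultdict

CLUSTER_SLOTS = 512  # hypothetical number of swap entries per cluster


class Cluster:
    def __init__(self, cid):
        self.cid = cid
        self.used = [False] * CLUSTER_SLOTS

    def find_aligned_free(self, order):
        """Return the first aligned base with 2**order free slots, or None."""
        n = 1 << order
        for base in range(0, CLUSTER_SLOTS, n):
            if not any(self.used[base:base + n]):
                return base
        return None


class ToySwap:
    def __init__(self, nr_clusters):
        self.free = [Cluster(i) for i in range(nr_clusters)]
        self.nonfull = defaultdict(list)  # order -> partially used clusters

    def _take(self, cluster, order, base):
        for i in range(base, base + (1 << order)):
            cluster.used[i] = True
        if not all(cluster.used):
            self.nonfull[order].append(cluster)  # cluster now serves `order`
        return cluster.cid, base

    def _from_list(self, lst, order):
        for cluster in list(lst):
            base = cluster.find_aligned_free(order)
            if base is not None:
                lst.remove(cluster)
                return self._take(cluster, order, base)
        return None

    def alloc(self, order, migrate=False):
        # 1. Prefer a non-full cluster already assigned to this order.
        found = self._from_list(self.nonfull[order], order)
        if found is not None:
            return found
        # 2. Otherwise use a completely free cluster.
        if self.free:
            return self._take(self.free.pop(), order, 0)
        # 3. Optionally move a non-full cluster over from another order's list.
        if migrate:
            for other, lst in list(self.nonfull.items()):
                if other != order:
                    found = self._from_list(lst, order)
                    if found is not None:
                        return found
        return None


swap = ToySwap(nr_clusters=1)
first = swap.alloc(0)                     # order-0 entry claims the cluster
strict = swap.alloc(4)                    # strict per-order separation fails
migrated = swap.alloc(4, migrate=True)    # cluster moves to the order-4 list
print(first, strict, migrated)
```

With strict separation, a single order-0 entry pins the whole cluster
and the order-4 request fails; with migration, the same cluster moves to
the order-4 list, which in turn leaves nothing on the order-0 list.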

>> >> But yes, data is needed for any performance related change.
>>
>> BTW: I think "non-full cluster" isn't a good name.  "Partial cluster"
>> is much better and follows the same convention as "partial slab".
>
> I am not opposed to it. The only reason I am holding off on the rename
> is that there are patches from Kairui that I am testing which depend
> on it. Let's finish up the V5 patch with the swap cache reclaim code
> path, then do the renaming as one batch job. We actually have more
> than one list that holds partially full clusters. That helps reduce
> repeated scans of clusters that are not full but also cannot allocate
> swap entries for this order.  Naming just one of them "partial" is
> not precise either, because the clusters on the other lists are also
> partially full. We'd better give them precise names systematically.

I don't think that it's hard to do a search/replace before the next
version.

--
Best Regards,
Huang, Ying