On Thu, May 30, 2024 at 7:37 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Chris Li <chrisl@xxxxxxxxxx> writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > because android does not have too many CPUs. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollute higher order.
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you to do more test and tracing to find out the reason.  I
> suspect that there are some non-full cluster collection issues.

swap_map[] is scanned because we run out of non-full clusters. This can
happen because Android tries to make full use of the swapfile. However,
once the swap_map[] scan happens, the non-full cluster is polluted.

I currently don't have a local reproduction of the issue Barry reported.
However, here is one data point: with two swap files, one used for high
order allocation only, this patch shows no fallback. If there were a
non-full cluster collection issue, we would see the fallback in this
case as well.

BTW, with the same setup but without this patch series, the high order
allocation falls back as well.

>
> >> Another issue is nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2].  In definition, we should not fail order 0
> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
> >> allocation.  This can avoid to scan swap_map[] too.  This may be not
> >> perfect, but it is the simplest first step implementation.  You can
> >> optimize based on it further.
> >
> > Yes, that is listed as a limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > which order that is, a 16K or 64K folio. 4K pages are too small, 2M
> > pages are too big. The sweet spot might be somewhere in between. If
> > we can support one order well, we can demonstrate the value of mTHP.
> > We can worry about other mixed orders later.
> >
> > Do you have any suggestions for how to prevent order 0 polluting the
> > higher order clusters? If we allow that to happen, it defeats the
> > goal of being able to allocate higher order swap entries. The tricky
> > question is that we don't know how much swap space we should reserve
> > for each order. We can always break higher order clusters into lower
> > order, but we can't do the reverse. The current patch series lets the
> > actual usage determine the percentage of clusters for each order.
> > However, that seems not to be enough for the test case Barry has.
> > When the app gets OOM killed, a large swing of order 0 swap shows up
> > with not enough higher order usage for that brief moment. The order 0
> > swap entries then pollute the high order clusters. We are currently
> > debating a "knob" to be able to reserve a certain % of swap space for
> > a certain order. Those reservations would be guaranteed, and order 0
> > swap entries could not pollute them even when swap space runs out.
> > That can make mTHP at least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen, we must keep it optimized.  And, OOM with free swap space isn't
> acceptable.  Please consider the policy we used for page allocation.

We need to make both order-0 and high order allocation work after the
initial pass of allocating from empty clusters. Having only order-0
allocation working is not good enough.
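To make the fallback concrete, below is a toy userspace model of the
allocation order I am describing. It is not the allocator code from the
patch series; the struct, the lists and the numbers are all made up, but
it shows how an order-0 allocation ends up taking space from a high
order cluster (the swap_map[] scan case) once its own nonfull list and
the empty cluster list are exhausted:

#include <stdio.h>

#define NR_ORDERS 5     /* toy model: orders 0..4 */

struct toy_swap {
        int free_clusters;        /* completely empty clusters           */
        int nonfull[NR_ORDERS];   /* clusters with free space, per order */
};

/*
 * Toy order-0 allocation: try the order-0 nonfull list, then an empty
 * cluster, then fall back to a swap_map[]-style scan that ignores the
 * cluster order.  To keep the model tiny, a cluster is retired from its
 * list after a single use.
 */
static const char *toy_alloc_order0(struct toy_swap *si)
{
        if (si->nonfull[0] > 0) {
                si->nonfull[0]--;
                return "order-0 nonfull cluster";
        }
        if (si->free_clusters > 0) {
                si->free_clusters--;
                return "empty cluster (now dedicated to order 0)";
        }
        for (int o = 1; o < NR_ORDERS; o++) {
                if (si->nonfull[o] > 0) {
                        /* the scan lands in a high order cluster */
                        si->nonfull[o]--;
                        return "swap_map[] scan: high order cluster polluted";
                }
        }
        return "allocation fails (swap is really full)";
}

int main(void)
{
        /* one order-0 cluster, one empty cluster, four order-4 clusters */
        struct toy_swap si = { .free_clusters = 1, .nonfull = { 1, 0, 0, 0, 4 } };

        for (int i = 0; i < 4; i++)
                printf("order-0 alloc %d -> %s\n", i, toy_alloc_order0(&si));
        return 0;
}

The last two allocations in this example still succeed, but each one
takes space out of a cluster that was serving order 4, which is exactly
the pollution Barry's test case runs into.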
On the page allocation side, we have hugetlbfs, which reserves some
memory for high order pages. We should have something similar to allow
reserving some high order swap entries without them getting polluted by
low order ones.

>
> > Do you see another way to protect the high order cluster polluted by
> > lower order one?
>
> If we use high-order page allocation as reference, we need something
> like compaction to guarantee high-order allocation finally.  But we are
> too far from that.

We should consider reservation for high-order swap entry allocation,
similar to what hugetlbfs does for memory. Swap compaction would be very
complicated because it needs to scan the PTEs to migrate the swap
entries. It might be easier to support writing a folio out to compound,
discontiguous swap entries. That is another way to address the
fragmentation issue. We are also too far from that right now.

> For specific configuration, I believe that we can get reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only do limited maximum number order-0 swap entries
> allocation, can we keep high-order clusters?

Yes, we can, by having a knob to reserve some high order swap space.
Limiting order 0 is effectively the same as reserving some high order
swap entries. That is a short term solution.

Chris
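P.S. Here is a rough userspace sketch of the reservation check I have in
mind. Again, this is not patch code; the knob, the struct and the
numbers are made up. The only point is that order 0 gets refused the
empty clusters that back the high order reservation, similar to how
hugetlbfs pages are kept away from normal allocations:

#include <stdbool.h>
#include <stdio.h>

#define NR_ORDERS 5     /* toy model: orders 0..4 */

struct toy_swap {
        int free_clusters;          /* empty clusters still available    */
        int reserved[NR_ORDERS];    /* clusters guaranteed to each order */
};

/*
 * May a request of @order take one of the remaining empty clusters?
 * Enough empty clusters must stay behind to honor the reservations of
 * all the other orders.  (Consuming a reservation when its own order
 * allocates is left out to keep the sketch short.)
 */
static bool toy_can_take_empty(struct toy_swap *si, int order)
{
        int reserved_for_others = 0;

        for (int o = 0; o < NR_ORDERS; o++)
                if (o != order)
                        reserved_for_others += si->reserved[o];

        return si->free_clusters > reserved_for_others;
}

int main(void)
{
        /* 10 empty clusters, 4 of them reserved for order-4 entries */
        struct toy_swap si = { .free_clusters = 10, .reserved = { 0, 0, 0, 0, 4 } };

        while (toy_can_take_empty(&si, 0))
                si.free_clusters--;     /* order 0 allocates until refused */

        printf("empty clusters left for order 4: %d\n", si.free_clusters);
        return 0;
}

In this toy run order 0 stops once 4 empty clusters remain: it hits its
own limit before touching the reserve, which is the "limiting order 0 is
the same as reserving high order entries" point above.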