Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <chrisl@xxxxxxxxxx> · Thu, 30 May 2024 14:44:33 -0700

On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Chris Li <chrisl@xxxxxxxxxx> writes:
>
> > Hi Ying,
> >
> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Chris Li <chrisl@xxxxxxxxxx> writes:
> >>
> >> > I am spinning a new version for this series to address two issues
> >> > found in this series:
> >> >
> >> > 1) Oppo discovered a bug in the following line:
> >> > +               ci = si->cluster_info + tmp;
> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> > That is a serious bug but trivial to fix.
> >> >
> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> > the cluster->order.
> >>
> >> IIUC, now, we only scan swap_map[] only if
> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> That is, if you doesn't run low swap free space, you will not do that.
> >
> > You can still swap space in order 0 clusters while order 4 runs out of
> > free_cluster
> > or nonfull_clusters[order]. For Android that is a common case.
>
> When we fail to allocate order 4, we will fallback to order 0.  Still
> don't need to scan swap_map[].  But after looking at your below reply, I
> realized that the swap space is almost full at most times in your cases.
> Then, it's possible that we run into scanning swap_map[].
> list_empty(&si->free_clusters) &&
> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> many clusters in si->percpu_cluster.  So, if we want to avoid to scan
> swap_map[], we can stop add clusters in si->percpu_cluster when swap
> space runs low.  And maybe take clusters out of si->percpu_cluster
> sometimes.

One idea after reading your reply. If we run out of the
nonfull_cluster[order], we should be able to use other cpu's
si->percpu_cluster[] as well. That is a very small win for Android,
because android does not have too many cpu. We are talking about a
handful of clusters, which might not justify the code complexity. It
does not change the behavior that order 0 can pollut higher order.

>
> Another issue is nonfull_cluster[order1] cannot be used for
> nonfull_cluster[order2].  In definition, we should not fail order 0
> allocation, we need to steal nonfull_cluster[order>0] for order 0
> allocation.  This can avoid to scan swap_map[] too.  This may be not
> perfect, but it is the simplest first step implementation.  You can
> optimize based on it further.

Yes, that is listed as the limitation of this cluster order approach.
Initially we need to support one order well first. We might choose
what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
are too big. The sweet spot might be some there in between.  If we can
support one order well, we can demonstrate the value of the mTHP. We
can worry about other mix orders later.

Do you have any suggestions for how to prevent the order 0 polluting
the higher order cluster? If we allow that to happen, then it defeats
the goal of being able to allocate higher order swap entries. The
tricky question is we don't know how much swap space we should reserve
for each order. We can always break higher order clusters to lower
order, but can't do the reserves. The current patch series lets the
actual usage determine the percentage of the cluster for each order.
However that seems not enough for the test case Barry has. When the
app gets OOM kill that is where a large swing of order 0 swap will
show up and not enough higher order usage for the brief moment. The
order 0 swap entry will pollute the high order cluster. We are
currently debating a "knob" to be able to reserve a certain % of swap
space for a certain order. Those reservations will be guaranteed and
order 0 swap entry can't pollute them even when it runs out of swap
space. That can make the mTHP at least usable for the Android case.

Do you see another way to protect the high order cluster polluted by
lower order one?

>
> And, I checked your code again.  It appears that si->percpu_cluster may
> be put in si->nonfull_cluster[], then be used by another CPU.  Please
> check it.

Ah, good point. I think it does. Let me take a closer look.

Chris

Chris