On Tue, Jun 4, 2024 at 12:29 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Kairui Song <ryncsn@xxxxxxxxx> writes:
>
> > On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >
> > Doesn't limiting order-0 allocation break the bottom line that order-0
> > allocation is a first-class citizen and should not fail if there is
> > space?
>
> Sorry for the confusing words. I mean limiting the maximum number of
> order-0 swap entries allocated in workloads, instead of limiting that in
> the kernel.

What interface does it use to limit the order-0 swap entries? I was
thinking the kernel would enforce the high-order swap space reservation,
just like hugetlbfs does for huge pages. We will need to introduce some
interface to specify the reservation.

> > Just my two cents...
> >
> > I had a try locally based on Chris's work, allowing order 0 to use
> > nonfull_clusters as Ying has suggested, starting with the lowest order
> > and increasing the order until nonfull_clusters[order] is not empty;
> > that way, higher orders are better protected, because unless we run
> > out of both free_clusters and nonfull_clusters, a direct scan won't
> > happen.
> >
> > More concretely, I applied the following changes, which didn't change
> > the code much:
> > - In scan_swap_map_try_ssd_cluster, check nonfull_clusters first, then
> >   free_clusters, then discard_clusters.
> > - If it's order 0, also check nonfull_clusters[i] for each order
> >   (for (int i = 0; i < SWAP_NR_ORDERS; ++i)) before
> >   scan_swap_map_try_ssd_cluster returns false.
> >
> > A quick test, still using the memtier test, but with the swap device
> > size decreased from 10G to 8G for higher pressure.
> >
> > Before:
> > hugepages-32kB/stats/swpout:34013
> > hugepages-32kB/stats/swpout_fallback:266
> > hugepages-512kB/stats/swpout:0
> > hugepages-512kB/stats/swpout_fallback:77
> > hugepages-2048kB/stats/swpout:0
> > hugepages-2048kB/stats/swpout_fallback:1
> > hugepages-1024kB/stats/swpout:0
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:35088
> > hugepages-64kB/stats/swpout_fallback:66
> > hugepages-16kB/stats/swpout:31848
> > hugepages-16kB/stats/swpout_fallback:402
> > hugepages-256kB/stats/swpout:390
> > hugepages-256kB/stats/swpout_fallback:7244
> > hugepages-128kB/stats/swpout:28573
> > hugepages-128kB/stats/swpout_fallback:474
> >
> > After:
> > hugepages-32kB/stats/swpout:31448
> > hugepages-32kB/stats/swpout_fallback:3354
> > hugepages-512kB/stats/swpout:30
> > hugepages-512kB/stats/swpout_fallback:33
> > hugepages-2048kB/stats/swpout:2
> > hugepages-2048kB/stats/swpout_fallback:0
> > hugepages-1024kB/stats/swpout:0
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:31255
> > hugepages-64kB/stats/swpout_fallback:3112
> > hugepages-16kB/stats/swpout:29931
> > hugepages-16kB/stats/swpout_fallback:3397
> > hugepages-256kB/stats/swpout:5223
> > hugepages-256kB/stats/swpout_fallback:2351
> > hugepages-128kB/stats/swpout:25600
> > hugepages-128kB/stats/swpout_fallback:2194
> >
> > The high-order (256kB) swapout rate is significantly higher, and 512kB
> > swapout is now possible, which indicates that higher orders are better
> > protected; lower orders are sacrificed, but that seems worth it.
>
> Yes. I think that this reflects another aspect of the problem. In some
> situations, it's better to steal one high-order cluster and use it for
> order-0 allocations than to scatter order-0 allocations across random
> high-order clusters.

Agree, the scan loop on swap_map[] has the worst pollution.

Chris