Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <chrisl@xxxxxxxxxx> · Fri, 7 Jun 2024 11:57:05 -0700

On Fri, Jun 7, 2024 at 3:49 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 30/05/2024 08:49, Barry Song wrote:
> > On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >>
> >> I am spinning a new version for this series to address two issues
> >> found in this series:
> >>
> >> 1) Oppo discovered a bug in the following line:
> >> +               ci = si->cluster_info + tmp;
> >> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> That is a serious bug but trivial to fix.
> >>
> >> 2) order 0 allocation currently blindly scans swap_map disregarding
> >> the cluster->order. Given enough order 0 swap allocations(close to the
> >> swap file size) the order 0 allocation head will eventually sweep
> >> across the whole swapfile and destroy other cluster order allocations.
> >>
> >> The short term fix is just skipping clusters that are already assigned
> >> to higher orders.
> >>
> >> In the long term, I want to unify the non-SSD to use clusters for
> >> locking and allocations as well, just try to follow the last
> >> allocation (less seeking) as much as possible.
> >
> > Hi Chris,
> >
> > I am sharing some new test results with you. This time, we used two
> > zRAM devices by modifying get_swap_pages().
> >
> > zram0 -> dedicated for order-0 swpout
> > zram1 -> dedicated for order-4 swpout
> >
> > We allocate a generous amount of space for zRAM1 to ensure it never gets full
> > and always has ample free space. However, we found that Ryan's approach
> > does not perform well even in this straightforward scenario. Despite zRAM1
> > having 80% of its space remaining, we still experience issues obtaining
> > contiguous swap slots and encounter a high swpout_fallback ratio.
> >
> > Sorry for the report, Ryan :-)
>
> No problem; clearly it needs to be fixed, and I'll help where I can. I'm pretty
> sure that this is due to fragmentation preventing clusters from being freed back
> to the free list.
>
> >
> > In contrast, with your patch, we consistently see the thp_swpout_fallback ratio
> > at 0%, indicating a significant improvement in the situation.
>
> Unless I've misunderstood something critical, Chris's change is just allowing a
> cpu to steal a block from another cpu's current cluster for that order. So it

No, that is not the main change. The main change is to allow the CPU
to allocate from the nonfull and non-empty cluster, which are not in
any CPU's current cluster, not in the empty list either. The current
patch does not prevent the CPU from stealing from the other CPU
current order. It will get addressed in V2.

> just takes longer (approx by a factor of the number of cpus in the system) to
> get to the state where fragmentation is causing fallbacks? As I said in the
> other thread, I think the more robust solution is to implement scanning for high
> order blocks.

That will introduce more fragmentation to the high order cluster, and
will make it harder to allocate high order swap entry later.

Please see my previous email for the usage case and the goal of the change.
https://lore.kernel.org/linux-mm/CANeU7QnVzqGKXp9pKDLWiuhqTvBxXupgFCRXejYhshAjw6uDyQ@xxxxxxxxxxxxxx/T/#mf431a743e458896c2ab4a4077b103341313c9cf4

Let's discuss whether the usage case and the goal makes sense or not.

Chris