Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order

Kairui Song <ryncsn@xxxxxxxxx> · Mon, 19 Aug 2024 00:59:41 +0800

On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >
> > Chris Li <chrisl@xxxxxxxxxx> writes:
> >
> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > >>
> > >> Hi, Chris,
> > >>
> > >> Chris Li <chrisl@xxxxxxxxxx> writes:
> > >>
> > >> > This is the short term solution "swap cluster order" listed
> > >> > in slide 8 of my "Swap Abstraction" discussion at the recent
> > >> > LSF/MM conference.
> > >> >
> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > >> > orders" was introduced, it only allocated the mTHP swap entries
> > >> > from the new empty cluster list.  It has a fragmentation issue
> > >> > reported by Barry.
> > >> >
> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
> > >> >
> > >> > The reason is that all the empty clusters have been exhausted while
> > >> > there are plenty of free swap entries in clusters that are
> > >> > not 100% free.
> > >> >
> > >> > Remember the swap allocation order in the cluster.
> > >> > Keep track of the per order non full cluster list for later allocation.
> > >> >
> > >> > This series gives the swap SSD allocation a new separate code path
> > >> > from the HDD allocation. The new allocator uses the cluster list only
> > >> > and no longer scans swap_map[] globally without a lock.
> > >>
> > >> This sounds good.  Can we use SSD allocation method for HDD too?
> > >> We may not need a swap entry allocator optimized for HDD.
> > >
> > > Yes, that is the plan as well. That way we can completely get rid of
> > > the old scan_swap_map_slots() code.
> >
> > Good!
> >
> > > However, considering the size of the series, let's focus on the
> > > cluster allocation path first, get it tested and reviewed.
> >
> > OK.
> >
> > > For HDD optimization, mostly just the new block allocations portion
> > > needs some separate code path from the new cluster allocator to not do
> > > the per cpu allocation.  Allocating from the non free list doesn't
> > > need to change too
> >
> > I suggest not considering HDD optimization at all.  Just use the SSD
> > algorithm to simplify.
>
> Adding a global next allocating CI rather than the per CPU next CI
> pointer is pretty trivial as well. It is just a different way to fetch
> the next cluster pointer.

Yes, if we enable the new cluster based allocator for HDD, we can
enable THP and mTHP for HDD too, and use a global cluster_next instead
of a per-CPU one for it.
It's easy to do with minimal changes, and should actually boost
performance for HDD swap. I'm currently testing this locally.
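
To illustrate the idea, here is a minimal userspace sketch (not the
actual patch). All names below (sketch_swap_dev, sketch_next_slot, the
SKETCH_* constants) are hypothetical, and keeping one hint per order is
an assumption made only for this example. It shows that choosing between
a global and a per-CPU "next cluster" hint is just a different way to
fetch the pointer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS    4   /* hypothetical, for illustration only */
#define SKETCH_NR_ORDERS  10  /* hypothetical number of mTHP orders */
#define SKETCH_NO_CLUSTER ((unsigned int)-1)

/* Hypothetical device state; this is NOT the kernel's swap_info_struct. */
struct sketch_swap_dev {
	bool rotational;	/* true for HDD, false for SSD */
	unsigned int global_next[SKETCH_NR_ORDERS];
	unsigned int percpu_next[SKETCH_NR_CPUS][SKETCH_NR_ORDERS];
};

/* Mark every hint as "no cluster chosen yet". */
static void sketch_init(struct sketch_swap_dev *dev, bool rotational)
{
	dev->rotational = rotational;
	for (int order = 0; order < SKETCH_NR_ORDERS; order++) {
		dev->global_next[order] = SKETCH_NO_CLUSTER;
		for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
			dev->percpu_next[cpu][order] = SKETCH_NO_CLUSTER;
	}
}

/*
 * Return the slot that stores the next-cluster hint for this allocation.
 * HDD: one shared slot per order, so all CPUs continue in the same cluster.
 * SSD: one slot per CPU and order, so CPUs do not share a hint.
 */
static unsigned int *sketch_next_slot(struct sketch_swap_dev *dev,
				      int cpu, int order)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	assert(order >= 0 && order < SKETCH_NR_ORDERS);

	if (dev->rotational)
		return &dev->global_next[order];
	return &dev->percpu_next[cpu][order];
}
```

In this sketch the rest of the allocation path would only see the
returned slot, so it would not need to know which kind of device it is
serving. Locking around the shared slot is left out on purpose.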

> > >>
> > >> Hi, Hugh,
> > >>
> > >> What do you think about this?
> > >>
> > >> > This streamlines the swap allocation for SSD. The code matches the
> > >> > execution flow much better.
> > >> >
> > >> > User impact: For users that allocate and free mixed-order mTHP swap,
> > >> > it greatly improves the success rate of the mTHP swap allocation after the
> > >> > initial phase.
> > >> >
> > >> > It also performs faster when the swapfile is close to full, because the
> > >> > allocator can get the non full cluster from a list rather than scanning
> > >> > a lot of swap_map entries.
> > >>
> > >> Do you have some test results to prove this?  Or which test below can
> > >> prove this?
> > >
> > > The two zram tests are already proving this. The system time
> > > improvement is about 2% on my low CPU count machine.
> > > Kairui has a higher core count machine and the difference is higher
> > > there. The theory is that higher CPU count has higher contentions.
> >
> > I will interpret this as the performance is better in theory.  But
> > there are almost no measurable results so far.
>
> I am trying to understand why you don't see the performance improvement in
> the zram setup in my cover letter as a measurable result?

Hi Ying, you can check the test with the 32-core AMD machine in the
cover letter. As Chris pointed out, the performance gain is higher as
the core count grows. The performance gain is still not much (*yet;
based on this design, things can go much faster after the HDD code is
dropped, which enables many other optimizations; this series
is mainly focusing on the fragmentation issue), but I think a
stable ~4 - 8% improvement with a Linux kernel build test
could be considered measurable?