Re: [PATCHv2] zsmalloc: allow only one active pool compaction context

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Mon, 17 Apr 2023 19:53:21 -0700

On Mon, Apr 17, 2023 at 5:41 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx> wrote:
>
> > zsmalloc pool can be compacted concurrently by many contexts,
> > e.g.
> >
> >  cc1 handle_mm_fault()
> >       do_anonymous_page()
> >        __alloc_pages_slowpath()
> >         try_to_free_pages()
> >          do_try_to_free_pages(
> >           lru_gen_shrink_node()
> >            shrink_slab()
> >             do_shrink_slab()
> >              zs_shrinker_scan()
> >               zs_compact()
> >
> > This creates unnecessary contention as all those processes
> > compete for access to the same classes. A single compaction
> > process is enough. Moreover contention that is created by
> > multiple compaction processes impact other zsmalloc functions,
> > e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
> > synchronize access to pool.
> >
> > Introduce pool compaction mutex and permit only one compaction
> > context at a time. This reduces overall pool->lock contention.
>
> That isn't what the patch does!  Perhaps an earlier version used a mutex?

Yup that's from v1. Thanks for catching that.

>
> > /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> > &pool->lock#3:
> >
> >                 Base           Patched
> > ------------------------------------------
> > con-bounces     2035730        1540066
> > contentions     2343871        1774348
> > waittime-min    0.10           0.10
> > waittime-max    4004216.24     2745.22
> > waittime-total  101334168.29   67865414.91
> > waittime-avg    43.23          38.25
> > acq-bounces     2895765        2186745
> > acquisitions    6247686        5136943
> > holdtime-min    0.07           0.07
> > holdtime-max    2605507.97     482439.16
> > holdtime-total  9998599.59     5107151.01
> > holdtime-avg    1.60           0.99
> >
> > Test run time:
> > Base
> > 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
> >
> > Patched
> > 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
> >
> > ...
> >
> > @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> >       struct size_class *class;
> >       unsigned long pages_freed = 0;
> >
> > +     if (atomic_xchg(&pool->compaction_in_progress, 1))
> > +             return 0;
> > +
>
> A code comment might be appropriate here.
>
> Is the spin_is_contended() test in __zs_compact() still relevant?

It is, as pool->lock is used to synchronize a lot of operations, not
just compaction.

>
> And....  single-threading the operation seems a pretty sad way of
> addressing a contention issue.  zs_compact() is fairly computationally
> expensive - surely a large machine would like to be able to
> concurrently run many instances of zs_compact()?

Yeah that was my initial thought as well, to attack this problem at a
larger scale and break down the pool->lock into class->lock. AFAICT,
zsmalloc pages/objects cannot change classes during their lifetime, so
this sounds reasonable initially.

The problem is that the pool->lock is used to synchronize ~all
zsmalloc operations:  zs_map_object(), zs_malloc(), zs_free(),
async_free_zspage(), zs_compact(), zs_reclaim_page(),
zs_page_migrate().

For some of those, changing a pool->lock into a class->lock is fairly
straightforward, specifically the ones where we operate on a
page/zspage (e.g. zs_compact()).

The problem is for operations where we operate on an object (e.g.
zs_map_object()). At this point, we only have a handle that we need to
use to find the underlying object. We have no idea what class the
object belongs to, and finding an object given a handle needs
synchronization as the handle<->object mapping can change by
compaction or page migration.

We can store the class somewhere and pay an extra byte per compressed
object (which doesn't sound too bad) -- but even then we'd have
trouble finding the right place to stash it. Perhaps we can enforce
some alignment on the objects and use the lowermost 8 bits of the
handle pointer, which also means we don't need to pay any extra
overhead, but I am not sure if this is practical.

We can also store the class at the first byte of the object, and
guarantee that through migration/compaction the first byte always
remains a valid class. We can then read the class from the handle
locklessly, acquire the class lock, and then make sure the handle is
still the same.

These are all ideas I am throwing in the air, I didn't think about
them thoroughly or try implementing them.

TLDR: I believe it is possible (in theory at least) to break down the
global pool->lock into class locks, which should benefit virtually
~all zsmalloc concurrent operations, not just compaction. It is not
easy though.

As for this patch, I personally do not observe a lot of compaction in
our production environment, and allowing one thread to perform
compaction while others move on with their lives can be better than
having all of them continuously contending for the pool->lock, which
means more contention with ~all zsmalloc operations, not just
concurrent compactors. I can't say for sure that this is an
improvement, but I *believe* it is.

I will shut up now and leave it up to Sergey to decide how to move
forward, sorry for the very long response :)