On (25/02/06 14:55), Kairui Song wrote:
> > On (25/02/01 17:21), Kairui Song wrote:
> > > This seems like it will cause a huge performance regression on multi core
> > > systems, especially significant as the number of concurrent
> > > tasks increases:
> > >
> > > Test build linux kernel using ZRAM as SWAP (1G memcg):
> > >
> > > Before:
> > > + /usr/bin/time make -s -j48
> > > 2495.77user 2604.77system 2:12.95elapsed 3836%CPU (0avgtext+0avgdata
> > > 863304maxresident)k
> > >
> > > After:
> > > + /usr/bin/time make -s -j48
> > > 2403.60user 6676.09system 3:38.22elapsed 4160%CPU (0avgtext+0avgdata
> > > 863276maxresident)k
> >
> > How many CPUs do you have? I assume preemption gets in the way, which is
> > sort of expected, to be honest... Using per-CPU compression streams
> > disables preemption and uses the CPU exclusively, at the price of other
> > tasks not being able to run. I do tend to think that I made a mistake by
> > switching zram to per-CPU compression streams.
> >
> > What preemption model do you use and to what extent do you overload
> > your system?
> >
> > My tests don't show anything unusual (but I don't overload the system)
> >
> > CONFIG_PREEMPT
>
> I'm using CONFIG_PREEMPT_VOLUNTARY=y, and there are 96 logical CPUs
> (48c96t), so make -j48 shouldn't be considered overload, I think. make
> -j32 also showed an obvious slowdown.

Hmm, there should be more than enough compression streams then, the
limit is num_online_cpus().  That's strange.  I wonder if that's the
zsmalloc handle allocation ("remove two-staged handle allocation" in
the series).

[..]

> > Hmm, it's just
> >
> > spin_lock()
> > list first entry
> > spin_unlock()
> >
> > It shouldn't be "a big spin lock", that's very odd. I'm not familiar with
> > perf lock contention, let me take a look.
>
> I can debug this a bit more to figure out why the contention is huge
> later

That will be appreciated, thank you.

> but my first thought is that, as Yosry also mentioned in
> another reply, making it preemptible doesn't necessarily mean the per
> CPU stream has to be gone.

I was going to reply to Yosry's email today/tomorrow, didn't have time
to look into it yet, but will reply here.

So for the spin-lock contention - yes, but that lock really should not
be so visible.  Other than that, we limit the number of compression
streams to the number of CPUs and permit preemption, so it should be
roughly the same as "preemptible per-CPU" streams.

The difference, perhaps, is that we don't pre-allocate streams, but
allocate them only as needed.  This has two sides: later allocations
can fail, but we also don't allocate streams that we never use,
especially secondary streams (priority 1 and 2, which are used for
recompression).

I didn't know it was possible to use per-CPU data and still keep
preemption enabled at the same time, so I'm not opposed to the idea of
keeping per-CPU streams and doing what the zswap folks did.
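
For the record, my rough understanding of the zswap-style approach is
something like the sketch below (the names are made up for illustration,
this is not the actual zswap or zram code): the per-CPU lookup only
picks a stream, the actual exclusion comes from a per-stream mutex, so
the holder stays preemptible and can sleep:

#include <linux/cpumask.h>
#include <linux/mutex.h>
#include <linux/percpu.h>

struct zcomp_stream {
	struct mutex lock;	/* serializes users of this stream */
	void *buffer;		/* compression working buffer */
	/* compression backend context, etc. */
};

static DEFINE_PER_CPU(struct zcomp_stream, zcomp_streams);

/* called once during device init: per-CPU mutexes need runtime init */
static void zcomp_streams_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		mutex_init(&per_cpu_ptr(&zcomp_streams, cpu)->lock);
}

static struct zcomp_stream *zcomp_stream_get(void)
{
	/*
	 * raw_cpu_ptr() only selects a stream, it does not disable
	 * preemption; the mutex provides the exclusion, so the caller
	 * stays preemptible and can sleep during compression.
	 */
	struct zcomp_stream *zstrm = raw_cpu_ptr(&zcomp_streams);

	mutex_lock(&zstrm->lock);
	return zstrm;
}

static void zcomp_stream_put(struct zcomp_stream *zstrm)
{
	mutex_unlock(&zstrm->lock);
}

If the task migrates after raw_cpu_ptr(), it may end up using another
CPU's stream, which should still be correct because the mutex
serializes access; the per-CPU part is only there to spread contention
across streams.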