Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink

Nhat Pham <nphamcs@xxxxxxxxx> · Fri, 12 Jul 2024 16:02:27 -0700

On Wed, Jul 10, 2024 at 3:27 PM Takero Funaki <flintglass@xxxxxxxxx> wrote:
>
> 2024年7月9日(火) 9:53 Nhat Pham <nphamcs@xxxxxxxxx>:
>
> > > post-patch, 6.10-rc4 with patch 1 to 5
> >
> > You mean 1 to 6? There are 6 patches, no?
>
> oops. with patches 1 to 6.
>
> >
> > Just out of pure curiosity, could you include the stats from patch 1-3 only?
> >
>
> I will rerun the bench in v3. I assume this bench does not reflect
> patches 4 to 6, as delta pool_limit_hit=0 means no rejection from
> zswap.
>
> > Ah this is interesting. Did you actually see improvement in your real
> > deployment (i.e not the benchmark) with patch 4-6 in?
> >
>
> As replied in patch 6, memory consuming tasks like `apt upgrade` for instance.
>
> > >
> > > Intended scenario for memory reclaim:
> > > 1. zswap pool < accept_threshold as the initial state. This is achieved
> > >    by patch 3, proactive shrinking.
> > > 2. Active processes start allocating pages. Pageout is buffered by zswap
> > >    without IO.
> > > 3. zswap reaches shrink_start_threshold. zswap continues to buffer
> > >    incoming pages and starts writeback immediately in the background.
> > > 4. zswap reaches max pool size. zswap interrupts the global shrinker and
> > >    starts rejecting pages. Write IO for the rejected page will consume
> > >    all IO resources.
> >
> > This sounds like the proactive shrinker is still not aggressive
> > enough, and/or there are some sort of misspecifications of the zswap
> > setting... Correct me if I'm wrong, but the new proactive global
> > shrinker begins 1% after the acceptance threshold, and shrinks down to
> > acceptance threshold, right? How are we still hitting the pool
> > limit...
> >
>
> Proactive shrinking should not be aggressive. With patches 4 and 6, I
> modified the global shrinker to be less aggressive against pagein/out.
> Shrinking proactively cannot avoid hitting the pool limit when memory
> pressure grows faster.
>
> > My concern is that we are knowingly (and perhaps unnecessarily)
> > creating an LRU inversion here - preferring swapping out the rejected
> > pages over the colder pages in the zswap pool. Shouldn't it be the
> > other way around? For instance, can we spiral into the following
> > scenario:
> >
> > 1. zswap pool becomes full.
> > 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> > keeps rejecting incoming pages, and putting a hold on the global
> > shrinker.
> > 3. The pages that are swapped out are warmer than the ones stored in
> > the zswap pool, so they will be more likely to be swapped in (which,
> > IIUC, will also further delay the global shrinker).
> >
> > and the cycle keeps going on and on?
>
> I agree this does not follow LRU, but I think the LRU priority
> inversion is unavoidable once the pool limit is hit.
> The accept_thr_percent should be lowered to reduce the probability of
> LRU inversion if it matters. (it is why I implemented proactive
> shrinker.)

And yet, in your own benchmark it fails to prevent that, no? I think
you lower it all the way down to 50%.

>
> When the writeback throughput is slower than memory usage grows,
> zswap_store() will have to reject pages sooner or later.
> If we evict the oldest stored pages synchronously before rejecting a
> new page (rotating pool to keep LRU), it will affect latency depending
> how much writeback is required to store the new page. If the oldest
> pages were compressed well, we would have to evict too many pages to
> store a warmer page, which blocks the reclaim progress. Fragmentation
> in the zspool may also increase the required writeback amount.
> We cannot accomplish both maintaining LRU priority and maintaining
> pageout latency.

Hmm yeah, I guess this is fair. Looks like there is not a lot of
choice, if you want to maintain decent pageout latency...

I could suggest that you have a budgeted zswap writeback on zswap
store - i.e if the pool is full, then try to zswap writeback until we
have enough space or if the budget is reached. But that feels like
even more engineering - the IO priority approach might even be easier
at that point LOL.

Oh well, global shrinker delay it is :)

>
> Additionally, zswap_writeback_entry() is slower than direct pageout. I
> assume this is because shrinker performs 4KB IO synchronously. I am
> seeing shrinking throughput is limited by disk IOPS * 4KB while much
> higher throughput can be achieved by disabling zswap. direct pageout
> can be faster than zswap writeback, possibly because of bio
> optimization or sequential allocation of swap.

Hah, this is interesting!

I wonder though, if the solution here is to perform some sort of
batching for zswap writeback.

BTW, what is the type of the storage device you are using for swap? Is
it SSD or HDD etc?

>
>
> > Have you experimented with synchronous reclaim in the case the pool is
> > full? All the way to the acceptance threshold is too aggressive of
> > course - you might need to find something in between :)
> >
>
> I don't get what the expected situation is.
> The benchmark of patch 6 is performing synchronous reclaim in the case
> the pool is full, since bulk memory allocation (write to mmapped
> space) is much faster than writeback throughput. The zswap pool is
> filled instantly at the beginning of benchmark runs. The
> accept_thr_percent is not significant for the benchmark, I think.

No. I meant synchronous reclaim as in triggering zswap writeback
within the zswap store path, to make space for the incoming new zswap
pages. But you already addressed it above :)

>
>
> >
> > I wonder if this contention would show up in PSI metrics
> > (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> > correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> > with IO pressure to show the pattern, i.e the contention problem was
> > there before, and is now resolved? :)
>
> Unfortunately, I could not find a reliable metric other than elapsed
> time. It seems PSI does not distinguish stalls for rejected pageout
> from stalls for shrinker writeback.
> For counters, this issue affects latency but does not increase the
> number of pagein/out. Is there any better way to observe the origin of
> contention?
>
> Thanks.