Re: [PATCH v2 0/6] mm: zswap: global shrinker fix and proactive shrink

Takero Funaki <flintglass@xxxxxxxxx> · Thu, 11 Jul 2024 07:26:55 +0900

On Tue, Jul 9, 2024 at 9:53 Nhat Pham <nphamcs@xxxxxxxxx> wrote:

> > post-patch, 6.10-rc4 with patch 1 to 5
>
> You mean 1 to 6? There are 6 patches, no?

Oops, yes: with patches 1 to 6.

>
> Just out of pure curiosity, could you include the stats from patch 1-3 only?
>

I will rerun the benchmark for v3. I assume this benchmark does not
exercise patches 4 to 6, since a pool_limit_hit delta of 0 means zswap
rejected no pages.

> Ah this is interesting. Did you actually see improvement in your real
> deployment (i.e not the benchmark) with patch 4-6 in?
>

As I replied on patch 6, yes, for memory-consuming tasks such as
`apt upgrade`.

> >
> > Intended scenario for memory reclaim:
> > 1. zswap pool < accept_threshold as the initial state. This is achieved
> >    by patch 3, proactive shrinking.
> > 2. Active processes start allocating pages. Pageout is buffered by zswap
> >    without IO.
> > 3. zswap reaches shrink_start_threshold. zswap continues to buffer
> >    incoming pages and starts writeback immediately in the background.
> > 4. zswap reaches max pool size. zswap interrupts the global shrinker and
> >    starts rejecting pages. Write IO for the rejected page will consume
> >    all IO resources.
>
> This sounds like the proactive shrinker is still not aggressive
> enough, and/or there are some sort of misspecifications of the zswap
> setting... Correct me if I'm wrong, but the new proactive global
> shrinker begins 1% after the acceptance threshold, and shrinks down to
> acceptance threshold, right? How are we still hitting the pool
> limit...
>

Proactive shrinking should not be aggressive. With patches 4 and 6, I
modified the global shrinker to be less aggressive against pagein/out.
Proactive shrinking cannot avoid hitting the pool limit when memory
pressure grows faster than writeback can drain the pool.
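
To illustrate the point above, here is a rough toy model, not the kernel
logic. The function name, the one-second step, and all numbers are
assumptions made up for the example. The pool fills at a fixed rate and
background writeback only runs once the pool has reached the shrink
start threshold.

```python
def first_limit_hit(inflow_mb_s, writeback_mb_s, max_pool_mb,
                    shrink_start_mb, seconds):
    """Return the first second at which the pool is full, or None.

    Toy model (assumption): each second the pool grows by the inflow,
    then background writeback drains it if the pool has reached the
    shrink start threshold.
    """
    pool = 0.0
    for t in range(1, seconds + 1):
        pool += inflow_mb_s
        if pool >= shrink_start_mb:
            pool = max(pool - writeback_mb_s, 0.0)
        if pool >= max_pool_mb:
            return t
    return None


# Inflow faster than writeback: the limit is hit despite early shrinking.
print(first_limit_hit(100, 40, 1000, 900, 60))
# Writeback faster than inflow: the pool never fills.
print(first_limit_hit(100, 200, 1000, 900, 60))
```

In this model, lowering the shrink start threshold only delays the first
limit hit when inflow exceeds writeback; it does not prevent it.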

> My concern is that we are knowingly (and perhaps unnecessarily)
> creating an LRU inversion here - preferring swapping out the rejected
> pages over the colder pages in the zswap pool. Shouldn't it be the
> other way around? For instance, can we spiral into the following
> scenario:
>
> 1. zswap pool becomes full.
> 2. Memory is still tight, so anonymous memory will be reclaimed. zswap
> keeps rejecting incoming pages, and putting a hold on the global
> shrinker.
> 3. The pages that are swapped out are warmer than the ones stored in
> the zswap pool, so they will be more likely to be swapped in (which,
> IIUC, will also further delay the global shrinker).
>
> and the cycle keeps going on and on?

I agree this does not follow LRU, but I think the LRU priority
inversion is unavoidable once the pool limit is hit.
If the inversion matters, accept_thr_percent should be lowered to
reduce its probability. (That is why I implemented the proactive
shrinker.)

When writeback throughput is lower than the rate at which memory usage
grows, zswap_store() will have to reject pages sooner or later.
If we evict the oldest stored pages synchronously before rejecting a
new page (rotating the pool to keep LRU order), latency will depend on
how much writeback is required to store the new page. If the oldest
pages compressed well, we would have to evict many of them to
store one warmer page, which blocks reclaim progress. Fragmentation
in the zspool may also increase the required writeback amount.
We cannot both maintain LRU priority and keep pageout latency low.
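
A small worked example of the eviction count, as a sketch only. The
helper name and the compressed sizes are hypothetical, and it ignores
fragmentation, which would make the count worse.

```python
def evictions_needed(lru_sizes, needed_bytes):
    """Count how many of the oldest entries must be written back to
    free at least needed_bytes. lru_sizes lists compressed sizes in
    bytes, oldest first. Returns None if the pool cannot free enough.
    """
    freed = 0
    for count, size in enumerate(lru_sizes, start=1):
        freed += size
        if freed >= needed_bytes:
            return count
    return None


# Oldest pages compressed to 200 bytes each; new page needs 2000 bytes.
print(evictions_needed([200] * 20, 2000))
# Oldest page compressed poorly: a single eviction is enough.
print(evictions_needed([2000, 200, 200], 2000))
```

Each eviction in the first case is a separate synchronous writeback, so
storing one warmer page costs ten IOs before reclaim can progress.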

Additionally, zswap_writeback_entry() is slower than direct pageout. I
assume this is because the shrinker performs 4KB IOs synchronously. I
see shrinking throughput capped at about disk IOPS * 4KB, while much
higher throughput can be achieved with zswap disabled. Direct pageout
can be faster than zswap writeback, possibly because of bio
optimization or sequential allocation of swap.
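
The IOPS * 4KB bound can be put in numbers as below. The IOPS figure and
the pages_per_io parameter are illustrative assumptions, not
measurements from my setup; pages_per_io only stands in for whatever
batching direct pageout may get.

```python
PAGE_SIZE = 4096  # bytes, assuming 4KB pages


def writeback_throughput_mib_s(iops, pages_per_io=1):
    """Upper bound on throughput in MiB/s when each IO carries
    pages_per_io pages of PAGE_SIZE bytes."""
    return iops * pages_per_io * PAGE_SIZE / (1024 * 1024)


# One 4KB page per IO on a hypothetical 10000 IOPS device.
print(writeback_throughput_mib_s(10000))
# The same device if 32 pages were merged into each IO.
print(writeback_throughput_mib_s(10000, pages_per_io=32))
```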

> Have you experimented with synchronous reclaim in the case the pool is
> full? All the way to the acceptance threshold is too aggressive of
> course - you might need to find something in between :)
>

I am not sure which situation you have in mind.
The benchmark for patch 6 already performs synchronous reclaim with
the pool full, since bulk memory allocation (writes to an mmapped
region) is much faster than writeback throughput. The zswap pool is
filled almost instantly at the beginning of each benchmark run. I think
accept_thr_percent is not significant for this benchmark.

>
> I wonder if this contention would show up in PSI metrics
> (/proc/pressure/io, or the cgroup variants if you use them ). Maybe
> correlate reclaim counters (pgscan, zswpout, pswpout, zswpwb etc.)
> with IO pressure to show the pattern, i.e the contention problem was
> there before, and is now resolved? :)

Unfortunately, I could not find a reliable metric other than elapsed
time. It seems PSI does not distinguish stalls for rejected pageout
from stalls for shrinker writeback.
As for counters, this issue affects latency but does not increase the
number of pageins/pageouts. Is there a better way to observe the source
of the contention?
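
For reference, the kind of sampling suggested above could look like the
sketch below. The helper names are made up, and the counter list is an
assumption about what the running kernel exports in /proc/vmstat;
counters that are missing are simply skipped. As noted, this alone would
not separate the two sources of stalls.

```python
COUNTERS = ("pgscan_kswapd", "pgscan_direct", "pswpin", "pswpout",
            "zswpin", "zswpout", "zswpwb")


def parse_psi(text):
    """Parse PSI text ("some avg10=... total=...") into nested dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        result[fields[0]] = {k: float(v) for k, v in
                             (f.split("=", 1) for f in fields[1:])}
    return result


def parse_vmstat(text, names=COUNTERS):
    """Pick the selected counters out of /proc/vmstat-style text."""
    found = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in names:
            found[parts[0]] = int(parts[1])
    return found


def counter_delta(before, after):
    """Per-counter difference between two parse_vmstat() results."""
    return {k: after[k] - before[k] for k in after if k in before}


def snapshot():
    """Read the current IO pressure and reclaim counters."""
    with open("/proc/pressure/io") as f:
        psi = parse_psi(f.read())
    with open("/proc/vmstat") as f:
        vm = parse_vmstat(f.read())
    return psi, vm
```

Calling snapshot() before and after a run and passing the two counter
dicts to counter_delta() gives the counts to compare against the change
in the PSI "total" stall time.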

Thanks.