Re: [PATCH v1 0/3] mm: zswap: global shrinker fix and proactive shrink

Takero Funaki <flintglass@xxxxxxxxx> · Fri, 14 Jun 2024 13:09:18 +0900

On Fri, Jun 14, 2024 at 0:22 Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> Taking a step back from the correctness conversation, could you
> include in the changelog of the patches and cover letter a realistic
> scenario, along with user space-visible metrics that show (ideally all
> 4, but at least some of the following):
>
> 1. A user problem (that affects performance, or usability, etc.) is happening.
>
> 2. The root cause is what we are trying to fix (for e.g in patch 1, we
> are skipping over memcgs unnecessarily in the global shrinker loop).
>
> 3. The fix alleviates the root cause in 2)
>
> 4. The userspace-visible problem goes away or is less serious.
>

Thank you for your suggestions.
Here is a quick response before I submit v2.

1.
The visible issue is that pageout/in operations from active processes
are slow when zswap is near its max pool size. This is particularly
significant on small memory systems, where total swap usage exceeds
what zswap can store. This means that old pages occupy most of the
zswap pool space, and recent pages use swap disk directly.

2.
This issue is caused by zswap keeping the pool size near 100%. Because
the shrinker fails to shrink the pool down to accept_threshold_percent,
zswap keeps rejecting incoming pages, and rejection occurs more
frequently than it should. The rejected pages are written directly to
disk while zswap protects old pages from eviction, leading to slow
pageout/in performance for recent pages on the swap disk.

3.
If the pool size were shrunk proactively, rejection by pool limit hits
would be less likely. New incoming pages could be accepted as the pool
gains some space in advance, while older pages are written back in the
background. zswap would then be filled with recent pages, as expected
in the LRU logic.

Patches 1 and 2 make the shrinker reduce the pool to
accept_threshold_percent. Patch 3 makes zswap_store() trigger the
shrinker before the pool reaches its max size. With this series, zswap
prepares some space in advance to reduce the probability of problematic
pool_limit_hit situations, thus reducing slow reclaim and the page
priority inversion against LRU.
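
To make the headroom argument concrete, here is a minimal sketch of the
arithmetic, not kernel code. It assumes 4 KiB pages, assumes the accept
threshold is a percentage of the max pool size, and pretends that each
incoming page consumes one pool page (compression is ignored). The
function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def zswap_thresholds(total_ram_bytes, max_pool_percent, accept_threshold_percent):
    """Return (max_pages, accept_pages) for the pool, in pages.

    Assumes the max pool size is a percentage of total RAM and the
    accept threshold is a percentage of that max pool size.
    """
    total_pages = total_ram_bytes // PAGE_SIZE
    max_pages = total_pages * max_pool_percent // 100
    accept_pages = max_pages * accept_threshold_percent // 100
    return max_pages, accept_pages


def rejected_in_burst(burst_pages, pool_pages, max_pages):
    """Toy model: pages of a reclaim burst that do not fit in the pool.

    Ignores compression and concurrent writeback, so it only shows how
    the free space at the start of the burst bounds the rejections.
    """
    free_pages = max(0, max_pages - pool_pages)
    return max(0, burst_pages - free_pages)


# 1 GiB of RAM, max_pool_percent=25, accept threshold 50%.
max_pages, accept_pages = zswap_thresholds(1 << 30, 25, 50)

# Pool stuck at 100%: the whole burst is rejected and goes to disk.
stuck = rejected_in_burst(20000, max_pages, max_pages)

# Pool shrunk to the accept threshold beforehand: the burst fits.
shrunk = rejected_in_burst(20000, accept_pages, max_pages)
```

In this toy model the burst is rejected entirely when the pool sits at
its limit, and not at all once the pool has been shrunk to the accept
threshold, as long as the burst is smaller than the prepared space.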

4.
Once proactive shrinking has reduced the pool size, pageouts complete
almost instantly as long as the space prepared by shrinking can hold
the pages coming from direct reclaim. If an admin sees a large
pool_limit_hit count, lowering accept_threshold_percent will improve
active process performance.
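
As a rough sketch of how an admin could watch this, the snippet below
reads the pool_limit_hit counter and the two module parameters. It
assumes debugfs is mounted at /sys/kernel/debug and that the caller may
read it (usually root only). The helper names are hypothetical.

```python
from pathlib import Path


def read_int(path):
    """Read a single integer from a sysfs/debugfs style file."""
    return int(Path(path).read_text().strip())


def zswap_status(debugfs_dir="/sys/kernel/debug/zswap",
                 params_dir="/sys/module/zswap/parameters"):
    """Collect the counter and tunables discussed above.

    The default paths are assumptions about a typical system; pass
    other directories if debugfs is mounted elsewhere.
    """
    return {
        "pool_limit_hit": read_int(Path(debugfs_dir) / "pool_limit_hit"),
        "accept_threshold_percent": read_int(
            Path(params_dir) / "accept_threshold_percent"),
        "max_pool_percent": read_int(Path(params_dir) / "max_pool_percent"),
    }


def limit_hits_between(before, after):
    """Number of pool limit hits that occurred between two snapshots."""
    return after["pool_limit_hit"] - before["pool_limit_hit"]
```

Taking one snapshot before and one after a workload, then calling
limit_hits_between(), gives the number of limit hits attributable to
that run rather than the total since boot.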

> I have already hinted in a previous response, but global shrinker is
> rarely triggered in production. There are lots of factors that would
> prevent this from triggering:
>
> 1. Max zswap pool size 20% of memory by default, which is a lot.
>
> 2. Swapfile size also limits the size of the amount of data storable
> in the zswap pool.
>
> 3. Other cgroup constraints (memory.max, memory.zswap.max,
> memory.swap.max) also limit a cgroup's zswap usage.
>
> I do agree that patch 1 at least is fixing a problem, and probably
> patch 2 too but please justify why we are investing in the extra
> complexity to fix this problem in the first place.

Regarding the production workload, I believe we are facing different situations.

My intended workload is low-activity services distributed on small
systems such as t2.nano, with 0.5G to 1G of RAM. There are a
significant number of pool_limit_hits, and the zswap pool usage
sometimes stays near 100%, filled by background service processes.

When I evaluated zswap and zramswap, zswap performed well. I suppose
this was because of its LRU. Once old pages occupied zramswap, there
was no way to move pages from zramswap to the swap disk. zswap could
adapt to this situation by writing back old pages in LRU order. We
could benefit from compressed swap storing recent pages and also
utilize overcommit backed by a large swap on disk.

I think if a system has never seen a pool limit hit, there is no need
to manage the costly LRU. In such cases, I will try zramswap.

I am developing these patches by artificially allocating a large
amount of memory to fill the zswap pool (emulating background
services), and then invoking another allocation to trigger direct
reclaim (emulating a short-lived active process).
On real devices, I am seeing far fewer pool_limit_hit events with a
lower accept_threshold_percent=50 and a higher max_pool_percent=25.
Note that before patch 3, zswap_store() did not properly count some of
the rejections as pool_limit_hit, so the counter was underestimated. We
cannot compare the counter directly across the two kernels.
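
For illustration only, the test described above could be approximated
with something like the following sketch. This is not the script I
actually use; the function names and the sizes passed in are
hypothetical and must be tuned to the machine so that the first
allocation really spills into swap.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE


def allocate_and_touch(size_bytes):
    """Map anonymous memory and write one byte per page.

    Writing to every page forces the kernel to back it with real
    memory, so the allocation creates actual memory pressure.
    """
    buf = mmap.mmap(-1, size_bytes)
    for offset in range(0, size_bytes, PAGE_SIZE):
        buf[offset] = 1
    return buf


def run_scenario(background_bytes, foreground_bytes):
    """Emulate idle background services, then a short-lived process."""
    # Large, idle allocation: its pages become the old entries that
    # fill the compressed pool.
    background = allocate_and_touch(background_bytes)
    # Second allocation: intended to trigger direct reclaim while the
    # pool is still occupied by the background pages.
    foreground = allocate_and_touch(foreground_bytes)
    return background, foreground
```

The caller should keep the returned buffers alive for the duration of
the measurement and compare pool_limit_hit before and after the run.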

I will try to create a more reproducible and realistic benchmark, but
it might not be representative of your workload.