System responsiveness issue (patches 4 to 6)
===============================

After applying patches 1 to 3, I encountered severe responsiveness
degradation while the zswap global shrinker was running under heavy
memory pressure.

Visible issue to resolve
-------------------------------

The visible issue happens with patches 1 to 3 applied when a large
amount of memory allocation occurs and zswap cannot store the incoming
pages. While the global shrinker is writing back pages, the system stops
responding as if under heavy memory thrashing. This issue is less likely
to happen without patches 1 to 3 or when zswap is disabled. I believe
this is because the global shrinker could not write back a meaningful
amount of pages, as described in patch 2.

Root cause and changes to resolve the issue
-------------------------------

It seems that the zswap shrinker blocking IO needed by memory reclaim
and page faults is the root cause of this responsiveness issue. I
introduced three patches to reduce possible blocking in the following
problematic situations:

1. Contention on workqueue thread pools caused by shrink_worker() using
   WQ_MEM_RECLAIM unnecessarily.

   Although the shrinker runs concurrently with memory reclaim,
   shrinking is not required for reclaim to make progress, since
   zswap_store() can reject pages without interfering with memory
   reclaim. shrink_worker() should not use WQ_MEM_RECLAIM and should be
   deferred while other work on WQ_MEM_RECLAIM is reclaiming memory.
   The existing code requires allocating memory inside shrink_worker(),
   potentially blocking other latency-sensitive reclaim work.

2. Contention on swap IO.

   Since zswap_writeback_entry() performs write IO in 4KB pages, it
   consumes a lot of IOPS, increasing the IO latency of swapout/swapin.
   We should not perform IO for background shrinking while zswap_store()
   is rejecting pages or zswap_load() is failing to find stored pages.
   This series implements two mitigations to reduce the IO contention:

2-a. Do not reject pages in zswap_store().
     This is mostly achieved by patch 3. With patch 3, zswap can prepare
     space proactively and accept pages while the global shrinker is
     running.

     To reduce rejection further, patch 5 (store incompressible pages)
     is added. It reduces rejection by storing incompressible pages
     as-is. When zsmalloc is used, we can accept incompressible pages
     with small memory overhead. It is a minor optimization, but I think
     it is worth implementing. It does not improve performance with the
     current zbud, but it does not incur a performance penalty either.

2-b. Interrupt writeback while pagein/pageout is in progress.

     Once zswap runs out of prepared space, it cannot accept incoming
     pages, incurring direct writes to the swap disk. At this moment,
     the shrinker is proactively evicting pages, leading to IO
     contention with memory reclaim. Performing low-priority IO would be
     straightforward but would require reimplementing a low-priority
     version of __swap_writepage(). Instead, in patch 6, I implemented a
     heuristic that delays the next zswap writeback based on the elapsed
     time since zswap_store() last rejected a page.

     When zswap_store() hits the max pool size and rejects pages,
     swap_writepage() immediately performs the writeback to disk. The
     jiffies value at that moment is saved to tell shrink_worker() to
     sleep for up to ZSWAP_GLOBAL_SHRINK_DELAY msec. The same logic is
     applied to zswap_load(): when zswap cannot find a page in the
     stored pool, the pagein requires read IO from the swap device, so
     the global shrinker should be interrupted here as well. This patch
     proposes a constant delay of 500 milliseconds, aligning with the
     mq-deadline target latency.

Visible change
-------------------------------

With patches 4 to 6, the global shrinker pauses writeback while
pagein/pageout operations are using the swap device. This change reduces
resource contention and makes memory reclaim/faults complete faster,
thereby reducing system responsiveness degradation.

Intended scenario for memory reclaim:

1. zswap pool < accept_threshold as the initial state. This is achieved
   by patch 3, proactive shrinking.
2. Active processes start allocating pages. Pageout is buffered by zswap
   without IO.
3. zswap reaches the shrink_start_threshold. zswap continues to buffer
   incoming pages and starts writeback immediately in the background.
4. zswap reaches the max pool size. zswap interrupts the global shrinker
   and starts rejecting pages. Write IO for the rejected pages consumes
   all IO resources.
5. Active processes stop allocating pages. After the delay, the shrinker
   resumes writeback until the pool falls below the accept threshold.

Benchmark
-------------------------------

To demonstrate that the shrinker writeback does not interfere with
pagein/pageout operations, I measured the elapsed time of allocating 2GB
of 3/4-compressible data with a Python script, averaged over 10 runs
(times in seconds):

|                      | elapsed | user  | sys   |
|----------------------|---------|-------|-------|
| With patches 1 to 3  | 13.10   | 0.183 | 2.049 |
| With all patches     | 11.17   | 0.116 | 1.490 |
| zswap off (baseline) | 11.81   | 0.149 | 1.381 |

Although this test cannot distinguish responsiveness issues caused by
zswap writeback from normal memory thrashing between plain
pagein/pageout, the difference from the baseline indicates that the
patches reduced the performance degradation on pageout caused by zswap
writeback.

The tests were run on kernel 6.10-rc5 on a VM with 1GB RAM (an idling
Azure VM with a persistent block swap device), 2 vCPUs, zsmalloc/lz4, a
25% max pool, and a 50% accept threshold.

---

Takero Funaki (6):
  mm: zswap: fix global shrinker memcg iteration
  mm: zswap: fix global shrinker error handling logic
  mm: zswap: proactive shrinking before pool size limit is hit
  mm: zswap: make writeback run in the background
  mm: zswap: store incompressible page as-is
  mm: zswap: interrupt shrinker writeback while pagein/out IO

 Documentation/admin-guide/mm/zswap.rst |  17 +-
 mm/zswap.c                             | 264 ++++++++++++++++++++-----
 2 files changed, 219 insertions(+), 62 deletions(-)

--
2.43.0