System responsiveness issue (patches 4 to 6)
===============================

After applying patches 1 to 3, I encountered severe responsiveness
degradation while the zswap global shrinker was running under heavy
memory pressure.

Visible issue to resolve
-------------------------------

The visible issue happens with patches 1 to 3 applied when a large
amount of memory allocation occurs and zswap cannot store the incoming
pages. While the global shrinker is writing back pages, the system stops
responding as if under heavy memory thrashing. This issue is less likely
to happen without patches 1 to 3 or when zswap is disabled. I believe
this is because the global shrinker could not write back a meaningful
amount of pages, as described in patch 2.

Root cause and changes to resolve the issue
-------------------------------

It seems that the zswap shrinker blocking IO needed by memory reclaim
and page faults is the root cause of this responsiveness issue. I
introduced three patches to reduce possible blocking in the following
problematic situations:

1. Contention on workqueue thread pools caused by shrink_worker() using
   WQ_MEM_RECLAIM unnecessarily.

   Although the shrinker runs concurrently with memory reclaim,
   shrinking is not required for reclaim to make progress, since
   zswap_store() can reject pages without interfering with memory
   reclaim. shrink_worker() should not use WQ_MEM_RECLAIM and should be
   deferred while other work on WQ_MEM_RECLAIM is reclaiming memory.
   The existing code requires allocating memory inside shrink_worker(),
   potentially blocking other latency-sensitive reclaim work.

2. Contention on swap IO.

   Since zswap_writeback_entry() performs write IO in 4KB pages, it
   consumes a lot of IOPS, increasing the IO latency of swapout/swapin.
   We should not perform IO for background shrinking while zswap_store()
   is rejecting pages or zswap_load() is failing to find stored pages.
   This series implements two mitigations to reduce the IO contention:

2-a. Do not reject pages in zswap_store().
     This is mostly achieved by patch 3. With patch 3, zswap can prepare
     space proactively and accept pages while the global shrinker is
     running.

     To reduce rejection further, patch 5 (store incompressible pages)
     is added. It reduces rejection by storing incompressible pages
     as-is. When zsmalloc is used, we can accept incompressible pages
     with small memory overhead. It is a minor optimization, but I think
     it is worth implementing. It does not improve performance with the
     current zbud, but it does not incur a performance penalty either.

2-b. Interrupt writeback while pagein/pageout is in progress.

     Once zswap runs out of prepared space, it cannot accept incoming
     pages, incurring direct writes to the swap disk. At this moment,
     the shrinker is proactively evicting pages, leading to IO
     contention with memory reclaim. Performing low-priority IO would be
     straightforward but would require reimplementing a low-priority
     version of __swap_writepage(). Instead, in patch 6, I implemented a
     heuristic that delays the next zswap writeback based on the elapsed
     time since zswap_store() last rejected a page.

     When zswap_store() hits the max pool size and rejects pages,
     swap_writepage() immediately performs the writeback to disk. The
     jiffies value at that moment is saved to tell shrink_worker() to
     sleep for up to ZSWAP_GLOBAL_SHRINK_DELAY msec. The same logic is
     applied to zswap_load(): when zswap cannot find a page in the
     stored pool, the pagein requires read IO from the swap device, so
     the global shrinker should be interrupted here as well. This patch
     proposes a constant delay of 500 milliseconds, aligning with the
     mq-deadline target latency.

Visible change
-------------------------------

With patches 4 to 6, the global shrinker pauses writeback while
pagein/pageout operations are using the swap device. This change reduces
resource contention and makes memory reclaim/faults complete faster,
thereby reducing system responsiveness degradation.

Intended scenario for memory reclaim:

1. zswap pool < accept_threshold as the initial state. This is achieved
   by patch 3, proactive shrinking.
2. Active processes start allocating pages. Pageout is buffered by zswap
   without IO.
3. zswap reaches the shrink_start_threshold. zswap continues to buffer
   incoming pages and starts writeback immediately in the background.
4. zswap reaches the max pool size. zswap interrupts the global shrinker
   and starts rejecting pages. Write IO for the rejected pages consumes
   all IO resources.
5. Active processes stop allocating pages. After the delay, the shrinker
   resumes writeback until the pool falls below the accept threshold.

Benchmark
-------------------------------

To demonstrate that the shrinker writeback does not interfere with
pagein/pageout operations, I measured the elapsed time of allocating 2GB
of 3/4-compressible data with a Python script, averaged over 10 runs
(times in seconds):

|                      | elapsed | user  | sys   |
|----------------------|---------|-------|-------|
| With patches 1 to 3  | 13.10   | 0.183 | 2.049 |
| With all patches     | 11.17   | 0.116 | 1.490 |
| zswap off (baseline) | 11.81   | 0.149 | 1.381 |

Although this test cannot distinguish responsiveness issues caused by
zswap writeback from normal memory thrashing between plain
pagein/pageout, the difference from the baseline indicates that the
patches reduced the performance degradation on pageout caused by zswap
writeback.

The tests were run on kernel 6.10-rc5 on a VM with 1GB RAM (an idling
Azure VM with a persistent block swap device), 2 vCPUs, zsmalloc/lz4, a
25% max pool, and a 50% accept threshold.

---

Takero Funaki (6):
  mm: zswap: fix global shrinker memcg iteration
  mm: zswap: fix global shrinker error handling logic
  mm: zswap: proactive shrinking before pool size limit is hit
  mm: zswap: make writeback run in the background
  mm: zswap: store incompressible page as-is
  mm: zswap: interrupt shrinker writeback while pagein/out IO

 Documentation/admin-guide/mm/zswap.rst |  17 +-
 mm/zswap.c                             | 264 ++++++++++++++++++++-----
 2 files changed, 219 insertions(+), 62 deletions(-)

--
2.43.0