This series addresses some issues and introduces a minor improvement in
the zswap global shrinker.

This version addresses issues discovered during the review of v1:
https://lore.kernel.org/linux-mm/20240608155316.451600-1-flintglass@xxxxxxxxx/
and includes three additional patches to fix another issue uncovered by
applying v1.

Changes in v2:
- Added 3 patches to reduce resource contention.

mm: zswap: fix global shrinker memcg iteration:
- Change the loop style. (Yosry, Nhat, Shakeel)
- To not skip online memcgs, save zswap_last_shrink to detect cursor
  changes by the cleaner. (Yosry, Nhat, Shakeel)

mm: zswap: fix global shrinker error handling logic:
- Change the error code for no-writeback memcgs. (Yosry)
- Use nr_scanned to check if the lru is empty. (Yosry)

Changes in v1:

mm: zswap: fix global shrinker memcg iteration:
- Drop and reacquire the spinlock before skipping a memcg.
- Add comments to clarify the locking mechanism.

mm: zswap: proactive shrinking before pool size limit is hit:
- Remove an unneeded check before scheduling work.
- Change the shrink start threshold to accept_thr_percent + 1%.

Patches
===============================

1. Fix the memcg iteration logic that aborts iteration on offline memcgs.
2. Fix the error path that aborts on expected error codes.
3. Add proactive shrinking at accept threshold + 1%.
4. Drop the global shrinker workqueue WQ_MEM_RECLAIM flag to not block
   pageout.
5. Store incompressible pages as-is to accept all pages.
6. Interrupt the shrinker to avoid blocking page-in/out.

Patches 1 to 3 should be applied in this order to avoid potential loops
caused by the first issue. Patches 3 and later can be applied
independently, but the first two issues must be resolved to ensure the
shrinker can evict pages. Patches 4 to 6 address resource contention
issues in the current shrinker that were uncovered by applying patch 3.

With this series applied, the shrinker proactively keeps evicting pages
until the pool falls below accept_threshold_percent, as documented in
patch 3. As a side effect of the changes to the hysteresis logic, zswap
no longer rejects pages while under the max pool limit.

The series is rebased on mainline 6.10-rc6.

Proactive shrinking (Patches 1 to 3)
===============================

Visible issue to resolve
-------------------------------

The visible issue with the current global shrinker is that pageout/in
operations from active processes are slow when zswap is near its max
pool size. This is particularly significant on small memory systems
where total swap usage exceeds what zswap can store. It results in old
pages occupying most of the zswap pool space, while recent pages use the
swap disk directly.

Root cause of the issue
-------------------------------

This issue is caused by zswap maintaining the pool size near 100%. Since
the shrinker fails to shrink the pool to accept_threshold_percent, zswap
rejects incoming pages more frequently than it should. The rejected
pages are written directly to disk while zswap protects old pages from
eviction, leading to slow pageout/in performance for recent pages.

Changes to resolve the issue
-------------------------------

If the pool size were shrunk proactively, rejections due to pool limit
hits would be less likely. New incoming pages could be accepted as the
pool gains some space in advance, while older pages are written back in
the background. zswap would then be filled with recent pages, as
expected by the LRU logic.

Patches 1 and 2 ensure the shrinker reduces the pool to
accept_threshold_percent.
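As a rough illustration of the behavior patches 1 and 2 aim for, the
global shrinker loop should behave roughly like the sketch below. This
is a simplified sketch only, not code from the patches: zswap's shared
iteration cursor, locking, and memcg reference handling are elided, and
pool_over_accept_thr() / writeback_one_memcg() are placeholder helpers.

	/*
	 * Intended global shrinker behavior, simplified: walk the memcg
	 * tree round-robin and keep writing back until the pool drops to
	 * the accept threshold.  Offline memcgs are skipped, and only
	 * genuine writeback failures count toward the retry limit; a
	 * memcg with nothing stored in zswap is not an error.
	 */
	static void shrink_worker_sketch(void)
	{
		struct mem_cgroup *memcg = NULL;
		int failures = 0;

		while (pool_over_accept_thr()) {	/* placeholder threshold check */
			memcg = mem_cgroup_iter(NULL, memcg, NULL);
			if (!memcg)
				continue;	/* full round trip completed, start over */

			if (!mem_cgroup_online(memcg))
				continue;	/* offline memcg: skip, not a failure */

			/* placeholder for the per-memcg zswap LRU writeback step */
			if (writeback_one_memcg(memcg) < 0 &&
			    ++failures >= MAX_RECLAIM_RETRIES)
				break;

			cond_resched();
		}

		/* drop the iterator reference if we stopped mid-walk */
		if (memcg)
			mem_cgroup_iter_break(NULL, memcg);
	}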
Without patch 2, shrink_worker() stops evicting pages on the 16th
encounter with a memcg that has no stored pages (e.g., tiny services
managed under systemd).

Patch 3 makes zswap_store() trigger the shrinker before the max pool
size is reached. With this series, zswap prepares some space in advance
to reduce the probability of problematic pool_limit_hit situations, thus
reducing slow reclaim and the page priority inversion against the LRU.

Visible changes
-------------------------------

Once proactive shrinking reduces the pool size, pageouts complete
instantly as long as the space prepared by proactive shrinking can store
the reclaimed pages. If an admin sees a large pool_limit_hit, lowering
accept_threshold_percent will improve active process performance. The
optimal point depends on the tradeoff between active process pageout/in
performance and background services' pagein performance.

Benchmark
-------------------------------

The benchmark below observes active process performance under memory
pressure, coexisting with background services occupying the zswap pool.
The tests were run on a VM with 2 vCPUs and 1GB RAM + 4GB swap, using a
20% max pool and a 50% accept threshold (about 100MB of proactive
shrinking), with zsmalloc/lz4. This test favors the active process over
the background processes to demonstrate my intended workload.

It runs `apt -qq update` 10 times under a 60MB memcg memory.max
constraint to emulate an active process running under pressure. The
zswap pool is filled prior to the test by allocating a large amount of
memory (~1.5GB) to emulate background inactive services.

The counters below are from vmstat and zswap debugfs stats. The times
are averages in seconds, measured with /usr/bin/time -v.

pre-patch, 6.10-rc4

|                    | Start       | End         | Delta      |
|--------------------|-------------|-------------|------------|
| pool_limit_hit     | 845         | 845         | 0          |
| pool_total_size    | 201539584   | 201539584   | 0          |
| stored_pages       | 63138       | 63138       | 0          |
| written_back_pages | 12          | 12          | 0          |
| pswpin             | 387         | 32412       | 32025      |
| pswpout            | 153078      | 197138      | 44060      |
| zswpin             | 0           | 0           | 0          |
| zswpout            | 63150       | 63150       | 0          |
| zswpwb             | 12          | 12          | 0          |

| Time               |              |
|--------------------|--------------|
| elapsed            | 8.473        |
| user               | 1.881        |
| system             | 0.814        |

post-patch, 6.10-rc4 with patches 1 to 5

|                    | Start       | End         | Delta      |
|--------------------|-------------|-------------|------------|
| pool_limit_hit     | 81861       | 81861       | 0          |
| pool_total_size    | 75001856    | 87117824    | 12115968   |
| reject_reclaim_fail| 0           | 32          | 32         |
| same_filled_pages  | 135         | 135         | 0          |
| stored_pages       | 23919       | 27549       | 3630       |
| written_back_pages | 97665       | 106994      | 10329      |
| pswpin             | 4981        | 8223        | 3242       |
| pswpout            | 179554      | 188883      | 9329       |
| zswpin             | 293863      | 590014      | 296151     |
| zswpout            | 455338      | 764882      | 309544     |
| zswpwb             | 97665       | 106994      | 10329      |

| Time               |              |
|--------------------|--------------|
| elapsed            | 4.525        |
| user               | 1.839        |
| system             | 1.467        |

Although pool_limit_hit did not increase in either case, zswap_store()
was rejecting pages before this series. Note that, before this series,
zswap_store() did not increment pool_limit_hit on rejections caused by
the limit-hit hysteresis (only the first few hits were counted).
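To make that note concrete, the pre-series hysteresis in zswap_store()
behaves roughly like the sketch below. This is illustrative only, not
verbatim kernel code: zswap_pool_limit_hit and zswap_pool_reached_full
are the existing counter/flag in mm/zswap.c, while pool_pages(),
max_pool_pages() and accept_thr_pages() are placeholder names for
zswap's internal accounting helpers.

	/* Simplified sketch of the pre-series limit check in zswap_store(). */
	static bool should_reject_sketch(void)
	{
		unsigned long cur = pool_pages();	/* placeholder accounting helper */

		if (cur >= max_pool_pages()) {
			/* At the hard max limit: counted and rejected. */
			zswap_pool_limit_hit++;
			zswap_pool_reached_full = true;
			return true;	/* reject and kick the global shrinker */
		}

		if (zswap_pool_reached_full) {
			/*
			 * Hysteresis window: between the accept threshold
			 * and the max limit, pages are still rejected, but
			 * pool_limit_hit is NOT incremented.  This is why
			 * the counter barely moves in the tables above even
			 * though pages are being rejected.
			 */
			if (cur > accept_thr_pages())
				return true;
			zswap_pool_reached_full = false;
		}

		return false;		/* accept the page */
	}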