From: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx> The memory shrinker of zswap is special: it has to first allocate more memory (folio) to free less memory (compressed copy), later that folio can be freed after written back. So it's better to evict these folios before trying to reclaim other folios. Here may come some problems of LRU management: 1. zswap_writeback_entry() reuses the __read_swap_cache_async(), which is normally only used in the swapin context. This may cause refault accounting and active/workingset information of the folio to be wrong. For example, zswap shrinker writeback try to reclaim the folio, but workingset_refault() mark it active to put it at head of active list. 2. folio_end_writeback() will check PG_reclaim flag which we did set in zswap_writeback_entry(), to try to rotate the folio to the tail of inactive list, to speed up its reclaim. But folio_rotate_reclaimable() won't work as we thought, actually all LRU move interfaces may don't work when the folio is isolated from LRU. (per-cpu add batch is somewhat like isolated from LRU) So when folio_end_writeback() calls folio_rotate_reclaimable(), it won't do nothing but just clear PG_reclaim flag if that folio is isolated (in per-cpu add batch or isolated by vmscan shrinker) 3. so the final result is the folio that has been written back and is expected to be evicted, but now is not at tail of inactive list. Meanwhile vmscan shrinker may try to evict other folios to cause more refaults. There is a report [1] of this problem. We should handle these cases better. First of all, we should consider when and where to put these folios on LRU. 1. after alloc: now we use folio_add_lru() to put folio on local batch, so it will be put at head of inactive/active list when batch drain. 2. after writeback: clear PG_reclaim and folio_rotate_reclaimable(). - after add batch drain: rotate successfully to tail of inactive list - before add batch drain: do nothing since folio is not on LRU list So these are two main time points we care about: the first is somewhat correct IMHO since the folio is under writeback and has PG_reclaim set, it may confuse shrinker if we put those to the tail of inactive list too early. If we really want to put it to tail, we can easily introduce another folio_add_lru_tail() to put on a new local lru_add_tail batch. The second is where we need to improve, we should rotate it to tail even when folio is in local batch. Since we can't manipulate folio LRU status when it's isolated in local batch, an obvious fix is to use folio flag to tell later lru_add_fn() where the folio should be put: active or inactive, head or tail. But the problem is that PG_readahead is the alias of PG_reclaim, we can't put readahead folio to the tail of inactive list obviously. So this patch changes folio_rotate_reclaimable() to queue to local rotate batch even when !PG_lru at first, hoping that: - folio_end_writeback() finish on the same cpu with lru_add batch, so it must be handled after the lru_add batch, after which it will see PG_lru and successfully rotate it to tail of inactive list. - even folio_end_writeback() is not finished on the same cpu, there maybe a big chance that rotate batch is handled after add batch. Testing kernel build in tmpfs with memory.max = 2GB. (zswap shrinker and writeback enabled with one 50GB swapfile) mm-unstable-hot zswap-lru-reclaim real 63.34 62.72 user 1063.20 1060.30 sys 272.04 256.14 workingset_refault_anon 2103297.00 1788155.80 workingset_refault_file 28638.20 39249.40 workingset_activate_anon 746134.00 695435.40 workingset_activate_file 4344.60 4255.80 workingset_restore_anon 653163.80 605315.60 workingset_restore_file 1079.00 883.00 workingset_nodereclaim 0.00 0.00 pgscan 12971305.60 12730331.20 pgscan_kswapd 0.00 0.00 pgscan_direct 12971305.60 12730331.20 pgscan_khugepaged 0.00 0.00 We can see that refault and sys cpu have some improvements. As for the workingset_refault() caused by zswap writeback, maybe we should remove it in zswap writeback case, but there are more pgscan and some regression. I don't know why, so just leave it as it is. This is RFC, any comment or discussion is welcome! Thanks! [1] https://lore.kernel.org/all/20231024142706.195517-1-hezhongkun.hzk@xxxxxxxxxxxxx/ Chengming Zhou (1): mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru() mm/swap.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- 2.40.1