On Mon, Aug 05, 2024 at 04:22:42PM -0700, Nhat Pham wrote: > Current zswap shrinker's heuristics to prevent overshrinking is brittle > and inaccurate, specifically in the way we decay the protection size > (i.e making pages in the zswap LRU eligible for reclaim). > > We currently decay protection aggressively in zswap_lru_add() calls. > This leads to the following unfortunate effect: when a new batch of > pages enter zswap, the protection size rapidly decays to below 25% of > the zswap LRU size, which is way too low. > > We have observed this effect in production, when experimenting with the > zswap shrinker: the rate of shrinking shoots up massively right after a > new batch of zswap stores. This is somewhat the opposite of what we want > originally - when new pages enter zswap, we want to protect both these > new pages AND the pages that are already protected in the zswap LRU. > > Replace existing heuristics with a second chance algorithm > > 1. When a new zswap entry is stored in the zswap pool, its referenced > bit is set. > 2. When the zswap shrinker encounters a zswap entry with the referenced > bit set, give it a second chance - only flips the referenced bit and > rotate it in the LRU. > 3. If the shrinker encounters the entry again, this time with its > referenced bit unset, then it can reclaim the entry. > > In this manner, the aging of the pages in the zswap LRUs are decoupled > from zswap stores, and picks up the pace with increasing memory pressure > (which is what we want). > > The second chance scheme allows us to modulate the writeback rate based > on recent pool activities. Entries that recently entered the pool will > be protected, so if the pool is dominated by such entries the writeback > rate will reduce proportionally, protecting the workload's workingset.On > the other hand, stale entries will be written back quickly, which > increases the effective writeback rate. > > The referenced bit is added at the hole after the `length` field of > struct zswap_entry, so there is no extra space overhead for this > algorithm. > > We will still maintain the count of swapins, which is consumed and > subtracted from the lru size in zswap_shrinker_count(), to further > penalize past overshrinking that led to disk swapins. The idea is that > had we considered this many more pages in the LRU active/protected, they > would not have been written back and we would not have had to swapped > them in. > > To test this new heuristics, I built the kernel under a cgroup with > memory.max set to 2G, on a host with 36 cores: > > With the old shrinker: > > real: 263.89s > user: 4318.11s > sys: 673.29s > swapins: 227300.5 > > With the second chance algorithm: > > real: 244.85s > user: 4327.22s > sys: 664.39s > swapins: 94663 > > (average over 5 runs) > > We observe an 1.3% reduction in kernel CPU usage, and around 7.2% > reduction in real time. Note that the number of swapped in pages > dropped by 58%. > > Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx> > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>