On Mon, Aug 5, 2024 at 4:22 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> The current zswap shrinker's heuristics to prevent overshrinking are
> brittle and inaccurate, specifically in the way we decay the protection
> size (i.e. making pages in the zswap LRU eligible for reclaim).
>
> We currently decay protection aggressively in zswap_lru_add() calls.
> This leads to the following unfortunate effect: when a new batch of
> pages enter zswap, the protection size rapidly decays to below 25% of
> the zswap LRU size, which is way too low.
>
> We have observed this effect in production, when experimenting with the
> zswap shrinker: the rate of shrinking shoots up massively right after a
> new batch of zswap stores. This is somewhat the opposite of what we
> originally want - when new pages enter zswap, we want to protect both
> these new pages AND the pages that are already protected in the zswap
> LRU.
>
> Replace the existing heuristics with a second chance algorithm:
>
> 1. When a new zswap entry is stored in the zswap pool, its referenced
>    bit is set.
> 2. When the zswap shrinker encounters a zswap entry with the referenced
>    bit set, give it a second chance - only flip the referenced bit and
>    rotate it in the LRU.
> 3. If the shrinker encounters the entry again, this time with its
>    referenced bit unset, then it can reclaim the entry.
>
> In this manner, the aging of the pages in the zswap LRUs is decoupled
> from zswap stores, and picks up the pace with increasing memory
> pressure (which is what we want).
>
> The second chance scheme allows us to modulate the writeback rate based
> on recent pool activities. Entries that recently entered the pool will
> be protected, so if the pool is dominated by such entries the writeback
> rate will reduce proportionally, protecting the workload's workingset.
> On the other hand, stale entries will be written back quickly, which
> increases the effective writeback rate.
>
> The referenced bit is added at the hole after the `length` field of
> struct zswap_entry, so there is no extra space overhead for this
> algorithm.
>
> We will still maintain the count of swapins, which is consumed and
> subtracted from the lru size in zswap_shrinker_count(), to further
> penalize past overshrinking that led to disk swapins. The idea is that
> had we considered this many more pages in the LRU active/protected,
> they would not have been written back and we would not have had to
> swap them in.
>
> To test this new heuristic, I built the kernel under a cgroup with
> memory.max set to 2G, on a host with 36 cores:
>
> With the old shrinker:
>
> real: 263.89s
> user: 4318.11s
> sys: 673.29s
> swapins: 227300.5
>
> With the second chance algorithm:
>
> real: 244.85s
> user: 4327.22s
> sys: 664.39s
> swapins: 94663
>
> (average over 5 runs)
>
> We observe a 1.3% reduction in kernel CPU usage, and around a 7.2%
> reduction in real time. Note that the number of swapped in pages
> dropped by 58%.
>
> Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> ---
>  include/linux/zswap.h |  16 +++----
>  mm/zswap.c            | 108 ++++++++++++++++++++++++------------------
>  2 files changed, 70 insertions(+), 54 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 6cecb4a4f68b..9cd1beef0654 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
>
>  struct zswap_lruvec_state {
>          /*
> -         * Number of pages in zswap that should be protected from the shrinker.
> -         * This number is an estimate of the following counts:
> +         * Number of swapped in pages from disk, i.e. not found in the zswap pool.
>           *
> -         * a) Recent page faults.
> -         * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> -         * as well as recent zswap LRU rotations.
> -         *
> -         * These pages are likely to be warm, and might incur IO if the are written
> -         * to swap.
> +         * This is consumed and subtracted from the lru size in
> +         * zswap_shrinker_count() to penalize past overshrinking that led to disk
> +         * swapins. The idea is that had we considered this many more pages in the
> +         * LRU active/protected and not written them back, we would not have had to
> +         * swap them in.
>           */
> -        atomic_long_t nr_zswap_protected;
> +        atomic_long_t nr_disk_swapins;
>  };
>
>  unsigned long zswap_total_pages(void);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index adeaf9c97fde..fb3d9cb88785 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -187,6 +187,10 @@ static struct shrinker *zswap_shrinker;
>   * length - the length in bytes of the compressed page data. Needed during
>   *          decompression. For a same value filled page length is 0, and both
>   *          pool and lru are invalid and must be ignored.
> + * referenced - true if the entry recently entered the zswap pool. Unset by the
> + *              dynamic shrinker. The entry is only reclaimed by the dynamic
> + *              shrinker if referenced is unset. See comments in the shrinker
> + *              section for context.

Nit: It is unset and reclaimed by the writeback logic in general, which
isn't necessarily triggered from the dynamic shrinker, right?

>   * pool - the zswap_pool the entry's data is in
>   * handle - zpool allocation handle that stores the compressed page data
>   * value - value of the same-value filled pages which have same content
> @@ -196,6 +200,7 @@ static struct shrinker *zswap_shrinker;
>  struct zswap_entry {
>          swp_entry_t swpentry;
>          unsigned int length;
> +        bool referenced;
>          struct zswap_pool *pool;
>          union {
>                  unsigned long handle;
> @@ -700,11 +705,8 @@ static inline int entry_to_nid(struct zswap_entry *entry)
>
>  static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>  {
> -        atomic_long_t *nr_zswap_protected;
> -        unsigned long lru_size, old, new;
>          int nid = entry_to_nid(entry);
>          struct mem_cgroup *memcg;
> -        struct lruvec *lruvec;
>
>          /*
>           * Note that it is safe to use rcu_read_lock() here, even in the face of
> @@ -722,19 +724,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>          memcg = mem_cgroup_from_entry(entry);
>          /* will always succeed */
>          list_lru_add(list_lru, &entry->lru, nid, memcg);
> -
> -        /* Update the protection area */
> -        lru_size = list_lru_count_one(list_lru, nid, memcg);
> -        lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> -        nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
> -        old = atomic_long_inc_return(nr_zswap_protected);
> -        /*
> -         * Decay to avoid overflow and adapt to changing workloads.
> -         * This is based on LRU reclaim cost decaying heuristics.
> -         */
> -        do {
> -                new = old > lru_size / 4 ? old / 2 : old;
> -        } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));

Nice, arcane heuristics gone :)

LGTM with the above nit:

Acked-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
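
For readers following the thread, the second chance decision described in
steps 1-3 of the changelog boils down to roughly the sketch below. This is
only an illustration, not a hunk from the patch: the helper name and its
bool return convention are made up here, only the referenced field of
struct zswap_entry comes from the patch, and locking is omitted.

/*
 * Illustrative sketch (not from the patch): whether the LRU walk may
 * write an entry back, or should merely rotate it and move on.
 */
static bool zswap_entry_may_writeback(struct zswap_entry *entry)
{
        if (entry->referenced) {
                /* Recently stored: consume its second chance. */
                entry->referenced = false;
                return false;   /* caller rotates the entry in the LRU */
        }

        /* Second encounter with no store in between: eligible for writeback. */
        return true;
}

The nice property, as the changelog notes, is that aging only advances when
the shrinker actually walks the LRU, i.e. under memory pressure, rather than
on every zswap store.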