Re: [PATCH v3 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker

On Mon, Aug 5, 2024 at 4:22 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> The current zswap shrinker's heuristics to prevent overshrinking are
> brittle and inaccurate, specifically in the way we decay the protection
> size (i.e., making pages in the zswap LRU eligible for reclaim).
>
> We currently decay protection aggressively in zswap_lru_add() calls.
> This leads to the following unfortunate effect: when a new batch of
> pages enters zswap, the protection size rapidly decays to below 25% of
> the zswap LRU size, which is way too low.
>
> We have observed this effect in production when experimenting with the
> zswap shrinker: the rate of shrinking shoots up massively right after a
> new batch of zswap stores. This is the opposite of what we originally
> want - when new pages enter zswap, we want to protect both these new
> pages AND the pages that are already protected in the zswap LRU.
>
> Replace the existing heuristics with a second chance algorithm:
>
> 1. When a new zswap entry is stored in the zswap pool, its referenced
>    bit is set.
> 2. When the zswap shrinker encounters a zswap entry with the referenced
>    bit set, give it a second chance - only flip the referenced bit and
>    rotate it in the LRU.
> 3. If the shrinker encounters the entry again, this time with its
>    referenced bit unset, then it can reclaim the entry.
>
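
For anyone skimming the thread, the three steps above boil down to the
tiny state machine in the userspace sketch below. The struct and
function names are made up for illustration; the real logic lives in
the shrinker's LRU walk callback in mm/zswap.c, where it also has to
deal with refcounting, locking and the actual writeback.

#include <stdbool.h>
#include <stdio.h>

enum lru_action { LRU_ROTATE, LRU_RECLAIM };

struct entry {
        bool referenced;        /* set when the entry is stored into zswap */
};

/* One visit by the shrinker: strip the reference on the first pass,
 * reclaim on the second. */
static enum lru_action shrinker_visit(struct entry *e)
{
        if (e->referenced) {
                e->referenced = false;
                return LRU_ROTATE;      /* second chance: rotate in the LRU */
        }
        return LRU_RECLAIM;             /* no reference left: write it back */
}

int main(void)
{
        struct entry e = { .referenced = true };        /* as on zswap store */

        printf("first visit:  %s\n",
               shrinker_visit(&e) == LRU_ROTATE ? "rotate" : "reclaim");
        printf("second visit: %s\n",
               shrinker_visit(&e) == LRU_ROTATE ? "rotate" : "reclaim");
        return 0;
}
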
> In this manner, the aging of the pages in the zswap LRUs is decoupled
> from zswap stores and picks up pace with increasing memory pressure
> (which is what we want).
>
> The second chance scheme allows us to modulate the writeback rate based
> on recent pool activities. Entries that recently entered the pool will
> be protected, so if the pool is dominated by such entries the writeback
> rate will drop proportionally, protecting the workload's workingset. On
> the other hand, stale entries will be written back quickly, which
> increases the effective writeback rate.
>
> The referenced bit is added in the existing padding hole after the
> `length` field of struct zswap_entry, so there is no extra space
> overhead for this algorithm.
>
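
As a side note, the "no extra space overhead" point is easy to
sanity-check with a userspace mock-up of the layout. The structs below
only roughly mirror struct zswap_entry (the exact fields depend on the
config), but on a typical LP64 build both variants print the same size,
because the bool lands in the padding hole after `length`:

#include <stdbool.h>
#include <stdio.h>

/* Rough stand-ins for the kernel types; only size/alignment matter here. */
struct mock_entry_old {
        unsigned long swpentry;         /* swp_entry_t */
        unsigned int length;
        /* 4-byte padding hole on LP64 */
        void *pool;                     /* struct zswap_pool * */
        unsigned long handle;           /* union { handle; value; } */
        void *objcg;                    /* struct obj_cgroup * */
        void *lru_prev, *lru_next;      /* struct list_head */
};

struct mock_entry_new {
        unsigned long swpentry;
        unsigned int length;
        bool referenced;                /* fits in the former padding hole */
        void *pool;
        unsigned long handle;
        void *objcg;
        void *lru_prev, *lru_next;
};

int main(void)
{
        printf("old: %zu bytes, new: %zu bytes\n",
               sizeof(struct mock_entry_old), sizeof(struct mock_entry_new));
        return 0;
}
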
> We will still maintain the count of swapins, which is consumed and
> subtracted from the lru size in zswap_shrinker_count(), to further
> penalize past overshrinking that led to disk swapins. The idea is that
> had we considered this many more pages in the LRU active/protected, they
> would not have been written back and we would not have had to swap
> them in.
>
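
A minimal sketch of that accounting, assuming for simplicity that the
counter is consumed in full on every call; the real
zswap_shrinker_count() handles the counter atomically and applies
additional scaling on top:

#include <stdio.h>

static unsigned long nr_disk_swapins;   /* bumped on swapins that miss zswap */

/* Hypothetical count callback: recent disk swapins are treated as evidence
 * of past overshrinking, so that many LRU entries are taken off the table. */
static unsigned long shrinker_count(unsigned long lru_size)
{
        unsigned long penalty = nr_disk_swapins;

        nr_disk_swapins = 0;    /* consume: the penalty applies only once */
        return lru_size > penalty ? lru_size - penalty : 0;
}

int main(void)
{
        nr_disk_swapins = 300;
        printf("reclaimable: %lu\n", shrinker_count(1000));     /* 700 */
        printf("reclaimable: %lu\n", shrinker_count(1000));     /* 1000 */
        return 0;
}
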
> To test the new heuristic, I built the kernel under a cgroup with
> memory.max set to 2G, on a host with 36 cores:
>
> With the old shrinker:
>
> real: 263.89s
> user: 4318.11s
> sys: 673.29s
> swapins: 227300.5
>
> With the second chance algorithm:
>
> real: 244.85s
> user: 4327.22s
> sys: 664.39s
> swapins: 94663
>
> (average over 5 runs)
>
> We observe a 1.3% reduction in kernel CPU usage and around a 7.2%
> reduction in real time. Note that the number of swapped-in pages
> dropped by 58%.
>
> Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> ---
>  include/linux/zswap.h |  16 +++----
>  mm/zswap.c            | 108 ++++++++++++++++++++++++------------------
>  2 files changed, 70 insertions(+), 54 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 6cecb4a4f68b..9cd1beef0654 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages;
>
>  struct zswap_lruvec_state {
>         /*
> -        * Number of pages in zswap that should be protected from the shrinker.
> -        * This number is an estimate of the following counts:
> +        * Number of swapped in pages from disk, i.e not found in the zswap pool.
>          *
> -        * a) Recent page faults.
> -        * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> -        *    as well as recent zswap LRU rotations.
> -        *
> -        * These pages are likely to be warm, and might incur IO if the are written
> -        * to swap.
> +        * This is consumed and subtracted from the lru size in
> +        * zswap_shrinker_count() to penalize past overshrinking that led to disk
> +        * swapins. The idea is that had we considered this many more pages in the
> +        * LRU active/protected and not written them back, we would not have had to
> +        * swap them in.
>          */
> -       atomic_long_t nr_zswap_protected;
> +       atomic_long_t nr_disk_swapins;
>  };
>
>  unsigned long zswap_total_pages(void);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index adeaf9c97fde..fb3d9cb88785 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -187,6 +187,10 @@ static struct shrinker *zswap_shrinker;
>   * length - the length in bytes of the compressed page data.  Needed during
>   *          decompression. For a same value filled page length is 0, and both
>   *          pool and lru are invalid and must be ignored.
> + * referenced - true if the entry recently entered the zswap pool. Unset by the
> + *              dynamic shrinker. The entry is only reclaimed by the dynamic
> + *              shrinker if referenced is unset. See comments in the shrinker
> + *              section for context.

Nit: It is unset and reclaimed by the writeback logic in general,
which isn't necessarily triggered from the dynamic shrinker, right?

>   * pool - the zswap_pool the entry's data is in
>   * handle - zpool allocation handle that stores the compressed page data
>   * value - value of the same-value filled pages which have same content
> @@ -196,6 +200,7 @@ static struct shrinker *zswap_shrinker;
>  struct zswap_entry {
>         swp_entry_t swpentry;
>         unsigned int length;
> +       bool referenced;
>         struct zswap_pool *pool;
>         union {
>                 unsigned long handle;
> @@ -700,11 +705,8 @@ static inline int entry_to_nid(struct zswap_entry *entry)
>
>  static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>  {
> -       atomic_long_t *nr_zswap_protected;
> -       unsigned long lru_size, old, new;
>         int nid = entry_to_nid(entry);
>         struct mem_cgroup *memcg;
> -       struct lruvec *lruvec;
>
>         /*
>          * Note that it is safe to use rcu_read_lock() here, even in the face of
> @@ -722,19 +724,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
>         memcg = mem_cgroup_from_entry(entry);
>         /* will always succeed */
>         list_lru_add(list_lru, &entry->lru, nid, memcg);
> -
> -       /* Update the protection area */
> -       lru_size = list_lru_count_one(list_lru, nid, memcg);
> -       lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> -       nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
> -       old = atomic_long_inc_return(nr_zswap_protected);
> -       /*
> -        * Decay to avoid overflow and adapt to changing workloads.
> -        * This is based on LRU reclaim cost decaying heuristics.
> -        */
> -       do {
> -               new = old > lru_size / 4 ? old / 2 : old;
> -       } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));

Nice, arcane heuristics gone :)

LGTM with the above nit:
Acked-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
