Re: [PATCH v6 6/6] zswap: shrinks zswap pool based on memory pressure

Nhat Pham <nphamcs@xxxxxxxxx> · Tue, 28 Nov 2023 15:04:03 -0800

On Mon, Nov 27, 2023 at 1:00 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, 27 Nov 2023 11:37:03 -0800 Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> > Currently, we only shrink the zswap pool when the user-defined limit is
> > hit. This means that if we set the limit too high, cold data that are
> > unlikely to be used again will reside in the pool, wasting precious
> > memory. It is hard to predict how much zswap space will be needed ahead
> > of time, as this depends on the workload (specifically, on factors such
> > as memory access patterns and compressibility of the memory pages).
> >
> > This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> > is initiated when there is memory pressure. The shrinker does not
> > have any parameter that must be tuned by the user, and can be opted in
> > or out on a per-memcg basis.
> >
> > Furthermore, to make it more robust for many workloads and prevent
> > overshrinking (i.e evicting warm pages that might be refaulted into
> > memory), we build in the following heuristics:
> >
> > * Estimate the number of warm pages residing in zswap, and attempt to
> >   protect this region of the zswap LRU.
> > * Scale the number of freeable objects by an estimate of the memory
> >   saving factor. The better zswap compresses the data, the fewer pages
> >   we will evict to swap (as we will otherwise incur IO for relatively
> >   small memory saving).
> > * During reclaim, if the shrinker encounters a page that is also being
> >   brought into memory, the shrinker will cautiously terminate its
> >   shrinking action, as this is a sign that it is touching the warmer
> >   region of the zswap LRU.
> >
> > As a proof of concept, we ran the following synthetic benchmark:
> > build the linux kernel in a memory-limited cgroup, and allocate some
> > cold data in tmpfs to see if the shrinker could write them out and
> > improved the overall performance. Depending on the amount of cold data
> > generated, we observe from 14% to 35% reduction in kernel CPU time used
> > in the kernel builds.
> >
> > ...
> >
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -22,6 +22,7 @@
> >  #include <linux/mm_types.h>
> >  #include <linux/page-flags.h>
> >  #include <linux/local_lock.h>
> > +#include <linux/zswap.h>
> >  #include <asm/page.h>
> >
> >  /* Free memory management - zoned buddy allocator.  */
> > @@ -641,6 +642,7 @@ struct lruvec {
> >  #ifdef CONFIG_MEMCG
> >       struct pglist_data *pgdat;
> >  #endif
> > +     struct zswap_lruvec_state zswap_lruvec_state;
>
> Normally we'd put this in #ifdef CONFIG_ZSWAP.
>
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -5,20 +5,40 @@
> >  #include <linux/types.h>
> >  #include <linux/mm_types.h>
> >
> > +struct lruvec;
> > +
> >  extern u64 zswap_pool_total_size;
> >  extern atomic_t zswap_stored_pages;
> >
> >  #ifdef CONFIG_ZSWAP
> >
> > +struct zswap_lruvec_state {
> > +     /*
> > +      * Number of pages in zswap that should be protected from the shrinker.
> > +      * This number is an estimate of the following counts:
> > +      *
> > +      * a) Recent page faults.
> > +      * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> > +      *    as well as recent zswap LRU rotations.
> > +      *
> > +      * These pages are likely to be warm, and might incur IO if the are written
> > +      * to swap.
> > +      */
> > +     atomic_long_t nr_zswap_protected;
> > +};
> > +
> >  bool zswap_store(struct folio *folio);
> >  bool zswap_load(struct folio *folio);
> >  void zswap_invalidate(int type, pgoff_t offset);
> >  void zswap_swapon(int type);
> >  void zswap_swapoff(int type);
> >  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> > -
> > +void zswap_lruvec_state_init(struct lruvec *lruvec);
> > +void zswap_lruvec_swapin(struct page *page);
> >  #else
> >
> > +struct zswap_lruvec_state {};
>
> But instead you made it an empty struct in this case.
>
> That's a bit funky, but I guess OK.  It does send a careful reader of
> struct lruvec over to look at the zswap_lruvec_state definition to
> understand what's going on.

I agree.

Originally, I included the fields in struct lruvec itself, guarded by
a #ifdef CONFIG_ZSWAP as you pointed out here. Yosry gave me this
suggestion to hide the zswap-specific states and details from ordinary
lruvec user, and direct people who care and/or need to know about
details towards the zswap codebase, which is good IMHO.

It is a bit weird I admit, but in this case I think it works out.

>
> >  static inline bool zswap_store(struct folio *folio)
> >  {
> >       return false;
> > @@ -33,7 +53,8 @@ static inline void zswap_invalidate(int type, pgoff_t offset) {}
> >  static inline void zswap_swapon(int type) {}
> >  static inline void zswap_swapoff(int type) {}
> >  static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
> > -
> > +static inline void zswap_lruvec_init(struct lruvec *lruvec) {}
> > +static inline void zswap_lruvec_swapin(struct page *page) {}
>
> Needed this build fix:
>
> --- a/include/linux/zswap.h~zswap-shrinks-zswap-pool-based-on-memory-pressure-fix
> +++ a/include/linux/zswap.h
> @@ -54,6 +54,7 @@ static inline void zswap_swapon(int type
>  static inline void zswap_swapoff(int type) {}
>  static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
>  static inline void zswap_lruvec_init(struct lruvec *lruvec) {}
> +static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
>  static inline void zswap_lruvec_swapin(struct page *page) {}
>  #endif
>
> _
>

Yeah that looks like a typo on my part. My bad. v7 includes this fix.