Re: [PATCH v7 6/6] zswap: shrinks zswap pool based on memory pressure

Nhat Pham <nphamcs@xxxxxxxxx> · Wed, 29 Nov 2023 15:44:39 -0800

On Wed, Nov 29, 2023 at 8:21 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Mon, Nov 27, 2023 at 03:46:00PM -0800, Nhat Pham wrote:
> > Currently, we only shrink the zswap pool when the user-defined limit is
> > hit. This means that if we set the limit too high, cold data that are
> > unlikely to be used again will reside in the pool, wasting precious
> > memory. It is hard to predict how much zswap space will be needed ahead
> > of time, as this depends on the workload (specifically, on factors such
> > as memory access patterns and compressibility of the memory pages).
> >
> > This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> > is initiated when there is memory pressure. The shrinker does not
> > have any parameter that must be tuned by the user, and can be opted in
> > or out on a per-memcg basis.
> >
> > Furthermore, to make it more robust for many workloads and prevent
> > overshrinking (i.e., evicting warm pages that might be refaulted into
> > memory), we build in the following heuristics:
> >
> > * Estimate the number of warm pages residing in zswap, and attempt to
> >   protect this region of the zswap LRU.
> > * Scale the number of freeable objects by an estimate of the memory
> >   saving factor. The better zswap compresses the data, the fewer pages
> >   we will evict to swap (as we will otherwise incur IO for relatively
> >   small memory saving).
> > * During reclaim, if the shrinker encounters a page that is also being
> >   brought into memory, the shrinker will cautiously terminate its
> >   shrinking action, as this is a sign that it is touching the warmer
> >   region of the zswap LRU.
> >
> > As a proof of concept, we ran the following synthetic benchmark:
> > build the linux kernel in a memory-limited cgroup, and allocate some
> > cold data in tmpfs to see if the shrinker could write them out and
> > improve the overall performance. Depending on the amount of cold data
> > generated, we observe from 14% to 35% reduction in kernel CPU time used
> > in the kernel builds.
>
> I think this is a really important piece of functionality for zswap.
>
> Memory compression has been around and in use for a long time, but the
> question of zswap vs swap sizing has been in the room since the hybrid
> mode was first proposed. Because depending on the reuse patterns,
> putting zswap with a static size limit in front of an existing swap
> file can be a net negative for performance as it consumes more memory.
>
> It's great to finally see a solution to this which makes zswap *much*
> more general purpose. And something that distributions might want to
> turn on per default when swap is configured.
>
> Actually to the point where I think there should be a config option to
> enable the shrinker per default. Maybe not right away, but in a few
> releases when this feature has racked up some more production time.

Sure thingy - how does everyone feel about this?

>
> > @@ -687,6 +687,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >                                       &page_allocated, false);
> >       if (unlikely(page_allocated))
> >               swap_readpage(page, false, NULL);
> > +     zswap_lruvec_swapin(page);
>
> The "lruvec" in the name vs the page parameter is a bit odd.
> zswap_page_swapin() would be slightly better, but it still also sounds
> like it would cause an actual swapin of some sort.
>
> zswap_record_swapin()?

Hmm that sounds good to me. I'm not very good with naming, if that's
not already evident :)

>
> > @@ -520,6 +575,95 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >       return entry;
> >  }
> >
> > +/*********************************
> > +* shrinker functions
> > +**********************************/
> > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> > +                                    spinlock_t *lock, void *arg);
> > +
> > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> > +             struct shrink_control *sc)
> > +{
> > +     struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > +     unsigned long shrink_ret, nr_protected, lru_size;
> > +     struct zswap_pool *pool = shrinker->private_data;
> > +     bool encountered_page_in_swapcache = false;
> > +
> > +     nr_protected =
> > +             atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > +     lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> > +
> > +     /*
> > +      * Abort if the shrinker is disabled or if we are shrinking into the
> > +      * protected region.
> > +      */
> > +     if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> > +             sc->nr_scanned = 0;
> > +             return SHRINK_STOP;
> > +     }
>
> I'm scratching my head at the protection check. zswap_shrinker_count()
> already factors protection into account, so sc->nr_to_scan should only
> be what is left on the list after excluding the protected area.
>
> Do we even get here if the whole list is protected? Is this to protect
> against concurrent shrinking of the list through multiple shrinkers or
> swapins? If so, a comment would be nice :)

Yep, this is mostly for concurrent shrinkers. Please fact-check me,
but IIUC if we have too many reclaimers all calling upon the zswap
shrinker (before any of them can make substantial progress), we can
have a situation where the total number of objects freed by the
reclaimers will eat into the protection area of the zswap LRU (even if
the number of freeable objects is scaled down by the compression
ratio, and further scaled down internally in the shrinker/vmscan
code). I've observed this tendency when there are a) a lot of worker
threads in my benchmark and b) memory pressure. This is a crude/racy
way to alleviate the issue.
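
To make the check concrete, here is a rough userspace model of it (a
sketch only, not the patch code; scan_hits_protected() is a made-up
name). One thing it makes visible: the counters are unsigned long, so
lru_size - nr_to_scan wraps around when nr_to_scan is larger than
lru_size. The model treats that case as "hits the protected region",
which is my assumption about the intended behavior:

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the abort condition in
 * zswap_shrinker_scan(). Returns true if scanning nr_to_scan entries
 * from the cold end of an LRU holding lru_size entries would reach the
 * nr_protected entries at the warm end.
 */
static bool scan_hits_protected(unsigned long nr_protected,
				unsigned long lru_size,
				unsigned long nr_to_scan)
{
	/*
	 * Assumption: a request covering the whole list (or more) always
	 * counts as reaching the protected region. This also avoids the
	 * unsigned wraparound in lru_size - nr_to_scan below.
	 */
	if (nr_to_scan >= lru_size)
		return true;

	/* Same comparison as in the patch, now safe from wraparound. */
	return nr_protected >= lru_size - nr_to_scan;
}
```

With 100 entries on the LRU and 50 of them protected, a reclaimer asking
for 40 goes through, while one asking for 60 is turned away. If the list
shrank to 10 entries under a concurrent reclaimer, a stale request for
20 is turned away as well.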

I think this is actually a wider problem than just zswap and the zswap
shrinker - we need better reclaimer throttling logic IMO. Perhaps this
check should be done higher up the stack - something along the lines
of having each reclaimer "register" its intention (number of objects
it wants to reclaim) to a particular shrinker, allowing the shrinker
to deny a reclaimer if there is already a strong reclaim driving
force. Or some other throttling heuristics based on the number of
freeable objects and the reclaimer registration data.
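
Very roughly, the "registration" idea could look like the userspace
sketch below. Nothing like this exists in the shrinker code today;
struct shrink_budget, reclaimer_register() and reclaimer_unregister()
are hypothetical names, and C11 atomics stand in for the kernel's
atomic_long_t helpers:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-shrinker bookkeeping of in-flight reclaim requests. */
struct shrink_budget {
	atomic_ulong pending;	/* objects already claimed by reclaimers */
};

/*
 * Try to claim 'want' objects out of 'freeable'. The request is denied
 * if the objects already claimed plus this request would exceed what is
 * freeable, i.e. if other reclaimers are already pushing hard enough.
 */
static bool reclaimer_register(struct shrink_budget *b, unsigned long want,
			       unsigned long freeable)
{
	unsigned long cur = atomic_load(&b->pending);

	do {
		/* Written as subtraction so that cur + want cannot overflow. */
		if (want > freeable || cur > freeable - want)
			return false;
		/* On failure, cur is refreshed and the check is redone. */
	} while (!atomic_compare_exchange_weak(&b->pending, &cur, cur + want));

	return true;
}

/* Give the claim back once the reclaimer is done (or has given up). */
static void reclaimer_unregister(struct shrink_budget *b, unsigned long want)
{
	atomic_fetch_sub(&b->pending, want);
}
```

A reclaimer would register before scanning and unregister afterwards,
so with 100 freeable objects and 60 already claimed, a second request
for 50 gets denied while one for 40 still fits.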

>
> Otherwise, this looks great to me!
>
> Just nitpicks, no show stoppers:
>
> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>