On Mon, Nov 27, 2023 at 03:46:00PM -0800, Nhat Pham wrote:
> Currently, we only shrink the zswap pool when the user-defined limit is
> hit. This means that if we set the limit too high, cold data that are
> unlikely to be used again will reside in the pool, wasting precious
> memory. It is hard to predict how much zswap space will be needed ahead
> of time, as this depends on the workload (specifically, on factors such
> as memory access patterns and compressibility of the memory pages).
>
> This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> is initiated when there is memory pressure. The shrinker does not
> have any parameter that must be tuned by the user, and can be opted in
> or out on a per-memcg basis.
>
> Furthermore, to make it more robust for many workloads and prevent
> overshrinking (i.e evicting warm pages that might be refaulted into
> memory), we build in the following heuristics:
>
> * Estimate the number of warm pages residing in zswap, and attempt to
>   protect this region of the zswap LRU.
> * Scale the number of freeable objects by an estimate of the memory
>   saving factor. The better zswap compresses the data, the fewer pages
>   we will evict to swap (as we will otherwise incur IO for relatively
>   small memory saving).
> * During reclaim, if the shrinker encounters a page that is also being
>   brought into memory, the shrinker will cautiously terminate its
>   shrinking action, as this is a sign that it is touching the warmer
>   region of the zswap LRU.
>
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improved the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.

I think this is a really important piece of functionality for zswap.
Memory compression has been around and in use for a long time, but the
question of zswap vs swap sizing has been in the room since the hybrid
mode was first proposed. Depending on the reuse patterns, putting zswap
with a static size limit in front of an existing swap file can be a net
negative for performance, as it consumes more memory.

It's great to finally see a solution to this, one that makes zswap
*much* more general purpose. It's also something that distributions
might want to turn on by default when swap is configured.

Actually, to the point where I think there should be a config option to
enable the shrinker by default. Maybe not right away, but in a few
releases, when this feature has racked up some more production time.

> @@ -687,6 +687,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  						&page_allocated, false);
>  	if (unlikely(page_allocated))
>  		swap_readpage(page, false, NULL);
> +	zswap_lruvec_swapin(page);

The "lruvec" in the name vs the page parameter is a bit odd.
zswap_page_swapin() would be slightly better, but it still sounds
like it would cause an actual swapin of some sort.

zswap_record_swapin()?

> @@ -520,6 +575,95 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
>  	return entry;
>  }
>
> +/*********************************
> +* shrinker functions
> +**********************************/
> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> +				       spinlock_t *lock, void *arg);
> +
> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> +		struct shrink_control *sc)
> +{
> +	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> +	unsigned long shrink_ret, nr_protected, lru_size;
> +	struct zswap_pool *pool = shrinker->private_data;
> +	bool encountered_page_in_swapcache = false;
> +
> +	nr_protected =
> +		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> +	lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> +
> +	/*
> +	 * Abort if the shrinker is disabled or if we are shrinking into the
> +	 * protected region.
> +	 */
> +	if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> +		sc->nr_scanned = 0;
> +		return SHRINK_STOP;
> +	}

I'm scratching my head at the protection check. zswap_shrinker_count()
already takes protection into account, so sc->nr_to_scan should only
be what is left on the list after excluding the protected area.

Do we even get here if the whole list is protected?

Is this to protect against concurrent shrinking of the list through
multiple shrinkers or swapins? If so, a comment would be nice :)

Otherwise, this looks great to me!

Just nitpicks, no show stoppers:

Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
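
For illustration only, the count/scan interaction questioned in the
protection-check comment can be modeled in user-space C. This is a rough
sketch, not the patch's code: model_count() and model_scan_aborts() are
hypothetical names, and the count side is assumed to report simply the
LRU size minus the protected region (the compression-ratio scaling from
the commit message is left out). Only the shape of the scan-side
comparison is taken from the quoted hunk.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the count side: report only the objects that
 * lie beyond the protected region of the LRU. Assumption: the real
 * zswap_shrinker_count() may scale this number further.
 */
static unsigned long model_count(unsigned long lru_size,
				 unsigned long nr_protected)
{
	return lru_size > nr_protected ? lru_size - nr_protected : 0;
}

/*
 * Model of the scan-side check from the quoted hunk:
 *
 *	nr_protected >= lru_size - nr_to_scan
 *
 * It is rearranged as an addition here so that this user-space model
 * cannot wrap around when nr_to_scan exceeds lru_size. Note that the
 * comparison is inclusive: scanning right up to the protected boundary
 * also counts as an abort.
 */
static bool model_scan_aborts(unsigned long lru_size,
			      unsigned long nr_protected,
			      unsigned long nr_to_scan)
{
	return nr_protected + nr_to_scan >= lru_size;
}
```

Under these assumptions, with 1000 objects on the LRU and 400 of them
protected, model_count(1000, 400) reports 600, and a batch of 128 does
not trip the check: model_scan_aborts(1000, 400, 128) is false. If the
list shrinks to 500 objects between count and scan (for instance
through another shrinker or swapins), the same batch would reach into
the protected region and model_scan_aborts(500, 400, 128) is true. In
this model, that concurrent case is the one where the scan-side check
does work the count side cannot.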