On Sun, Jul 9, 2023 at 4:12 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote: > > On Tue, Jun 20, 2023 at 12:46 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote: > > > > Support using multiple zpools of the same type in zswap, for concurrency > > purposes. A fixed number of 32 zpools is suggested by this commit, which > > was determined empirically. It can be later changed or made into a > > config option if needed. > > > > On a setup with zswap and zsmalloc, comparing a single zpool to 32 > > zpools shows improvements in the zsmalloc lock contention, especially on > > the swap out path. > > > > The following shows the perf analysis of the swapout path when 10 > > workloads are simultaneously reclaiming and refaulting tmpfs pages. > > There are some improvements on the swap in path as well, but less > > significant. > > > > 1 zpool: > > > > |--28.99%--zswap_frontswap_store > > | > > <snip> > > | > > |--8.98%--zpool_map_handle > > | | > > | --8.98%--zs_zpool_map > > | | > > | --8.95%--zs_map_object > > | | > > | --8.38%--_raw_spin_lock > > | | > > | --7.39%--queued_spin_lock_slowpath > > | > > |--8.82%--zpool_malloc > > | | > > | --8.82%--zs_zpool_malloc > > | | > > | --8.80%--zs_malloc > > | | > > | |--7.21%--_raw_spin_lock > > | | | > > | | --6.81%--queued_spin_lock_slowpath > > <snip> > > > > 32 zpools: > > > > |--16.73%--zswap_frontswap_store > > | > > <snip> > > | > > |--1.81%--zpool_malloc > > | | > > | --1.81%--zs_zpool_malloc > > | | > > | --1.79%--zs_malloc > > | | > > | --0.73%--obj_malloc > > | > > |--1.06%--zswap_update_total_size > > | > > |--0.59%--zpool_map_handle > > | | > > | --0.59%--zs_zpool_map > > | | > > | --0.57%--zs_map_object > > | | > > | --0.51%--_raw_spin_lock > > <snip> > > > > Suggested-by: Yu Zhao <yuzhao@xxxxxxxxxx> > > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > > --- > > > > v2 -> v3: > > - Removed config option (Johannes Weiner). Now it's a constant. > > - Fixed spelling typos (Yu Zhao). > > > > v1 -> v2: > > - Prettified perf graph in commit log. > > - Changed zswap_nr_zpools to a macro, changed zswap_pool->zpools to a > > fixed size array instead of a flex array. > > - Removed stale comment. > > > > --- > > mm/zswap.c | 81 ++++++++++++++++++++++++++++++++++++------------------ > > 1 file changed, 54 insertions(+), 27 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 87b204233115..6ee7028497b8 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -142,6 +142,9 @@ static bool zswap_exclusive_loads_enabled = IS_ENABLED( > > CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON); > > module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644); > > > > +/* Number of zpools in zswap_pool (empirically determined for scalability) */ > > +#define ZSWAP_NR_ZPOOLS 32 > > + > > /********************************* > > * data structures > > **********************************/ > > @@ -161,7 +164,7 @@ struct crypto_acomp_ctx { > > * needs to be verified that it's still valid in the tree. > > */ > > struct zswap_pool { > > - struct zpool *zpool; > > + struct zpool *zpools[ZSWAP_NR_ZPOOLS]; > > struct crypto_acomp_ctx __percpu *acomp_ctx; > > struct kref kref; > > struct list_head list; > > @@ -248,7 +251,7 @@ static bool zswap_has_pool; > > > > #define zswap_pool_debug(msg, p) \ > > pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \ > > - zpool_get_type((p)->zpool)) > > + zpool_get_type((p)->zpools[0])) > > > > static int zswap_writeback_entry(struct zswap_entry *entry, > > struct zswap_tree *tree); > > @@ -272,11 +275,13 @@ static void zswap_update_total_size(void) > > { > > struct zswap_pool *pool; > > u64 total = 0; > > + int i; > > > > rcu_read_lock(); > > > > list_for_each_entry_rcu(pool, &zswap_pools, list) > > - total += zpool_get_total_size(pool->zpool); > > + for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) > > + total += zpool_get_total_size(pool->zpools[i]); > > > > rcu_read_unlock(); > > > > @@ -363,6 +368,16 @@ static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > } > > } > > > > +static struct zpool *zswap_find_zpool(struct zswap_entry *entry) > > +{ > > + int i = 0; > > + > > + if (ZSWAP_NR_ZPOOLS > 1) > > + i = hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS)); > > + > > + return entry->pool->zpools[i]; > > +} > > + > > /* > > * Carries out the common pattern of freeing and entry's zpool allocation, > > * freeing the entry itself, and decrementing the number of stored pages. > > @@ -379,7 +394,7 @@ static void zswap_free_entry(struct zswap_entry *entry) > > spin_lock(&entry->pool->lru_lock); > > list_del(&entry->lru); > > spin_unlock(&entry->pool->lru_lock); > > - zpool_free(entry->pool->zpool, entry->handle); > > + zpool_free(zswap_find_zpool(entry), entry->handle); > > zswap_pool_put(entry->pool); > > } > > zswap_entry_cache_free(entry); > > @@ -588,7 +603,8 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor) > > list_for_each_entry_rcu(pool, &zswap_pools, list) { > > if (strcmp(pool->tfm_name, compressor)) > > continue; > > - if (strcmp(zpool_get_type(pool->zpool), type)) > > + /* all zpools share the same type */ > > + if (strcmp(zpool_get_type(pool->zpools[0]), type)) > > continue; > > /* if we can't get it, it's about to be destroyed */ > > if (!zswap_pool_get(pool)) > > @@ -692,6 +708,7 @@ static void shrink_worker(struct work_struct *w) > > > > static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > > { > > + int i; > > struct zswap_pool *pool; > > char name[38]; /* 'zswap' + 32 char (max) num + \0 */ > > gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM; > > @@ -712,15 +729,18 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > > if (!pool) > > return NULL; > > > > - /* unique name for each pool specifically required by zsmalloc */ > > - snprintf(name, 38, "zswap%x", atomic_inc_return(&zswap_pools_count)); > > + for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) { > > + /* unique name for each pool specifically required by zsmalloc */ > > + snprintf(name, 38, "zswap%x", > > + atomic_inc_return(&zswap_pools_count)); > > > > - pool->zpool = zpool_create_pool(type, name, gfp); > > - if (!pool->zpool) { > > - pr_err("%s zpool not available\n", type); > > - goto error; > > + pool->zpools[i] = zpool_create_pool(type, name, gfp); > > + if (!pool->zpools[i]) { > > + pr_err("%s zpool not available\n", type); > > + goto error; > > + } > > } > > - pr_debug("using %s zpool\n", zpool_get_type(pool->zpool)); > > + pr_debug("using %s zpool\n", zpool_get_type(pool->zpools[0])); > > > > strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name)); > > > > @@ -752,8 +772,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > > error: > > if (pool->acomp_ctx) > > free_percpu(pool->acomp_ctx); > > - if (pool->zpool) > > - zpool_destroy_pool(pool->zpool); > > + while (i--) > > + zpool_destroy_pool(pool->zpools[i]); > > kfree(pool); > > return NULL; > > } > > @@ -802,11 +822,14 @@ static struct zswap_pool *__zswap_pool_create_fallback(void) > > > > static void zswap_pool_destroy(struct zswap_pool *pool) > > { > > + int i; > > + > > zswap_pool_debug("destroying", pool); > > > > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node); > > free_percpu(pool->acomp_ctx); > > - zpool_destroy_pool(pool->zpool); > > + for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) > > + zpool_destroy_pool(pool->zpools[i]); > > kfree(pool); > > } > > > > @@ -1070,7 +1093,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > struct page *page; > > struct scatterlist input, output; > > struct crypto_acomp_ctx *acomp_ctx; > > - struct zpool *pool = entry->pool->zpool; > > + struct zpool *pool = zswap_find_zpool(entry); > > > > u8 *src, *tmp = NULL; > > unsigned int dlen; > > @@ -1211,6 +1234,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > struct crypto_acomp_ctx *acomp_ctx; > > struct obj_cgroup *objcg = NULL; > > struct zswap_pool *pool; > > + struct zpool *zpool; > > int ret; > > unsigned int dlen = PAGE_SIZE; > > unsigned long handle, value; > > @@ -1321,10 +1345,11 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > } > > > > /* store */ > > + zpool = zswap_find_zpool(entry); > > gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM; > > - if (zpool_malloc_support_movable(entry->pool->zpool)) > > + if (zpool_malloc_support_movable(zpool)) > > gfp |= __GFP_HIGHMEM | __GFP_MOVABLE; > > - ret = zpool_malloc(entry->pool->zpool, dlen, gfp, &handle); > > + ret = zpool_malloc(zpool, dlen, gfp, &handle); > > if (ret == -ENOSPC) { > > zswap_reject_compress_poor++; > > goto put_dstmem; > > @@ -1333,9 +1358,9 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > zswap_reject_alloc_fail++; > > goto put_dstmem; > > } > > - buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_WO); > > + buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO); > > memcpy(buf, dst, dlen); > > - zpool_unmap_handle(entry->pool->zpool, handle); > > + zpool_unmap_handle(zpool, handle); > > mutex_unlock(acomp_ctx->mutex); > > > > /* populate entry */ > > @@ -1406,6 +1431,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > struct scatterlist input, output; > > struct crypto_acomp_ctx *acomp_ctx; > > u8 *src, *dst, *tmp; > > + struct zpool *zpool; > > unsigned int dlen; > > int ret; > > > > @@ -1427,7 +1453,8 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > goto stats; > > } > > > > - if (!zpool_can_sleep_mapped(entry->pool->zpool)) { > > + zpool = zswap_find_zpool(entry); > > + if (!zpool_can_sleep_mapped(zpool)) { > > tmp = kmalloc(entry->length, GFP_KERNEL); > > if (!tmp) { > > ret = -ENOMEM; > > @@ -1437,12 +1464,12 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > > > /* decompress */ > > dlen = PAGE_SIZE; > > - src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO); > > + src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO); > > > > - if (!zpool_can_sleep_mapped(entry->pool->zpool)) { > > + if (!zpool_can_sleep_mapped(zpool)) { > > memcpy(tmp, src, entry->length); > > src = tmp; > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > + zpool_unmap_handle(zpool, entry->handle); > > } > > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > @@ -1454,8 +1481,8 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); > > mutex_unlock(acomp_ctx->mutex); > > > > - if (zpool_can_sleep_mapped(entry->pool->zpool)) > > - zpool_unmap_handle(entry->pool->zpool, entry->handle); > > + if (zpool_can_sleep_mapped(zpool)) > > + zpool_unmap_handle(zpool, entry->handle); > > else > > kfree(tmp); > > > > @@ -1616,7 +1643,7 @@ static int zswap_setup(void) > > pool = __zswap_pool_create_fallback(); > > if (pool) { > > pr_info("loaded using pool %s/%s\n", pool->tfm_name, > > - zpool_get_type(pool->zpool)); > > + zpool_get_type(pool->zpools[0])); > > list_add(&pool->list, &zswap_pools); > > zswap_has_pool = true; > > } else { > > -- > > 2.41.0.162.gfafddb0af9-goog > > > > In terms of correctness, the code LGTM. > However, I do share Johannes' concern about this change. > > May I ask how sensitive is the system performance to the number of pools? > i.e during the tuning process, did you see lots of performance > variability as you vary the number of pools? Is 32 a number that > works well across workloads, hardware, etc? I did not tune this myself, it has been used in our fleet for many years now, and honestly I am not sure what range of values was tried out. For us, 32 is a number that works well across our entire fleet (that uses zswap ofc). > > Personally, I prefer the per-CPU pool idea - or some way to automatically > determine the number of pools that are more generic than this 32 value > (what if we have new hardware with more CPUs? Would 32 still be valid, > or do we have to change it?). Ideally, we would have the number of zpools be a function of the system specs, or even autotune based on the workload. However, I don't think we have a clear idea about what this should look like. While a constant value is suboptimal, we have multiple constants in MM that seem to be working relatively well across different machines and workloads (e.g. SWAP_CLUSTER_MAX) -- so it's not unheard of. We have been using 32 zpools across different machines and workloads for years now. I would be hesitant to throw away years of production testing right away, without data to back that something else is better. I would prefer to start with something that (at least in our fleet) is proven to be good, and we can easily extend it later to replace 32 with a more sophisticated formula or something that is calculated at boot or even tuned by userspace. Having the support in the code to have multiple zpools is valuable as-is imo. > > I'm experimenting with some other zswap changes - if I have > extra cycles and resources I'll try to apply this patch and see how the > numbers play out. That would be amazing. Looking forward to any numbers you can dig :) > > I'll defer to Johannes and other reviewers for further comments.