On Wed, May 31, 2023 at 02:29:11AM +0000, Yosry Ahmed wrote: > Support using multiple zpools of the same type in zswap, for concurrency > purposes. Add CONFIG_ZSWAP_NR_ZPOOLS_ORDER to control the number of > zpools. The order is specific by the config rather than the absolute > number to guarantee a power of 2. This is useful so that we can use > deterministically link each entry to a zpool by hashing the zswap_entry > pointer. > > On a setup with zswap and zsmalloc, comparing a single zpool (current > default) to 32 zpools (by setting CONFIG_ZSWAP_NR_ZPOOLS_ORDER=32) shows > improvements in the zsmalloc lock contention, especially on the swap out > path. > > The following shows the perf analysis of the swapout path when 10 > workloads are simulatenously reclaiming and refaulting tmpfs pages. > There are some improvements on the swapin path as well, but much less > significant. > > 1 zpool: > > |--28.99%--zswap_frontswap_store > | | > <snip> > | | > | |--8.98%--zpool_map_handle > | | | > | | --8.98%--zs_zpool_map > | | | > | | --8.95%--zs_map_object > | | | > | | --8.38%--_raw_spin_lock > | | | > | | --7.39%--queued_spin_lock_slowpath > | | > | |--8.82%--zpool_malloc > | | | > | | --8.82%--zs_zpool_malloc > | | | > | | --8.80%--zs_malloc > | | | > | | |--7.21%--_raw_spin_lock > | | | | > | | | --6.81%--queued_spin_lock_slowpath > <snip> > > 32 zpools: > > |--16.73%--zswap_frontswap_store > | | > <snip> > | | > | |--1.81%--zpool_malloc > | | | > | | --1.81%--zs_zpool_malloc > | | | > | | --1.79%--zs_malloc > | | | > | | --0.73%--obj_malloc > | | > | |--1.06%--zswap_update_total_size > | | > | |--0.59%--zpool_map_handle > | | | > | | --0.59%--zs_zpool_map > | | | > | | --0.57%--zs_map_object > | | | > | | --0.51%--_raw_spin_lock > <snip> > > Suggested-by: Yu Zhao <yuzhao@xxxxxxxxxx> > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > --- > mm/Kconfig | 12 +++++++ > mm/zswap.c | 95 ++++++++++++++++++++++++++++++++++++------------------ > 2 files changed, 76 insertions(+), 31 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 92c30879bf67..de1da56d2c07 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -59,6 +59,18 @@ config ZSWAP_EXCLUSIVE_LOADS > The cost is that if the page was never dirtied and needs to be > swapped out again, it will be re-compressed. > > +config ZSWAP_NR_ZPOOLS_ORDER > + int "Number of zpools in zswap, as power of 2" > + default 0 > + depends on ZSWAP > + help > + This options determines the number of zpools to use for zswap, it > + will be 1 << CONFIG_ZSWAP_NR_ZPOOLS_ORDER. > + > + Having multiple zpools helps with concurrency and lock contention > + on the swap in and swap out paths, but uses a little bit of extra > + space. This is nearly impossible for a user, let alone a distribution, to answer adequately. The optimal value needs to be found empirically. And it varies heavily not just by workload but by implementation changes. If we make changes to the lock holding time or redo the data structures, a previously chosen value might no longer be a net positive, and even be harmful. Architecturally, the pool scoping and the interaction with zswap_tree is currently a mess. We're aware of lifetime bugs, where swapoff kills the tree but the pool still exists with entries making dead references e.g. We likely need to rearchitect this in the near future - maybe tie pools to trees to begin with. (I'm assuming you're already using multiple swap files to avoid tree lock contention, because it has the same scope as the pool otherwise.) Such changes quickly invalidate any settings the user or distro might make here. Exposing the implementation detail of the pools might even get in the way of fixing bugs and cleaning up the architecture. > @@ -263,11 +266,13 @@ static void zswap_update_total_size(void) > { > struct zswap_pool *pool; > u64 total = 0; > + int i; > > rcu_read_lock(); > > list_for_each_entry_rcu(pool, &zswap_pools, list) > - total += zpool_get_total_size(pool->zpool); > + for (i = 0; i < zswap_nr_zpools; i++) > + total += zpool_get_total_size(pool->zpools[i]); This adds a O(nr_pools) operation to every load and store. It's easy for this to dominate or outweigh locking costs as workload concurrency changes, or data structures and locking change inside the kernel. > @@ -587,14 +603,17 @@ static void shrink_worker(struct work_struct *w) > { > struct zswap_pool *pool = container_of(w, typeof(*pool), > shrink_work); > + int i; > > - if (zpool_shrink(pool->zpool, 1, NULL)) > - zswap_reject_reclaim_fail++; > + for (i = 0; i < zswap_nr_zpools; i++) > + if (zpool_shrink(pool->zpools[i], 1, NULL)) > + zswap_reject_reclaim_fail++; > zswap_pool_put(pool); This scales reclaim batch size by the number of zpools, which can lead to varying degrees of overreclaim especially on small zswap sizes with high pool concurrency. I don't think this patch is ready for primetime. A user build time setting is not appropriate for an optimization that is heavily tied to implementation details and workload dynamics.