On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou <zhouchengming@xxxxxxxxxxxxx> wrote: > When testing the zswap performance by using kernel build -j32 in a tmpfs > directory, I found the scalability of zswap rb-tree is not good, which > is protected by the only spinlock. That would cause heavy lock contention > if multiple tasks zswap_store/load concurrently. > > So a simple solution is to split the only one zswap rb-tree into multiple > rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is > from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"). > > Although this method can't solve the spinlock contention completely, it > can mitigate much of that contention. By how much? Do you have any stats to estimate the amount of contention and the reduction by this patch? I do think lock contention could be a problem here, and it will be even worse with the zswap shrinker enabled (which introduces an theoretically unbounded number of concurrent reclaimers hammering on the zswap rbtree and its lock). I am generally a bit weary about architectural change though, especially if it is just a bandaid. We have tried to reduce the lock contention somewhere else (multiple zpools), and as predicted it just shifts the contention point elsewhere. Maybe we need a deeper architectural re-think. Not an outright NACK of course - just food for thought. > > Another problem when testing the zswap using our default zsmalloc is that > zswap_load() and zswap_writeback_entry() have to malloc a temporary memory > to support !zpool_can_sleep_mapped(). > > Optimize it by reusing the percpu crypto_acomp_ctx->dstmem, which is also > used by zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex. It'd be nice to reduce the (temporary) memory allocation on these paths, but would this introduce contention on the per-cpu dstmem and the mutex that protects it, if there are too many concurrent store/load/writeback requests?