On Mon, Mar 11, 2024 at 12:12:13PM -0400, Johannes Weiner wrote:
> Profiling the munmap() of a zswapped memory region shows 50%(!) of the
> total cycles currently going into updating the zswap_pool_total_size.

Yikes. I have always hated that size update scheme FWIW.

I have also wondered whether it makes sense to just maintain the number
of pages in zswap as an atomic, like zswap_stored_pages. I guess your
proposed scheme is even cheaper for the load/invalidate paths because
we do nothing at all. It could be an option if the aggregation in other
paths ever becomes a problem, but we would need to make sure it doesn't
regress the load/invalidate paths. Just sharing some thoughts.

> 
> There are three consumers of this counter:
> - store, to enforce the globally configured pool limit
> - meminfo & debugfs, to report the size to the user
> - shrink, to determine the batch size for each cycle
> 
> Instead of aggregating everytime an entry enters or exits the zswap
> pool, aggregate the value from the zpools on-demand:
> 
> - Stores aggregate the counter anyway upon success. Aggregating to
>   check the limit instead is the same amount of work.
> 
> - Meminfo & debugfs might benefit somewhat from a pre-aggregated
>   counter, but aren't exactly hotpaths.
> 
> - Shrinking can aggregate once for every cycle instead of doing it for
>   every freed entry. As the shrinker might work on tens or hundreds of
>   objects per scan cycle, this is a large reduction in aggregations.
> 
> The paths that benefit dramatically are swapin, swapoff, and
> unmaps. There could be millions of pages being processed until
> somebody asks for the pool size again. This eliminates the pool size
> updates from those paths entirely.

This looks like a big win, thanks! I wonder if you have any numbers or
perf profiles to share. That would be nice to have, but I think the
benefit is clear regardless.

I also like the implicit cleanup when we switch to maintaining the
number of pages rather than bytes. The code looks much better with all
the shifts and divisions gone :)

I have a couple of comments below. With them addressed, feel free to
add:

Acked-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>

[..]
> @@ -1385,6 +1365,10 @@ static void shrink_worker(struct work_struct *w)
>  {
>  	struct mem_cgroup *memcg;
>  	int ret, failures = 0;
> +	unsigned long thr;
> +
> +	/* Reclaim down to the accept threshold */
> +	thr = zswap_max_pages() * zswap_accept_thr_percent / 100;

This calculation is repeated twice, so I'd rather keep a helper for it
as an alternative to zswap_can_accept(). Perhaps zswap_threshold_page()
or zswap_acceptance_pages()?

> 
>  	/* global reclaim will select cgroup in a round-robin fashion. */
>  	do {
> @@ -1432,10 +1416,9 @@ static void shrink_worker(struct work_struct *w)
>  			break;
>  		if (ret && ++failures == MAX_RECLAIM_RETRIES)
>  			break;
> -
>  resched:
>  		cond_resched();
> -	} while (!zswap_can_accept());
> +	} while (zswap_total_pages() > thr);
>  }

[..]
> @@ -1711,6 +1700,13 @@ void zswap_swapoff(int type)
> 
>  static struct dentry *zswap_debugfs_root;
> 
> +static int debugfs_get_total_size(void *data, u64 *val)
> +{
> +	*val = zswap_total_pages() * PAGE_SIZE;
> +	return 0;
> +}
> +DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu");

I think we are missing a newline here to maintain the current format
(i.e. "%llu\n").
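
To be concrete, I mean only adding the trailing newline to the format
string, something like the below (untested, just to illustrate):

	/* Trailing "\n" keeps the output consistent with the other debugfs entries */
	DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size,
				 NULL, "%llu\n");
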
> +
>  static int zswap_debugfs_init(void)
>  {
>  	if (!debugfs_initialized())
> @@ -1732,8 +1728,8 @@ static int zswap_debugfs_init(void)
>  			    zswap_debugfs_root, &zswap_reject_compress_poor);
>  	debugfs_create_u64("written_back_pages", 0444,
>  			   zswap_debugfs_root, &zswap_written_back_pages);
> -	debugfs_create_u64("pool_total_size", 0444,
> -			   zswap_debugfs_root, &zswap_pool_total_size);
> +	debugfs_create_file("pool_total_size", 0444,
> +			    zswap_debugfs_root, NULL, &total_size_fops);
>  	debugfs_create_atomic_t("stored_pages", 0444,
>  				zswap_debugfs_root, &zswap_stored_pages);
>  	debugfs_create_atomic_t("same_filled_pages", 0444,
> -- 
> 2.44.0
> 
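
To clarify the helper suggestion above: I was thinking of something as
simple as the sketch below, reusing the zswap_max_pages() call the
patch already makes in shrink_worker(). The name is just one of the
suggestions above, nothing more than an illustration:

	/* Pool size we must shrink back down to before accepting stores again */
	static unsigned long zswap_acceptance_pages(void)
	{
		return zswap_max_pages() * zswap_accept_thr_percent / 100;
	}

Then both places that currently open-code the percentage math could
call it instead.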