Re: [PATCH v6] zswap: replace RB tree with xarray

Johannes Weiner <hannes@xxxxxxxxxxx> · Tue, 12 Mar 2024 14:43:29 -0400

On Tue, Mar 12, 2024 at 10:31:12AM -0700, Chris Li wrote:
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
> 
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
> 
> The zswap_entry size will reduce a bit due to removing the
> RB node, which has two pointers and a color field. Xarray
> store the pointer in the xarray tree rather than the
> zswap_entry. Every entry has one pointer from the xarray
> tree. Overall, switching to xarray should save some memory,
> if the swap entries are densely packed.
> 
> Notice the zswap_rb_search and zswap_rb_insert always
> followed by zswap_rb_erase. Use xa_erase and xa_store
> directly. That saves one tree lookup as well.
> 
> Remove zswap_invalidate_entry due to no need to call
> zswap_rb_erase any more. Use zswap_free_entry instead.
> 
> The "struct zswap_tree" has been replaced by "struct xarray".
> The tree spin lock has transferred to the xarray lock.
> 
> Run the kernel build testing 10 times for each version, averages:
> (memory.max=2GB, zswap shrinker and writeback enabled,
> one 50GB swapfile, 24 HT core, 32 jobs)
> 
> mm-9a0181a3710eb             xarray v5
> user       3532.385			3535.658
> sys        536.231                      530.083
> real       200.431                      200.176

This is a great improvement code and complexity wise.

I have a few questions and comments below:

What kernel version is this based on? It doesn't apply to
mm-everything, and I can't find 9a0181a3710eb anywhere.

> @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
>  insert_entry:
>  	entry->swpentry = swp;
>  	entry->objcg = objcg;
> -	if (objcg) {
> -		obj_cgroup_charge_zswap(objcg, entry->length);
> -		/* Account before objcg ref is moved to tree */
> -		count_objcg_event(objcg, ZSWPOUT);
> -	}
>  
> -	/* map */
> -	spin_lock(&tree->lock);
>  	/*
>  	 * The folio may have been dirtied again, invalidate the
>  	 * possibly stale entry before inserting the new entry.
>  	 */

The comment is now somewhat stale and somewhat out of place. It should
be above that `if (old)` part... See below.

> -	if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> -		zswap_invalidate_entry(tree, dupentry);
> -		WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> +	old = xa_store(tree, offset, entry, GFP_KERNEL);
> +	if (xa_is_err(old)) {
> +		int err = xa_err(old);
> +		if (err == -ENOMEM)
> +			zswap_reject_alloc_fail++;
> +		else
> +			WARN_ONCE(err, "%s: xa_store failed: %d\n",
> +				  __func__, err);
> +		goto store_failed;

No need to complicate it. If we have a bug there, an incorrect fail
stat bump is the least of our concerns. Also, no need for __func__
since that information is included in the WARN:

	if (xa_is_err(old)) {
		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
		zswap_reject_alloc_fail++;
		goto store_failed;
	}

I think here is where that comment above should go:

	/*
	 * We may have had an existing entry that became stale when
	 * the folio was redirtied and now the new version is being
	 * swapped out. Get rid of the old.
	 */
> +	if (old)
> +		zswap_entry_free(old);
> +
> +	if (objcg) {
> +		obj_cgroup_charge_zswap(objcg, entry->length);
> +		/* Account before objcg ref is moved to tree */
> +		count_objcg_event(objcg, ZSWPOUT);
>  	}
> +
>  	if (entry->length) {
>  		INIT_LIST_HEAD(&entry->lru);
>  		zswap_lru_add(&zswap.list_lru, entry);
>  		atomic_inc(&zswap.nr_stored);
>  	}
> -	spin_unlock(&tree->lock);

We previously relied on the tree lock to finish initializing the entry
while it's already in tree. Now we rely on something else:

	1. Concurrent stores and invalidations are excluded by folio lock.

	2. Writeback is excluded by the entry not being on the LRU yet.
	   The publishing order matters to prevent writeback from seeing
	   an incoherent entry.

I think this deserves a comment.

>  	/* update stats */
>  	atomic_inc(&zswap_stored_pages);
> @@ -1585,6 +1510,12 @@ bool zswap_store(struct folio *folio)
>  
>  	return true;
>  
> +store_failed:
> +	if (!entry->length) {
> +		atomic_dec(&zswap_same_filled_pages);
> +		goto freepage;
> +	}

It'd be good to avoid the nested goto. Why not make the pool
operations conditional on entry->length instead:

store_failed:
	if (!entry->length)
		atomic_dec(&zswap_same_filled_pages);
	else {
		zpool_free(zswap_find_zpool(...));
put_pool:
		zswap_pool_put(entry->pool);
	}
freepage:

Not super pretty either, but it's a linear flow at least.