Hi Johannes,

On Tue, Mar 12, 2024 at 11:43 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Mar 12, 2024 at 10:31:12AM -0700, Chris Li wrote:
> > A very deep RB tree requires rebalancing at times. That
> > contributes to the zswap fault latencies. Xarray does not
> > need to perform tree rebalancing. Replacing the RB tree
> > with an xarray can give a small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
> >
> > The zswap_entry size will shrink a bit due to removing the
> > RB node, which has two pointers and a color field. The xarray
> > stores the pointer in the xarray tree rather than in the
> > zswap_entry. Every entry has one pointer from the xarray
> > tree. Overall, switching to xarray should save some memory
> > if the swap entries are densely packed.
> >
> > Notice that zswap_rb_search and zswap_rb_insert are always
> > followed by zswap_rb_erase. Use xa_erase and xa_store
> > directly. That saves one tree lookup as well.
> >
> > Remove zswap_invalidate_entry since there is no need to call
> > zswap_rb_erase any more. Use zswap_free_entry instead.
> >
> > The "struct zswap_tree" has been replaced by "struct xarray".
> > The tree spin lock has been replaced by the xarray lock.
> >
> > Ran the kernel build test 10 times for each version, averages:
> > (memory.max=2GB, zswap shrinker and writeback enabled,
> > one 50GB swapfile, 24 HT cores, 32 jobs)
> >
> >        mm-9a0181a3710eb    xarray v5
> > user   3532.385            3535.658
> > sys     536.231             530.083
> > real    200.431             200.176
>
> This is a great improvement code- and complexity-wise. Thanks!
>
> I have a few questions and comments below:
>
> What kernel version is this based on? It doesn't apply to
> mm-everything, and I can't find 9a0181a3710eb anywhere.

It is based on an old version of mm-unstable. I can try to rebase on
mm-everything or a later mm-unstable.

>
> > @@ -1555,28 +1473,35 @@ bool zswap_store(struct folio *folio)
> >  insert_entry:
> >  	entry->swpentry = swp;
> >  	entry->objcg = objcg;
> > -	if (objcg) {
> > -		obj_cgroup_charge_zswap(objcg, entry->length);
> > -		/* Account before objcg ref is moved to tree */
> > -		count_objcg_event(objcg, ZSWPOUT);
> > -	}
> >
> > -	/* map */
> > -	spin_lock(&tree->lock);
> >  	/*
> >  	 * The folio may have been dirtied again, invalidate the
> >  	 * possibly stale entry before inserting the new entry.
> >  	 */
>
> The comment is now somewhat stale and somewhat out of place. It should
> be above that `if (old)` part... See below.

Ack.

>
> > -	if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > -		zswap_invalidate_entry(tree, dupentry);
> > -		WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > +	old = xa_store(tree, offset, entry, GFP_KERNEL);
> > +	if (xa_is_err(old)) {
> > +		int err = xa_err(old);
> > +		if (err == -ENOMEM)
> > +			zswap_reject_alloc_fail++;
> > +		else
> > +			WARN_ONCE(err, "%s: xa_store failed: %d\n",
> > +				  __func__, err);
> > +		goto store_failed;
>
> No need to complicate it. If we have a bug there, an incorrect fail
> stat bump is the least of our concerns. Also, no need for __func__
> since that information is included in the WARN:
>
> 	if (xa_is_err(old)) {
> 		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> 		zswap_reject_alloc_fail++;
> 		goto store_failed;
> 	}

Ah, I see. Thanks for the simplification.

> I think here is where that comment above should go:

Ack.
>
> 	/*
> 	 * We may have had an existing entry that became stale when
> 	 * the folio was redirtied and now the new version is being
> 	 * swapped out. Get rid of the old.
> 	 */
> > +	if (old)
> > +		zswap_entry_free(old);
> > +
> > +	if (objcg) {
> > +		obj_cgroup_charge_zswap(objcg, entry->length);
> > +		/* Account before objcg ref is moved to tree */
> > +		count_objcg_event(objcg, ZSWPOUT);
> >  	}
> > +
> >  	if (entry->length) {
> >  		INIT_LIST_HEAD(&entry->lru);
> >  		zswap_lru_add(&zswap.list_lru, entry);
> >  		atomic_inc(&zswap.nr_stored);
> >  	}
> > -	spin_unlock(&tree->lock);
>
> We previously relied on the tree lock to finish initializing the entry
> while it's already in tree. Now we rely on something else:
>
> 1. Concurrent stores and invalidations are excluded by folio lock.
>
> 2. Writeback is excluded by the entry not being on the LRU yet.
>    The publishing order matters to prevent writeback from seeing
>    an incoherent entry.
>
> I think this deserves a comment.

I will add your 1. and 2. into a comment block. Thanks for the
suggestion.

> >  	/* update stats */
> >  	atomic_inc(&zswap_stored_pages);
> > @@ -1585,6 +1510,12 @@ bool zswap_store(struct folio *folio)
> >
> >  	return true;
> >
> > +store_failed:
> > +	if (!entry->length) {
> > +		atomic_dec(&zswap_same_filled_pages);
> > +		goto freepage;
> > +	}
>
> It'd be good to avoid the nested goto. Why not make the pool
> operations conditional on entry->length instead:
>
> 	store_failed:
> 		if (!entry->length)
> 			atomic_dec(&zswap_same_filled_pages);
> 		else {
> 			zpool_free(zswap_find_zpool(...));
> 	put_pool:
> 			zswap_pool_put(entry->pool);
> 		}
> 	freepage:

Sure, I have one internal version exactly like that. I later changed it
again to get rid of the else. I can use your version as well.

>
> Not super pretty either, but it's a linear flow at least.

Thanks for your suggestions. I will send out a new version; a rough
sketch of the combined changes is in the P.S. below.

Chris
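
P.S. To make sure I understood the suggestions correctly, here is a
rough, untested sketch of how the tail of zswap_store() might look with
them folded in. It assumes the variables and labels from the quoted diff
(tree, offset, entry, old, objcg, store_failed/put_pool/freepage), keeps
the zpool_free() arguments elided as "..." exactly as in your snippet,
and elides the unchanged surrounding code with "...". The actual next
version may differ:

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	if (xa_is_err(old)) {
		int err = xa_err(old);

		/* -ENOMEM is the only error we expect from xa_store() here */
		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
		zswap_reject_alloc_fail++;
		goto store_failed;
	}

	/*
	 * We may have had an existing entry that became stale when
	 * the folio was redirtied and now the new version is being
	 * swapped out. Get rid of the old.
	 */
	if (old)
		zswap_entry_free(old);

	if (objcg) {
		obj_cgroup_charge_zswap(objcg, entry->length);
		/* Account before objcg ref is moved to tree */
		count_objcg_event(objcg, ZSWPOUT);
	}

	/*
	 * The entry is published in the tree before it is fully
	 * initialized:
	 *
	 * 1. Concurrent stores and invalidations are excluded by
	 *    the folio lock.
	 *
	 * 2. Writeback is excluded by the entry not being on the
	 *    LRU yet. The publishing order matters to prevent
	 *    writeback from seeing an incoherent entry.
	 */
	if (entry->length) {
		INIT_LIST_HEAD(&entry->lru);
		zswap_lru_add(&zswap.list_lru, entry);
		atomic_inc(&zswap.nr_stored);
	}

	/* update stats */
	atomic_inc(&zswap_stored_pages);
	...

	return true;

store_failed:
	if (!entry->length)
		atomic_dec(&zswap_same_filled_pages);
	else {
		zpool_free(zswap_find_zpool(...));
put_pool:
		zswap_pool_put(entry->pool);
	}
freepage:
	...

If that is roughly what you had in mind, I will fold it into the next
version along with the rebase.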