[..] > > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry) > > > -{ > > > - rb_erase(&entry->rbnode, root); > > > - RB_CLEAR_NODE(&entry->rbnode); > > > + e = xa_store(tree, offset, entry, GFP_KERNEL); > > > + err = xa_err(e); > > > + > > > + if (err) { > > > + e = xa_erase(tree, offset); > > > + if (err == -ENOMEM) > > > + zswap_reject_alloc_fail++; > > > + else > > > + zswap_reject_xarray_fail++; > > > > I think this is too complicated, and as Chengming pointed out, I believe > > we can use xa_store() directly in zswap_store(). > > Sure. > > > I am also not sure what the need for zswap_reject_xarray_fail is. Are > > there any reasons why the store here can fail other than -ENOMEM? The > > docs say the only other option is -EINVAL, and looking at __xa_store(), > > it seems like this is only possible if xa_is_internal() is true (which > > means we are not passing in a properly aligned pointer IIUC). > > Because the xa_store document said it can return two error codes. I > see zswap try to classify the error count it hit, that is why I add > the zswap_reject_xarray_fail. Right, but I think we should not get -EINVAL in this case. I think it would be more appropriate to have WARN_ON() or VM_WARN_ON() in this case? [..] > > > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page) > > > static int zswap_writeback_entry(struct zswap_entry *entry, > > > swp_entry_t swpentry) > > > { > > > - struct zswap_tree *tree; > > > + struct xarray *tree; > > > + pgoff_t offset = swp_offset(swpentry); > > > + struct zswap_entry *e; > > > struct folio *folio; > > > struct mempolicy *mpol; > > > bool folio_was_allocated; > > > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > * be dereferenced. > > > */ > > > tree = swap_zswap_tree(swpentry); > > > - spin_lock(&tree->lock); > > > - if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) { > > > - spin_unlock(&tree->lock); > > > + e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL); > > > + if (e != entry) { > > > > I think we can avoid adding 'e' and 'offset' local variables here and > > just do everything in the if condition. If you want to avoid the line > > break, then introducing 'offset' is fine, but I don't see any value from > > 'e'. > > As I said in my other email. I don't think having this type of local > variable affects the compiler negatively. The compiler generally uses > their own local variable to track the expression anyway. So I am not > sure about the motivation to remove local variables alone, if it helps > the reading. I feel the line "if (xa_cmpxchg(tree, offset, entry, > NULL, GFP_KERNEL) != entry)" is too long and complicated inside the if > condition. That is just me. Not a big deal. I just think 'e' is not providing any readability improvements. If anything, people need to pay closer attention to figure out 'e' is only a temp variable and 'entry' is the real deal. I vote for: if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) [..] > > > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio) > > > { > > > swp_entry_t swp = folio->swap; > > > pgoff_t offset = swp_offset(swp); > > > - struct zswap_tree *tree = swap_zswap_tree(swp); > > > - struct zswap_entry *entry, *dupentry; > > > + struct xarray *tree = swap_zswap_tree(swp); > > > + struct zswap_entry *entry, *old; > > > struct obj_cgroup *objcg = NULL; > > > struct mem_cgroup *memcg = NULL; > > > + int err; > > > + bool old_erased = false; > > > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > > VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > > > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio) > > > kunmap_local(src); > > > entry->length = 0; > > > entry->value = value; > > > + entry->pool = NULL; > > > > Why do we need to initialize the pool here? Is this is a bug fix for an > > existing problem or just keeping things clean? Either way I think it > > should be done separately, unless it is related to a change in this > > patch. > > I notice the entry->pool will leave uninitialized. I think it should > be cleaned up. It is a clean up, it does not need to happen in this > patch. I can do that as a separate patch. Yes please. [..] > > > > > /* > > > * The folio may have been dirtied again, invalidate the > > > * possibly stale entry before inserting the new entry. > > > */ > > > - if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) { > > > - zswap_invalidate_entry(tree, dupentry); > > > - WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry)); > > > + err = zswap_xa_insert(tree, entry, &old); > > > + if (old) > > > + zswap_entry_free(old); > > > + if (err) { > > > + old_erased = true; > > > > I think this can be made simpler if we open code xa_store() here, > > especially that we already have cleanup code below under 'check_old' > > that removes the exisitng old entry. So zswap_xa_insert() replicates > > this cleanup, then we add this 'old_erased' boolean to avoid doing the > > cleanup below. It seems like it would much more straightforward with > > open-coding xa_store() here and relying on the existing cleanup for the > > old entry. Also, if we initialize 'old' to NULL, we can use its value > > to figure out whether any cleanup is needed under 'check_old' or not. > > I think that is very similar to what Chengming was suggesting. > > > > > Taking a step back, I think we can further simplify this. What if we > > move the tree insertion to right after we allocate the zswap entry? In > > this case, if the tree insertion fails, we don't need to decrement the > > same filled counter. If the tree insertion succeeds and then something > > else fails, the existing cleanup code under 'check_old' will already > > clean up the tree insertion for us. > > That will create complications that, if the zswap compression fails > the compression ratio, you will have to remove the entry from the tree > as clean up. You have both xa_store() and xa_erase() where the current > code just does one xa_erase() on compression failure. Not really. If xa_store() fails because of -ENOMEM, then I think by definition we do not need xa_erase() as there shouldn't be any stale entries. I also think -ENOMEM should be the only valid errno from xa_store() in this context. So we can avoid the check_old code if xa_store() is called (whether it fails or succeeds) IIUC. I prefer calling xa_store() entry and avoiding the extra 'insert_failed' cleanup code, especially that unlike other cleanup code, it has its own branching based on entry->length. I am also planning a cleanup for zswap_store() to split the code better for the same_filled case and avoid some unnecessary checks and failures, so it would be useful to keep the common code path together. > > > > > If this works, we don't need to add extra cleanup code or move any code > > around. Something like: > > Due to the extra xa_insert() on compression failure, I think > Chengming's or your earlier suggestion is better. > > BTW, while you are here, can you confirm this race discussed in > earlier email can't happen? Chengming convinced me this shouldn't > happen. Like to hear your thoughts. > > CPU1 CPU2 > > xa_store() > entry = xa_erase() > zswap_free_entry(entry) > > if (entry->length) > ... > CPU1 is using entry after free. IIUC, CPU1 is in zswap_store(), CPU2 could either in zswap_invalidate() or zswap_load(). For zswap_load(), I think synchronization is done in the core swap code ensure we are not doing parallel swapin/swapout at the same entry, right? In this specific case, I think the folio would be in the swapcache while swapout (i.e. zswap_store()) is ongoing, so any swapins will read the folio and not call zswap_load(). Actually, if we do not prevent parallel swapin/swapou at the same entry, I suspect we may have problems even outside of zswap. For example, we may read a partially written swap entry from disk, right? Or does the block layer synchronize this somehow? For zswap_invalidate(), the core swap code calls it when the swap entry is no longer used and before we free it for reuse, so IIUC parallel swapouts (i.e. zswap_store()) should not be possible here as well.