Hi Nhat - thanks for the review!

On 27/10/2023 19:49, Nhat Pham wrote:
> On Thu, Oct 19, 2023 at 4:06 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> Previously zswap would refuse to store any folio bigger than order-0,
>> and therefore all of those folios would be sent directly to the swap
>> file. This is a minor inconvenience since swap can currently only
>> support order-0 and PMD-sized THP, but with the pending introduction of
>> "small-sized THP", and corresponding changes to swapfile to support any
>> order of folio, these large folios will become more prevalent and
>> without this zswap change, zswap will become unusable. Independently of
>> the "small-sized THP" feature, this change makes it possible to store
>> existing PMD-sized THPs in zswap.
>>
>> Modify zswap_store() to allow storing large folios. The function is
>> split into 2 parts; zswap_store() does all the per-folio operations
>> (i.e. checking there is enough space, etc). Then it calls a new helper,
>> zswap_store_page(), for each page in the folio, which are stored as
>> their own entries in the zswap pool. (These entries continue to be
>> loaded back individually as single pages). If a store fails for any
>> single page, then all previously successfully stored folio pages are
>> invalidated.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>> I've tested this on arm64 (m2) with zswap enabled, and running
>> vm-scalability's `usemem` across multiple cores from within a
>> memory-constrained memcg to force high volumes of swap. I've also run mm
>> selftests and observe no regressions (although there is nothing [z]swap
>> specific there) - does zswap have any specific tests I should run?
>
> There is a zswap kselftest in the cgroup suite:
> https://lore.kernel.org/all/20230621153548.428093-1-cerasuolodomenico@xxxxxxxxx/

Ahh great - I'll run that against future versions.

>
> Regardless, I feel like this kind of change is probably better to be tested
> via stress tests anyway - allocating a bunch of anon memory with a certain
> pattern, making sure they're not corrupted after being zswapped out etc.

Yes - that's exactly what the `usemem` test I described above is doing (and
it's not reporting any corruption).

>
>
>>
>> This is based on mm-stable, since mm-unstable contains a zswap patch
>> known to be buggy [1]. I thought it would be best to get comments on the
>> shape, then do the rebase after that patch has been fixed.
>>
>> For context, small-sized THP is being discussed here [2], and I'm
>> working on changes to swapfile to support non-PMD-sized large folios
>> here [3].
>>
>> [1] https://lore.kernel.org/linux-mm/21606fe5-fb9b-4d37-98ab-38c96819893b@xxxxxxx/
>> [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@xxxxxxx/
>> [3] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@xxxxxxx/
>>
>> Thanks,
>> Ryan
>>
>>
>> mm/zswap.c | 155 +++++++++++++++++++++++++++++++++--------------------
>> 1 file changed, 98 insertions(+), 57 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 37d2b1cb2ecb..51cbfc4e1ef8 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1188,18 +1188,17 @@ static void zswap_fill_page(void *ptr, unsigned long value)
>> memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
>> }
>>
>> -bool zswap_store(struct folio *folio)
>> +static bool zswap_store_page(struct folio *folio, long index,
>> + struct obj_cgroup *objcg, struct zswap_pool *pool)
>> {
>> swp_entry_t swp = folio->swap;
>> int type = swp_type(swp);
>> - pgoff_t offset = swp_offset(swp);
>> - struct page *page = &folio->page;
>> + pgoff_t offset = swp_offset(swp) + index;
>> + struct page *page = folio_page(folio, index);
>> struct zswap_tree *tree = zswap_trees[type];
>> struct zswap_entry *entry, *dupentry;
>> struct scatterlist input, output;
>> struct crypto_acomp_ctx *acomp_ctx;
>> - struct obj_cgroup *objcg = NULL;
>> - struct zswap_pool *pool;
>> struct zpool *zpool;
>> unsigned int dlen = PAGE_SIZE;
>> unsigned long handle, value;
>> @@ -1208,51 +1207,11 @@ bool zswap_store(struct folio *folio)
>> gfp_t gfp;
>> int ret;
>>
>> - VM_WARN_ON_ONCE(!folio_test_locked(folio));
>> - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>> -
>> - /* Large folios aren't supported */
>> - if (folio_test_large(folio))
>> - return false;
>> -
>> - if (!zswap_enabled || !tree)
>> + /* entry keeps the references if successfully stored. */
>> + if (!zswap_pool_get(pool))
>> return false;
>> -
>> - /*
>> - * If this is a duplicate, it must be removed before attempting to store
>> - * it, otherwise, if the store fails the old page won't be removed from
>> - * the tree, and it might be written back overriding the new data.
>> - */
>> - spin_lock(&tree->lock);
>> - dupentry = zswap_rb_search(&tree->rbroot, offset);
>> - if (dupentry) {
>> - zswap_duplicate_entry++;
>> - zswap_invalidate_entry(tree, dupentry);
>> - }
>> - spin_unlock(&tree->lock);
>> -
>> - /*
>> - * XXX: zswap reclaim does not work with cgroups yet. Without a
>> - * cgroup-aware entry LRU, we will push out entries system-wide based on
>> - * local cgroup limits.
>> - */
>> - objcg = get_obj_cgroup_from_folio(folio);
>> - if (objcg && !obj_cgroup_may_zswap(objcg))
>> - goto reject;
>> -
>> - /* reclaim space if needed */
>> - if (zswap_is_full()) {
>> - zswap_pool_limit_hit++;
>> - zswap_pool_reached_full = true;
>> - goto shrink;
>> - }
>> -
>> - if (zswap_pool_reached_full) {
>> - if (!zswap_can_accept())
>> - goto shrink;
>> - else
>> - zswap_pool_reached_full = false;
>> - }
>> + if (objcg)
>> + obj_cgroup_get(objcg);
>>
>> /* allocate entry */
>> entry = zswap_entry_cache_alloc(GFP_KERNEL);
>> @@ -1260,6 +1219,8 @@ bool zswap_store(struct folio *folio)
>> zswap_reject_kmemcache_fail++;
>> goto reject;
>> }
>> + entry->objcg = objcg;
>> + entry->pool = pool;
>>
>> if (zswap_same_filled_pages_enabled) {
>> src = kmap_atomic(page);
>> @@ -1277,11 -1238,6 @@ bool zswap_store(struct folio *folio)
>> if (!zswap_non_same_filled_pages_enabled)
>> goto freepage;
>>
>> - /* if entry is successfully added, it keeps the reference */
>> - entry->pool = zswap_pool_current_get();
>> - if (!entry->pool)
>> - goto freepage;
>> -
>> /* compress */
>> acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>>
>> @@ -1337,7 +1293,6 @@ bool zswap_store(struct folio *folio)
>> entry->length = dlen;
>>
>> insert_entry:
>> - entry->objcg = objcg;
>> if (objcg) {
>> obj_cgroup_charge_zswap(objcg, entry->length);
>> /* Account before objcg ref is moved to tree */
>> @@ -1373,19 +1328,105 @@ bool zswap_store(struct folio *folio)
>>
>> put_dstmem:
>> mutex_unlock(acomp_ctx->mutex);
>> - zswap_pool_put(entry->pool);
>> freepage:
>> zswap_entry_cache_free(entry);
>> reject:
>> if (objcg)
>> obj_cgroup_put(objcg);
>> + zswap_pool_put(pool);
>> return false;
>> +}
>>
>> +bool zswap_store(struct folio *folio)
>> +{
>> + long nr_pages = folio_nr_pages(folio);
>> + swp_entry_t swp = folio->swap;
>> + int type = swp_type(swp);
>> + pgoff_t offset = swp_offset(swp);
>> + struct zswap_tree *tree = zswap_trees[type];
>> + struct zswap_entry *entry;
>> + struct obj_cgroup *objcg = NULL;
>> + struct zswap_pool *pool;
>> + bool ret = false;
>> + long i;
>> +
>> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
>> + VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>> +
>> + if (!zswap_enabled || !tree)
>> + return false;
>> +
>> + /*
>> + * If this is a duplicate, it must be removed before attempting to store
>> + * it, otherwise, if the store fails the old page won't be removed from
>> + * the tree, and it might be written back overriding the new data.
>> + */
>> + spin_lock(&tree->lock);
>> + for (i = 0; i < nr_pages; i++) {
>> + entry = zswap_rb_search(&tree->rbroot, offset + i);
>> + if (entry) {
>> + zswap_duplicate_entry++;
>> + zswap_invalidate_entry(tree, entry);
>> + }
>> + }
>> + spin_unlock(&tree->lock);
>> +
>> + /*
>> + * XXX: zswap reclaim does not work with cgroups yet. Without a
>> + * cgroup-aware entry LRU, we will push out entries system-wide based on
>> + * local cgroup limits.
>> + */
>> + objcg = get_obj_cgroup_from_folio(folio);
>> + if (objcg && !obj_cgroup_may_zswap(objcg))
>> + goto put_objcg;
>
> Hmm would this make more sense to check these limits
> at a higher order page level or at a backing page (4 KB)
> level?
>
> What if the cgroup still has some space before the new
> folio comes in, but the new folio would drive the pool size
> beyond the limit? Ditto for global zswap pool limit.

I did consider this, but I thought I would keep it simple for the RFC and
accept that we may go over the limits by (HPAGE_PMD_NR - 1) pages.

It sounds like you don't think that will be acceptable. I see 2 options:

  - move this checking logic to be per-page in zswap_store_page()

  - pass a size (or nr_pages or order) to obj_cgroup_may_zswap(),
    zswap_is_full() and zswap_can_accept() so they know how much we are
    proposing to add and can make a correct decision.

Personally I prefer the second option for efficiency.
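
Something like this, perhaps (a completely untested sketch just to show the
shape of option 2 - the new nr_pages parameter is hypothetical and the
arithmetic only roughly mirrors the existing helper):

/*
 * Sketch only: make the acceptance check aware of how many pages the
 * caller is about to add, rather than implicitly assuming a single page.
 */
static bool zswap_can_accept(unsigned long nr_pages)
{
	unsigned long thr_pages = zswap_accept_thr_percent *
				  zswap_max_pool_percent / 10000 *
				  totalram_pages();
	unsigned long cur_pages = DIV_ROUND_UP(zswap_pool_total_size,
					       PAGE_SIZE);

	/* Accept only if adding nr_pages stays within the threshold. */
	return cur_pages + nr_pages <= thr_pages;
}

Then zswap_store() would check once, up front, for the whole folio:

	if (zswap_pool_reached_full) {
		if (!zswap_can_accept(folio_nr_pages(folio)))
			goto shrink;
		else
			zswap_pool_reached_full = false;
	}

obj_cgroup_may_zswap() and zswap_is_full() could take an equivalent argument
so the cgroup and global limits are also checked against the whole folio.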
>
> Previously, the object size is capped by the size of
> a page (since we don't accept bigger size pages into
> zswap). If zswap limit is exceded, it will not be exceeded
> by more than 4 KB. No big deal. But I'm not sure
> this will be OK as we move to bigger and bigger
> sizes for the pages...
>
> If we do decide to check the cgroup's zswap limit for
> each backing page, I'm not sure how the reclaim logic
> (which will be introduced in the cgroup-aware zswap
> LRU patch) will interact with this though.

Ok so this points us in the direction of my option 2 above as well?

>
>> +
>> + /* reclaim space if needed */
>> + if (zswap_is_full()) {
>> + zswap_pool_limit_hit++;
>> + zswap_pool_reached_full = true;
>> + goto shrink;
>> + }
>> +
>> + if (zswap_pool_reached_full) {
>> + if (!zswap_can_accept())
>> + goto shrink;
>> + else
>> + zswap_pool_reached_full = false;
>> + }
>> +
>> + pool = zswap_pool_current_get();
>> + if (!pool)
>> + goto put_objcg;
>> +
>> + /*
>> + * Store each page of the folio as a separate entry. If we fail to store
>> + * a page, unwind by removing all the previous pages we stored.
>> + */
>> + for (i = 0; i < nr_pages; i++) {
>> + if (!zswap_store_page(folio, i, objcg, pool)) {
>> + spin_lock(&tree->lock);
>> + for (i--; i >= 0; i--) {
>> + entry = zswap_rb_search(&tree->rbroot, offset + i);
>> + if (entry)
>> + zswap_invalidate_entry(tree, entry);
>> + }
>> + spin_unlock(&tree->lock);
>> + goto put_pool;
>> + }
>> + }
>> +
>> + ret = true;
>> +put_pool:
>> + zswap_pool_put(pool);
>> +put_objcg:
>> + if (objcg)
>> + obj_cgroup_put(objcg);
>> + return ret;
>> shrink:
>> pool = zswap_pool_last_get();
>> if (pool && !queue_work(shrink_wq, &pool->shrink_work))
>> zswap_pool_put(pool);
>> - goto reject;
>> + goto put_objcg;
>> }
>>
>> bool zswap_load(struct folio *folio)
>>
>> base-commit: 158978945f3173b8c1a88f8c5684a629736a57ac
>> --
>> 2.25.1
>>
>>
>
> I don't see anything else that is obviously wrong with this.
> Seems straightforward to me.
>
> But I'm not too super familiar with THP swapping
> logic either :) So maybe someone with exposure to both
> should take a look too.
>
> And would it make sense to introduce a gate that guard
> this so that users can opt-in/opt-out of this new feature,
> at least as we experiment more with it to get more
> data?

There is already a gate at a higher level: CONFIG_THP_SWAP. If that is not
enabled, then all large folios will be split to single pages before a swap
entry is even allocated. I considered adding a separate gate for this, but
given the above control, I didn't think the zswap-specific control was really
necessary. What do you think?

Thanks,
Ryan