> -----Original Message-----
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Sent: Thursday, November 7, 2024 10:54 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; yosryahmed@xxxxxxxxxx;
> nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx;
> Accardi, Kristen C <kristen.c.accardi@xxxxxxxxx>; zanussi@xxxxxxxxxx;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA
> in zswap_store() of large folios.
>
> On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> > +static void zswap_zpool_store_sub_batch(
> > +	struct zswap_store_pipeline_state *zst)
>
> There is a zswap_store_sub_batch() below, which does something
> completely different. Naming is hard, but please invest a bit more
> time into this to make this readable.

Thanks, Johannes, for the comments. Yes, I agree the naming could be
better.

>
> > +{
> > +	u8 i;
> > +
> > +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
> > +		struct zpool *zpool;
> > +		unsigned long handle;
> > +		char *buf;
> > +		gfp_t gfp;
> > +		int err;
> > +
> > +		/* Skip pages that had compress errors. */
> > +		if (sbp->error)
> > +			continue;
> > +
> > +		zpool = zst->pool->zpool;
> > +		gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > +		if (zpool_malloc_support_movable(zpool))
> > +			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > +		err = zpool_malloc(zpool, zst->comp_dlens[i], gfp, &handle);
> > +
> > +		if (err) {
> > +			if (err == -ENOSPC)
> > +				zswap_reject_compress_poor++;
> > +			else
> > +				zswap_reject_alloc_fail++;
> > +
> > +			/*
> > +			 * An error should be propagated to other pages of the
> > +			 * same folio in the sub-batch, and zpool resources for
> > +			 * those pages (in sub-batch order prior to this zpool
> > +			 * error) should be de-allocated.
> > +			 */
> > +			zswap_store_propagate_errors(zst, sbp->batch_idx);
> > +			continue;
> > +		}
> > +
> > +		buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > +		memcpy(buf, zst->comp_dsts[i], zst->comp_dlens[i]);
> > +		zpool_unmap_handle(zpool, handle);
> > +
> > +		sbp->entry->handle = handle;
> > +		sbp->entry->length = zst->comp_dlens[i];
> > +	}
> > +}
> > +
> > +/*
> > + * Returns true if the entry was successfully
> > + * stored in the xarray, and false otherwise.
> > + */
> > +static bool zswap_store_entry(swp_entry_t page_swpentry,
> > +			      struct zswap_entry *entry)
> > +{
> > +	struct zswap_entry *old = xa_store(swap_zswap_tree(page_swpentry),
> > +					   swp_offset(page_swpentry),
> > +					   entry, GFP_KERNEL);
> > +	if (xa_is_err(old)) {
> > +		int err = xa_err(old);
> > +
> > +		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > +		zswap_reject_alloc_fail++;
> > +		return false;
> > +	}
> > +
> > +	/*
> > +	 * We may have had an existing entry that became stale when
> > +	 * the folio was redirtied and now the new version is being
> > +	 * swapped out. Get rid of the old.
> > +	 */
> > +	if (old)
> > +		zswap_entry_free(old);
> > +
> > +	return true;
> > +}
> > +
> > +static void zswap_batch_compress_post_proc(
> > +	struct zswap_store_pipeline_state *zst)
> > +{
> > +	int nr_objcg_pages = 0, nr_pages = 0;
> > +	struct obj_cgroup *objcg = NULL;
> > +	size_t compressed_bytes = 0;
> > +	u8 i;
> > +
> > +	zswap_zpool_store_sub_batch(zst);
> > +
> > +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
> > +
> > +		if (sbp->error)
> > +			continue;
> > +
> > +		if (!zswap_store_entry(sbp->swpentry, sbp->entry)) {
> > +			zswap_store_propagate_errors(zst, sbp->batch_idx);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * The entry is successfully compressed and stored in the tree,
> > +		 * there is no further possibility of failure. Grab refs to the
> > +		 * pool and objcg. These refs will be dropped by
> > +		 * zswap_entry_free() when the entry is removed from the tree.
> > +		 */
> > +		zswap_pool_get(zst->pool);
> > +		if (sbp->objcg)
> > +			obj_cgroup_get(sbp->objcg);
> > +
> > +		/*
> > +		 * We finish initializing the entry while it's already in xarray.
> > +		 * This is safe because:
> > +		 *
> > +		 * 1. Concurrent stores and invalidations are excluded by folio
> > +		 *    lock.
> > +		 *
> > +		 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +		 *    The publishing order matters to prevent writeback from seeing
> > +		 *    an incoherent entry.
> > +		 */
> > +		sbp->entry->pool = zst->pool;
> > +		sbp->entry->swpentry = sbp->swpentry;
> > +		sbp->entry->objcg = sbp->objcg;
> > +		sbp->entry->referenced = true;
> > +		if (sbp->entry->length) {
> > +			INIT_LIST_HEAD(&sbp->entry->lru);
> > +			zswap_lru_add(&zswap_list_lru, sbp->entry);
> > +		}
> > +
> > +		if (!objcg && sbp->objcg) {
> > +			objcg = sbp->objcg;
> > +		} else if (objcg && sbp->objcg && (objcg != sbp->objcg)) {
> > +			obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +			count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> > +			compressed_bytes = 0;
> > +			nr_objcg_pages = 0;
> > +			objcg = sbp->objcg;
> > +		}
> > +
> > +		if (sbp->objcg) {
> > +			compressed_bytes += sbp->entry->length;
> > +			++nr_objcg_pages;
> > +		}
> > +
> > +		++nr_pages;
> > +	} /* for sub-batch pages. */
> > +
> > +	if (objcg) {
> > +		obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +		count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> > +	}
> > +
> > +	atomic_long_add(nr_pages, &zswap_stored_pages);
> > +	count_vm_events(ZSWPOUT, nr_pages);
> > +}
> > +
> > +static void zswap_store_sub_batch(struct zswap_store_pipeline_state *zst)
> > +{
> > +	u8 i;
> > +
> > +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +		zst->comp_dsts[i] = zst->acomp_ctx->buffers[i];
> > +		zst->comp_dlens[i] = PAGE_SIZE;
> > +	} /* for sub-batch pages. */
> > +
> > +	/*
> > +	 * Batch compress sub-batch "N". If IAA is the compressor, the
> > +	 * hardware will compress multiple pages in parallel.
> > +	 */
> > +	zswap_compress_batch(zst);
> > +
> > +	zswap_batch_compress_post_proc(zst);
>
> The control flow here is a mess. Keep loops over the same batch at the
> same function level. IOW, pull the nr_comp_pages loop out of
> zswap_batch_compress_post_proc() and call the function from the loop.

I see. Got it.

>
> Also give it a more descriptive name. If that's hard to do, then
> you're probably doing too many different things in it. Create
> functions for a specific purpose, don't carve up sequences at
> arbitrary points.
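
To make sure I understand the suggestion, is something along these lines
what you have in mind? (zswap_store_compressed_page() is just a
placeholder name, and the per-objcg charging and the ZSWPOUT/stored-pages
accounting are elided; they would either move up to this level or be done
per page.)

/*
 * Publish one compressed page of the sub-batch: store the entry in the
 * xarray, then take refs, initialize the entry, add it to the LRU and
 * account it, as in the current loop body. Placeholder name.
 */
static void zswap_store_compressed_page(struct zswap_store_pipeline_state *zst,
					u8 i)
{
	struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];

	/* Skip pages that failed earlier in the pipeline. */
	if (sbp->error)
		return;

	if (!zswap_store_entry(sbp->swpentry, sbp->entry)) {
		zswap_store_propagate_errors(zst, sbp->batch_idx);
		return;
	}

	/* ... pool/objcg refs, entry init, LRU add, stats ... */
}

static void zswap_store_sub_batch(struct zswap_store_pipeline_state *zst)
{
	u8 i;

	for (i = 0; i < zst->nr_comp_pages; ++i) {
		zst->comp_dsts[i] = zst->acomp_ctx->buffers[i];
		zst->comp_dlens[i] = PAGE_SIZE;
	}

	/* Batch compress; IAA compresses the pages in parallel. */
	zswap_compress_batch(zst);

	/* Write the compressed pages into the zpool. */
	zswap_zpool_store_sub_batch(zst);

	/* Per-page publishing, with the loop at this function level. */
	for (i = 0; i < zst->nr_comp_pages; ++i)
		zswap_store_compressed_page(zst, i);
}

If this matches your intent, I will restructure the other sub-batch
helpers the same way, with clearer names.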
>
> My impression after trying to read this is that the existing
> zswap_store() sequence could be a subset of the batched store, where
> you can reuse most code to get the pool, charge the cgroup, allocate
> entries, store entries, bump the stats etc. for both cases. Alas, your
> naming choices make it a bit difficult to be sure.

Apologies for the naming choices. I will fix this.

As I was trying to explain in the commit log, my goal was to minimize
failure points, since each failure point requires unwinding state, which
adds latency. Towards this goal, I tried to allocate all entries upfront
and fail early, to avoid unwinding state, especially since the upfront
work done for the batch is needed in any case (e.g.
zswap_alloc_entries()). This is where the trade-offs of treating the
existing zswap_store() sequence as a subset of the batched store are not
very clear to me. I tried to optimize the batched store for batching,
while following the logical structure of zswap_store().

>
> Please explore this direction. Don't worry about the CONFIG symbol for
> now, we can still look at this later.

Definitely, I will think some more about this.

>
> Right now, it's basically
>
> 	if (special case)
> 		lots of duplicative code in slightly different order
> 	regular store sequence
>
> and that isn't going to be maintainable.
>
> Look for a high-level sequence that makes sense for both cases. E.g.:
>
> 	if (!zswap_enabled)
> 		goto check_old;
>
> 	get objcg
>
> 	check limits
>
> 	allocate memcg list lru
>
> 	for each batch {
> 		for each entry {
> 			allocate entry
> 			acquire objcg ref
> 			acquire pool ref
> 		}
> 		compress
> 		for each entry {
> 			store in tree
> 			add to lru
> 			bump stats and counters
> 		}
> 	}
>
> 	put objcg
>
> 	return true;
>
> check_error:
> 	...
>
> and then set up the two loops such that they also make sense when the
> folio is just a single page.

Thanks for this suggestion! I will explore this kind of structure; a
rough sketch of how I am currently reading it is below, after my
signature. I hope I have explained why I pursued the existing batching
structure.

One other thing I wanted to add is the "future proofing" you alluded to
earlier (which I will fix). Many of my design choices were motivated by
minimizing latency gaps (e.g. from unwinding state on errors) in the
batch compression pipeline when a reclaim batch of any-order folios is
potentially sent to zswap.

Thanks again,
Kanchana
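
P.S.: Here is a very rough skeleton of how I am currently reading the
sequence you outlined, mapped onto zswap_store(). ZSWAP_MAX_BATCH_SIZE
is a stand-in for however the sub-batch size ends up being defined, and
the per-step details and error/unwind paths are only sketched as
comments:

bool zswap_store(struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);
	struct obj_cgroup *objcg = NULL;
	bool ret = false;
	long start;

	if (!zswap_enabled)
		goto check_old;

	/* Get the objcg, check cgroup/global limits, charge the list_lru memcg. */

	for (start = 0; start < nr_pages; start += ZSWAP_MAX_BATCH_SIZE) {
		/*
		 * Loop 1 over the sub-batch: allocate entries, acquire
		 * objcg and pool refs; on failure, unwind and bail out.
		 */

		/*
		 * Compress the sub-batch: page-at-a-time for software
		 * compressors, in parallel when the compressor (e.g. IAA)
		 * supports batching.
		 */

		/*
		 * Loop 2 over the sub-batch: store the entries in the
		 * xarray, add them to the LRU, bump stats and counters.
		 */
	}

	ret = true;

put_objcg:
	if (objcg)
		obj_cgroup_put(objcg);
	return ret;

check_old:
	/* Drop any stale entries left by a previous store of this folio. */
	goto put_objcg;
}

The same sequence would then also cover a single-page folio: it is just
one sub-batch containing one page.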