> -----Original Message-----
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Sent: Thursday, November 7, 2024 10:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx;
> Accardi, Kristen C <kristen.c.accardi@xxxxxxxxx>; zanussi@xxxxxxxxxx;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA
> in zswap_store() of large folios.
>
> On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> > If the system has Intel IAA, and if sysctl vm.compress-batching is set
> > to "1", zswap_store() will call crypto_acomp_batch_compress() to
> > compress up to SWAP_CRYPTO_BATCH_SIZE (i.e., 8) pages of large folios
> > in parallel, using the multiple compress engines available in IAA
> > hardware.
> >
> > On platforms with multiple IAA devices per socket, compress jobs from
> > all cores in a socket will be distributed among all IAA devices on the
> > socket by the iaa_crypto driver.
> >
> > With deflate-iaa configured as the zswap compressor and sysctl
> > vm.compress-batching enabled, the first time zswap_store() has to swap
> > out a large folio on any given cpu, it will allocate the
> > pool->acomp_batch_ctx resources on that cpu and set
> > pool->can_batch_comp to BATCH_COMP_ENABLED. It will then proceed to
> > call the main __zswap_store_batch_core() compress batching function.
> > Subsequent calls to zswap_store() on the same cpu will use the
> > acomp_batch_ctx after checking the pool->can_batch_comp status.
> >
> > Hence, we allocate the per-cpu pool->acomp_batch_ctx resources only on
> > an as-needed basis, to reduce the memory footprint. The cost is not
> > incurred on cores that never swap out a large folio.
> >
> > This patch introduces the main __zswap_store_batch_core() function for
> > compress batching. This interface represents the extensible compress
> > batching architecture that can potentially be called with a batch of
> > any-order folios from shrink_folio_list(). In other words, although
> > zswap_store() calls __zswap_store_batch_core() with exactly one large
> > folio in this patch, we can reuse this interface to reclaim a batch of
> > folios and significantly improve reclaim path efficiency thanks to
> > IAA's parallel compression capability.
> >
> > The newly added functions that implement batched stores follow the
> > general structure of zswap_store() of a large folio. Some restructuring
> > and optimization is done to minimize failure points for a batch, fail
> > early, and maximize zswap store pipeline occupancy with
> > SWAP_CRYPTO_BATCH_SIZE pages, potentially from multiple folios. This is
> > intended to maximize reclaim throughput with the IAA hardware's
> > parallel compressions.
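
To make the description above concrete: condensed from the patch hunk
further below, the intended zswap_store() control flow is roughly the
following (a sketch only, not the exact patch code; error reporting
trimmed):

	if (folio_test_large(folio) && READ_ONCE(compress_batching)) {
		pool = zswap_pool_current_get();

		/*
		 * zswap_pool_can_batch() allocates this cpu's
		 * pool->acomp_batch_ctx on first use and sets
		 * pool->can_batch_comp to BATCH_COMP_ENABLED;
		 * subsequent calls on this cpu only check the flag.
		 */
		if (pool && zswap_pool_can_batch(pool)) {
			int error = -1;

			/* Compresses up to SWAP_CRYPTO_BATCH_SIZE pages in parallel. */
			if (__zswap_store_batch_core(folio_nid(folio),
						     &folio, &error, 1)) {
				zswap_pool_put(pool);
				return !error;
			}
		}
		if (pool)
			zswap_pool_put(pool);
	}
	/* Otherwise, fall back to the sequential per-page store path. */
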
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > ---
> >  include/linux/zswap.h |  84 ++++++
> >  mm/zswap.c            | 625 ++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 709 insertions(+)
> >
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 9ad27ab3d222..6d3ef4780c69 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -31,6 +31,88 @@ struct zswap_lruvec_state {
> >  	atomic_long_t nr_disk_swapins;
> >  };
> >
> > +/*
> > + * struct zswap_store_sub_batch_page:
> > + *
> > + * This represents one "zswap batching element", namely, the
> > + * attributes associated with a page in a large folio that will
> > + * be compressed and stored in zswap. The term "batch" is reserved
> > + * for a conceptual "batch" of folios that can be sent to
> > + * zswap_store() by reclaim. The term "sub-batch" is used to describe
> > + * a collection of "zswap batching elements", i.e., an array of
> > + * "struct zswap_store_sub_batch_page *".
> > + *
> > + * The zswap compress sub-batch size is specified by
> > + * SWAP_CRYPTO_BATCH_SIZE, currently set to 8UL if the
> > + * platform has Intel IAA. This means zswap can store a large folio
> > + * by creating sub-batches of up to 8 pages and compressing each
> > + * sub-batch using IAA to parallelize the 8 compress jobs in hardware.
> > + * For example, a 64KB folio can be compressed as 2 sub-batches of
> > + * 8 pages each. This can significantly improve the zswap_store()
> > + * performance for large folios.
> > + *
> > + * Although the page itself is represented directly, the structure
> > + * adds a "u8 batch_idx" to represent an index for the folio in a
> > + * conceptual "batch of folios" that can be passed to zswap_store().
> > + * Conceptually, this allows for up to 256 folios to be passed to
> > + * zswap_store(). If the number of folios sent to zswap_store()
> > + * ever exceeds 256, the "batch_idx" needs to become u16.
> > + */
> > +struct zswap_store_sub_batch_page {
> > +	u8 batch_idx;
> > +	swp_entry_t swpentry;
> > +	struct obj_cgroup *objcg;
> > +	struct zswap_entry *entry;
> > +	int error; /* folio error status. */
> > +};
> > +
> > +/*
> > + * struct zswap_store_pipeline_state:
> > + *
> > + * This stores state during IAA compress batching of (conceptually, a
> > + * batch of) folios. The term "pipelining" in this context refers to
> > + * breaking down the batch of folios being reclaimed into sub-batches
> > + * of SWAP_CRYPTO_BATCH_SIZE pages, then batch-compressing and storing
> > + * each sub-batch. This concept could be further evolved to overlap
> > + * CPU computes with IAA computes. For instance, we could stage the
> > + * post-compress computes for sub-batch "N-1" to happen in parallel
> > + * with IAA batch compression of sub-batch "N".
> > + *
> > + * We begin by developing the concept of compress batching. Pipelining
> > + * with overlap can be future work.
> > + *
> > + * @errors: The error status for the batch of reclaim folios passed in
> > + *    from a higher mm layer such as swap_writepage().
> > + * @pool: A valid zswap_pool.
> > + * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for @pool.
> > + * @sub_batch: An array that represents the sub-batch of up to
> > + *    SWAP_CRYPTO_BATCH_SIZE pages that are being stored in zswap.
> > + * @comp_dsts: The destination buffers for crypto_acomp_compress() for
> > + *    each page being compressed.
> > + * @comp_dlens: The destination buffers' lengths from
> > + *    crypto_acomp_compress(), obtained after crypto_acomp_poll()
> > + *    returns completion status, for each page being compressed.
> > + * @comp_errors: Compression errors for each page being compressed.
> > + * @nr_comp_pages: Total number of pages in @sub_batch.
> > + *
> > + * Note:
> > + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> > + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, some of the
> > + * u8 members (except @comp_dsts) need to become u16.
> > + */
> > +struct zswap_store_pipeline_state {
> > +	int *errors;
> > +	struct zswap_pool *pool;
> > +	struct crypto_acomp_ctx *acomp_ctx;
> > +	struct zswap_store_sub_batch_page *sub_batch;
> > +	struct page **comp_pages;
> > +	u8 **comp_dsts;
> > +	unsigned int *comp_dlens;
> > +	int *comp_errors;
> > +	u8 nr_comp_pages;
> > +};
>
> Why are these in the public header?

Thanks, Johannes, for the detailed code review comments! Yes, these don't
need to be in the public header. I will move them to zswap.c.

> >  unsigned long zswap_total_pages(void);
> >  bool zswap_store(struct folio *folio);
> >  bool zswap_load(struct folio *folio);
> > @@ -45,6 +127,8 @@ bool zswap_never_enabled(void);
> >  #else
> >
> >  struct zswap_lruvec_state {};
> > +struct zswap_store_sub_batch_page {};
> > +struct zswap_store_pipeline_state {};
> >
> >  static inline bool zswap_store(struct folio *folio)
> >  {
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2af736e38213..538aac3fb552 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -255,6 +255,12 @@ static int zswap_create_acomp_ctx(unsigned int cpu,
> >  				  char *tfm_name,
> >  				  unsigned int nr_reqs);
> >
> > +static bool __zswap_store_batch_core(
> > +	int node_id,
> > +	struct folio **folios,
> > +	int *errors,
> > +	unsigned int nr_folios);
> > +
>
> Please reorder the functions to avoid forward decls.

Sure.

> >  /*********************************
> >  * pool functions
> >  **********************************/
> > @@ -1626,6 +1632,12 @@ static ssize_t zswap_store_page(struct page *page,
> >  	return -EINVAL;
> >  }
> >
> > +/*
> > + * Modified to use the IAA compress batching framework implemented in
> > + * __zswap_store_batch_core() if sysctl vm.compress-batching is 1.
> > + * The batching code is intended to significantly improve folio store
> > + * performance over the sequential code.
>
> This isn't helpful, please delete.

Ok.

> >  bool zswap_store(struct folio *folio)
> >  {
> >  	long nr_pages = folio_nr_pages(folio);
> > @@ -1638,6 +1650,38 @@ bool zswap_store(struct folio *folio)
> >  	bool ret = false;
> >  	long index;
> >
> > +	/*
> > +	 * Improve large folio zswap_store() latency with IAA compress
> > +	 * batching, if this is enabled by setting sysctl
> > +	 * vm.compress-batching to "1". If enabled, the large folio's
> > +	 * pages are compressed in parallel in batches of
> > +	 * SWAP_CRYPTO_BATCH_SIZE pages. If disabled, every page in the
> > +	 * large folio is compressed sequentially.
> > +	 */
>
> Same here. Reduce to "Try to batch compress large folios, fall back to
> processing individual subpages if that fails."

Ok.

> > +	if (folio_test_large(folio) && READ_ONCE(compress_batching)) {
> > +		pool = zswap_pool_current_get();
>
> There is an existing zswap_pool_current_get() in zswap_store(), please
> reorder the sequence so you don't need to add an extra one.

Ok, will do this. I was trying to make the code less messy, but will find
a cleaner way.
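
Perhaps something along these lines (a rough sketch against the existing
put_pool/put_objcg labels in zswap_store(), keeping this patch's
interfaces as-is; the signature changes you note below would still apply):

	pool = zswap_pool_current_get();
	if (!pool)
		goto put_objcg;

	/*
	 * Try the batching path first with the pool reference we already
	 * hold; fall through to the sequential per-page loop otherwise.
	 */
	if (folio_test_large(folio) && READ_ONCE(compress_batching) &&
	    zswap_pool_can_batch(pool)) {
		int error = -1;

		if (__zswap_store_batch_core(folio_nid(folio), &folio,
					     &error, 1)) {
			ret = !error;
			goto put_pool;
		}
	}
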
> > +		if (!pool) {
> > +			pr_err("Cannot setup acomp_batch_ctx for compress batching: no current pool found\n");
>
> This is unnecessary.

Ok.

> > +			goto sequential_store;
> > +		}
> > +
> > +		if (zswap_pool_can_batch(pool)) {
>
> This function is introduced in another patch, where it isn't
> used. Please add functions and callers in the same patch.

Ok. Unintended side effects of trying to break down the changes into
smaller commits. Will address this in v4.

> > +			int error = -1;
> > +			bool store_batch = __zswap_store_batch_core(
> > +						folio_nid(folio),
> > +						&folio, &error, 1);
> > +
> > +			if (store_batch) {
> > +				zswap_pool_put(pool);
> > +				if (!error)
> > +					ret = true;
> > +				return ret;
> > +			}
> > +		}
>
> Please don't future proof code like this, only implement what is
> strictly necessary for the functionality in this patch. You're only
> adding a single caller with nr_folios=1, so it shouldn't be a
> parameter, and the function shouldn't have that batch_idx loop.

Ok.

> > +		zswap_pool_put(pool);
> > +	}
> > +
> > +sequential_store:
> > +
> >  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > @@ -1724,6 +1768,587 @@ bool zswap_store(struct folio *folio)
> >  	return ret;
> >  }
> >
> > +/*
> > + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the
> > + * u8 stack variables in the next several functions to u16.
> > + */
> > +
> > +/*
> > + * Propagate the "sbp" error condition to other batch elements
> > + * belonging to the same folio as "sbp".
> > + */
> > +static __always_inline void zswap_store_propagate_errors(
> > +	struct zswap_store_pipeline_state *zst,
> > +	u8 error_batch_idx)
> > +{
>
> Please observe surrounding coding style on how to wrap >80 col
> function signatures.

I see. Ok.

> Don't use __always_inline unless there is a clear, spelled out
> performance reason. Since it's an error path, that's doubtful.

The motivation was incompressible pages, but sure, will address in v4.

> Please use a consistent namespace for all this:
>
>   CONFIG_ZSWAP_BATCH
>   zswap_batch_store()
>   zswap_batch_alloc_entries()
>   zswap_batch_add_folios()
>   zswap_batch_compress()
>
> etc.
>
> Again, order to avoid forward decls.
>
> Try to keep the overall sequence of events between zswap_store() and
> zswap_batch_store() similar as much as possible for readability.

Definitely.
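To confirm I've understood the naming: for v4 I'm planning a structure
roughly like the following (a sketch only; the exact signatures are still
to be worked out and are hypothetical here):

	/* mm/zswap.c: all batching code gated behind CONFIG_ZSWAP_BATCH. */
	#ifdef CONFIG_ZSWAP_BATCH
	/* Helpers defined above their callers, so no forward decls: */
	static void zswap_batch_add_folios(struct zswap_store_pipeline_state *zst);
	static bool zswap_batch_alloc_entries(struct zswap_store_pipeline_state *zst);
	static bool zswap_batch_compress(struct zswap_store_pipeline_state *zst);

	/* Entry point; mirrors the sequence of events in zswap_store(). */
	static bool zswap_batch_store(struct folio *folio, int *error);
	#endif /* CONFIG_ZSWAP_BATCH */

Thanks,
Kanchana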