> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Monday, January 6, 2025 4:59 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx;
> clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx;
> surenb@xxxxxxxxxx; Accardi, Kristen C <kristen.c.accardi@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
>
> On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > This patch does the following:
> >
> > 1) Defines ZSWAP_MAX_BATCH_SIZE to denote the maximum number of acomp_ctx
> >    batching resources (acomp_reqs and buffers) to allocate if the zswap
> >    compressor supports batching. Currently, ZSWAP_MAX_BATCH_SIZE is set to
> >    8U.
> >
> > 2) Modifies the definition of "struct crypto_acomp_ctx" to represent a
> >    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
> >    "struct crypto_acomp_ctx" to contain the number of resources that will
> >    be allocated in the cpu hotplug onlining code.
> >
> > 3) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
> >    crypto_acomp created for the zswap pool (in other words, the zswap
> >    compression algorithm) has registered implementations for
> >    batch_compress() and batch_decompress().
>
> This is an implementation detail that is not visible to the zswap
> code. Please do not refer to batch_compress() and batch_decompress()
> here, just mention that we check if the compressor supports batching.

Thanks for the suggestions. Sure, I will modify the commit log accordingly.

>
> > If so, it will query the
> > crypto_acomp for the maximum batch size supported by the compressor, and
> > set "nr_reqs" to the minimum of this compressor-specific max batch size
> > and ZSWAP_MAX_BATCH_SIZE. Finally, it will allocate "nr_reqs"
> > reqs/buffers, and set the acomp_ctx->nr_reqs accordingly.
> >
> > 4) If the crypto_acomp does not support batching, "nr_reqs" defaults to 1.
>
> General note, some implementation details are obvious from the code
> and do not need to be explained in the commit log. It's mostly useful
> to explain what you are doing from a high level, and why you are doing
> it.
>
> In this case, we should mainly describe that we are adding support for
> the per-CPU acomp_ctx to track multiple compression/decompression
> requests but are not actually using more than one request yet. Mention
> that followup changes will actually utilize this to batch
> compression/decompression of multiple pages, and highlight important
> implementation details (such as ZSWAP_MAX_BATCH_SIZE limiting the
> amount of extra memory we are using for this, and that there is no
> extra memory usage for compressors that do not use batching).

Sure, will do so.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > ---
> >  mm/zswap.c | 122 +++++++++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 90 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 9718c33f8192..99cd78891fd0 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -78,6 +78,13 @@ static bool zswap_pool_reached_full;
> >
> >  #define ZSWAP_PARAM_UNSET ""
> >
> > +/*
> > + * For compression batching of large folios:
> > + * Maximum number of acomp compress requests that will be processed
> > + * in a batch, iff the zswap compressor supports batching.
> > + */
>
> Please mention that this limit exists because we preallocate enough
> requests and buffers accordingly, so a higher limit means higher
> memory usage.

Ok.

>
> > +#define ZSWAP_MAX_BATCH_SIZE 8U
> > +
> >  static int zswap_setup(void);
> >
> >  /* Enable/disable zswap */
> > @@ -143,9 +150,10 @@ bool zswap_never_enabled(void)
> >
> >  struct crypto_acomp_ctx {
> >          struct crypto_acomp *acomp;
> > -        struct acomp_req *req;
> > +        struct acomp_req **reqs;
> > +        u8 **buffers;
> > +        unsigned int nr_reqs;
> >          struct crypto_wait wait;
> > -        u8 *buffer;
> >          struct mutex mutex;
> >          bool is_sleepable;
> >  };
> > @@ -818,49 +826,88 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> >          struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> >          struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> >          struct crypto_acomp *acomp;
> > -        struct acomp_req *req;
> > -        int ret;
> > +        unsigned int nr_reqs = 1;
> > +        int ret = -ENOMEM;
> > +        int i, j;
> >
> >          mutex_init(&acomp_ctx->mutex);
> > -
> > -        acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> > -        if (!acomp_ctx->buffer)
> > -                return -ENOMEM;
> > +        acomp_ctx->nr_reqs = 0;
> >
> >          acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> >          if (IS_ERR(acomp)) {
> >                  pr_err("could not alloc crypto acomp %s : %ld\n",
> >                                  pool->tfm_name, PTR_ERR(acomp));
> > -                ret = PTR_ERR(acomp);
> > -                goto acomp_fail;
> > +                return PTR_ERR(acomp);
> >          }
> >          acomp_ctx->acomp = acomp;
> >          acomp_ctx->is_sleepable = acomp_is_async(acomp);
> >
> > -        req = acomp_request_alloc(acomp_ctx->acomp);
> > -        if (!req) {
> > -                pr_err("could not alloc crypto acomp_request %s\n",
> > -                       pool->tfm_name);
> > -                ret = -ENOMEM;
> > +        /*
> > +         * Create the necessary batching resources if the crypto acomp alg
> > +         * implements the batch_compress and batch_decompress API.
>
> No mention of the internal implementation of acomp_has_async_batching()
> please.

Ok.

>
> > +         */
> > +        if (acomp_has_async_batching(acomp)) {
> > +                nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));
> > +                pr_info_once("Creating acomp_ctx with %d reqs/buffers for batching since crypto acomp\n%s has registered batch_compress() and batch_decompress().\n",
> > +                             nr_reqs, pool->tfm_name);
>
> This will only be printed once, so if the compressor changes the
> information will no longer be up-to-date on all CPUs. I think we
> should just drop it.

Yes, makes sense.

>
> > +        }
> > +
> > +        acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));
>
> Can we use kcalloc_node() here?

I was wondering if the performance penalty of kcalloc_node() is
acceptable, given that cpu onlining happens infrequently?
If so, it appears that zero-initializing the allocated memory would also
help with the cleanup-code simplification you suggest in your subsequent
comment.

>
> > +        if (!acomp_ctx->buffers)
> > +                goto buf_fail;
> > +
> > +        for (i = 0; i < nr_reqs; ++i) {
> > +                acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> > +                if (!acomp_ctx->buffers[i]) {
> > +                        for (j = 0; j < i; ++j)
> > +                                kfree(acomp_ctx->buffers[j]);
> > +                        kfree(acomp_ctx->buffers);
> > +                        ret = -ENOMEM;
> > +                        goto buf_fail;
> > +                }
> > +        }
> > +
> > +        acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));
>
> Ditto.

Sure.

>
> > +        if (!acomp_ctx->reqs)
> >                  goto req_fail;
> > +
> > +        for (i = 0; i < nr_reqs; ++i) {
> > +                acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
> > +                if (!acomp_ctx->reqs[i]) {
> > +                        pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
> > +                               i, pool->tfm_name);
> > +                        for (j = 0; j < i; ++j)
> > +                                acomp_request_free(acomp_ctx->reqs[j]);
> > +                        kfree(acomp_ctx->reqs);
> > +                        ret = -ENOMEM;
> > +                        goto req_fail;
> > +                }
> >          }
> > -        acomp_ctx->req = req;
> >
> > +        /*
> > +         * The crypto_wait is used only in fully synchronous, i.e., with scomp
> > +         * or non-poll mode of acomp, hence there is only one "wait" per
> > +         * acomp_ctx, with callback set to reqs[0], under the assumption that
> > +         * there is at least 1 request per acomp_ctx.
> > +         */
> >          crypto_init_wait(&acomp_ctx->wait);
> >          /*
> >           * if the backend of acomp is async zip, crypto_req_done() will wakeup
> >           * crypto_wait_req(); if the backend of acomp is scomp, the callback
> >           * won't be called, crypto_wait_req() will return without blocking.
> >           */
> > -        acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> > +        acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
> >                                     crypto_req_done, &acomp_ctx->wait);
> >
> > +        acomp_ctx->nr_reqs = nr_reqs;
> >          return 0;
> >
> >  req_fail:
> > +        for (i = 0; i < nr_reqs; ++i)
> > +                kfree(acomp_ctx->buffers[i]);
> > +        kfree(acomp_ctx->buffers);
>
> The cleanup code is all over the place. Sometimes it's done in the
> loops allocating the memory and sometimes here. It's a bit hard to
> follow. Please have all the cleanups here. You can just initialize the
> arrays to 0s, and then if the array is not-NULL you can free any
> non-NULL elements (kfree() will handle NULLs gracefully).

Sure, if the performance of kzalloc_node() is an acceptable trade-off for
the cleanup code simplification.

>
> There may be even potential for code reuse with zswap_cpu_comp_dead().

I assume the reuse would be through copy-and-pasting the same lines of
code, as opposed to a common procedure being called by both
zswap_cpu_comp_prepare() and zswap_cpu_comp_dead()?
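If a common procedure is the preferred direction, below is a rough,
untested sketch of what I have in mind. The helper name
acomp_ctx_dealloc() is just a placeholder; it assumes acomp_ctx->nr_reqs
is set before the per-request allocations, and that the reqs/buffers
arrays are zero-allocated (kcalloc_node()/kzalloc_node()) so that any
slot that was never filled is NULL:

static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
{
        int i;

        if (IS_ERR_OR_NULL(acomp_ctx))
                return;

        if (acomp_ctx->reqs) {
                /* Unfilled slots are NULL and are simply skipped. */
                for (i = 0; i < acomp_ctx->nr_reqs; ++i)
                        if (acomp_ctx->reqs[i])
                                acomp_request_free(acomp_ctx->reqs[i]);
                kfree(acomp_ctx->reqs);
                acomp_ctx->reqs = NULL;
        }

        if (acomp_ctx->buffers) {
                /* kfree() handles NULL gracefully, so no per-slot checks. */
                for (i = 0; i < acomp_ctx->nr_reqs; ++i)
                        kfree(acomp_ctx->buffers[i]);
                kfree(acomp_ctx->buffers);
                acomp_ctx->buffers = NULL;
        }

        if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
                crypto_free_acomp(acomp_ctx->acomp);

        acomp_ctx->nr_reqs = 0;
}

With this, zswap_cpu_comp_prepare() could call the helper on any failure,
and zswap_cpu_comp_dead() would reduce to the same call.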
>
> > +buf_fail:
> >          crypto_free_acomp(acomp_ctx->acomp);
> > -acomp_fail:
> > -        kfree(acomp_ctx->buffer);
> >          return ret;
> >  }
> >
> > @@ -870,11 +917,22 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> >          struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> >
> >          if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > -                if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > -                        acomp_request_free(acomp_ctx->req);
> > +                int i;
> > +
> > +                for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > +                        if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> > +                                acomp_request_free(acomp_ctx->reqs[i]);
> > +                kfree(acomp_ctx->reqs);
> > +
> > +                for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > +                        kfree(acomp_ctx->buffers[i]);
> > +                kfree(acomp_ctx->buffers);
> > +
> >                  if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> >                          crypto_free_acomp(acomp_ctx->acomp);
> > -                kfree(acomp_ctx->buffer);
> > +
> > +                acomp_ctx->nr_reqs = 0;
> > +                acomp_ctx = NULL;
> >          }
> >
> >          return 0;
> > @@ -897,7 +955,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >
> >          mutex_lock(&acomp_ctx->mutex);
> >
> > -        dst = acomp_ctx->buffer;
> > +        dst = acomp_ctx->buffers[0];
> >          sg_init_table(&input, 1);
> >          sg_set_page(&input, page, PAGE_SIZE, 0);
> >
> > @@ -907,7 +965,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >           * giving the dst buffer with enough length to avoid buffer overflow.
> >           */
> >          sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -        acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> > +        acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
> >
> >          /*
> >           * it maybe looks a little bit silly that we send an asynchronous request,
> > @@ -921,8 +979,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >           * but in different threads running on different cpu, we have different
> >           * acomp instance, so multiple threads can do (de)compression in parallel.
> >           */
> > -        comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > -        dlen = acomp_ctx->req->dlen;
> > +        comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> > +        dlen = acomp_ctx->reqs[0]->dlen;
> >          if (comp_ret)
> >                  goto unlock;
> >
> > @@ -975,20 +1033,20 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >           */
> >          if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
> >              !virt_addr_valid(src)) {
> > -                memcpy(acomp_ctx->buffer, src, entry->length);
> > -                src = acomp_ctx->buffer;
> > +                memcpy(acomp_ctx->buffers[0], src, entry->length);
> > +                src = acomp_ctx->buffers[0];
> >                  zpool_unmap_handle(zpool, entry->handle);
> >          }
> >
> >          sg_init_one(&input, src, entry->length);
> >          sg_init_table(&output, 1);
> >          sg_set_folio(&output, folio, PAGE_SIZE, 0);
> > -        acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > -        BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > -        BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +        acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
> > +        BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
> > +        BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
> >          mutex_unlock(&acomp_ctx->mutex);
> >
> > -        if (src != acomp_ctx->buffer)
> > +        if (src != acomp_ctx->buffers[0])
> >                  zpool_unmap_handle(zpool, entry->handle);
> >  }
> >
> > --
> > 2.27.0
> >
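Also, to make the kcalloc_node() and shared-cleanup suggestions concrete,
here is roughly how the onlining path could end up looking. This is again
an untested sketch, using the hypothetical acomp_ctx_dealloc() helper
from above, with the error messages omitted for brevity:

static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
        struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
        struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
        unsigned int nr_reqs = 1;
        int i;

        mutex_init(&acomp_ctx->mutex);

        acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
        if (IS_ERR(acomp_ctx->acomp))
                return PTR_ERR(acomp_ctx->acomp);
        acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);

        if (acomp_has_async_batching(acomp_ctx->acomp))
                nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
                              crypto_acomp_batch_size(acomp_ctx->acomp));

        /* Set nr_reqs up front so the cleanup helper knows how far to walk. */
        acomp_ctx->nr_reqs = nr_reqs;

        /* Zeroed arrays: any slot that is never filled stays NULL. */
        acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(u8 *),
                                          GFP_KERNEL, cpu_to_node(cpu));
        acomp_ctx->reqs = kcalloc_node(nr_reqs, sizeof(struct acomp_req *),
                                       GFP_KERNEL, cpu_to_node(cpu));
        if (!acomp_ctx->buffers || !acomp_ctx->reqs)
                goto fail;

        for (i = 0; i < nr_reqs; ++i) {
                acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
                                                     GFP_KERNEL, cpu_to_node(cpu));
                acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
                if (!acomp_ctx->buffers[i] || !acomp_ctx->reqs[i])
                        goto fail;
        }

        crypto_init_wait(&acomp_ctx->wait);
        acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
                                   crypto_req_done, &acomp_ctx->wait);
        return 0;

fail:
        acomp_ctx_dealloc(acomp_ctx);
        return -ENOMEM;
}

The extra zeroing from kcalloc_node() should be negligible given how
rarely CPUs are onlined, which is the trade-off I was asking about above.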