On Mon, Jan 6, 2025 at 5:38 PM Sridhar, Kanchana P
<kanchana.p.sridhar@xxxxxxxxx> wrote:
>
> Hi Yosry,
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Sent: Monday, January 6, 2025 3:24 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > Cc: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx;
> > chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx;
> > 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> > compress/decompress batching.
> >
> > On Mon, Jan 6, 2025 at 9:37 AM Sridhar, Kanchana P
> > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > >
> > > Hi Herbert,
> > >
> > > > -----Original Message-----
> > > > From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > > Sent: Saturday, December 28, 2024 3:46 AM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> > > > yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> > > > usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> > > > akpm@linux-foundation.org; linux-crypto@xxxxxxxxxxxxxxx;
> > > > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > > > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > > Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> > > > compress/decompress batching.
> > > >
> > > > On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> > > > > This commit adds get_batch_size(), batch_compress() and
> > > > > batch_decompress() interfaces to:
> > > >
> > > > First of all, we don't need a batch compress/decompress interface,
> > > > because the whole point of request chaining is to supply the data
> > > > in batches.
> > > >
> > > > I'm also against having a get_batch_size(), because the user should
> > > > be supplying as much data as they're comfortable with. In other
> > > > words, if the user is happy to give us 8 requests for iaa, then they
> > > > should be happy to give us 8 requests for every implementation.
> > > >
> > > > The request chaining interface should be such that processing
> > > > 8 requests is always better than doing 1 request at a time, as
> > > > the cost is amortised.
> > >
> > > Thanks for your comments. Can you please elaborate on how
> > > request chaining would enable cost amortization for software
> > > compressors? With the current implementation, a module like
> > > zswap would need to do the following to invoke request chaining
> > > for software compressors (in addition to pushing the chaining
> > > to the user layer for IAA, as per your suggestion on not needing a
> > > batch compress/decompress interface):
> > >
> > > zswap_batch_compress():
> > >
> > >     for (i = 0; i < nr_pages_in_batch; ++i) {
> > >             /* set up the acomp_req "reqs[i]". */
> > >             [ ... ]
> > >             if (i)
> > >                     acomp_request_chain(reqs[i], reqs[0]);
> > >             else
> > >                     acomp_reqchain_init(reqs[0], 0, crypto_req_done, crypto_wait);
> > >     }
> > >
> > >     /* Process the request chain in series. */
> > >     err = crypto_wait_req(acomp_do_req_chain(reqs[0], crypto_acomp_compress),
> > >                           crypto_wait);
> > >
> > > Internally, acomp_do_req_chain() would sequentially process the
> > > request chain by:
> > >
> > > 1) adding all requests to a list "state"
> > > 2) calling crypto_acomp_compress() for the next list element
> > > 3) when this request completes, dequeuing it from the list "state"
> > > 4) repeating for all requests in "state"
> > > 5) when the last request in "state" completes, calling
> > >    "reqs[0]->base.complete()", which notifies crypto_wait.
> > >
> > > From what I can understand, the latency cost should be the same for
> > > processing a request chain in series vs. processing each request as it is
> > > done today in zswap, by calling:
> > >
> > >     comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
> > >                                &acomp_ctx->wait);
> > >
> > > It is not clear to me if there is a cost amortization benefit for software
> > > compressors. One of the requirements from Yosry was that there should
> > > be no change for the software compressors, which is what I have
> > > attempted to do in v5.
> > >
> > > Can you please help us understand if there is room for optimizing
> > > the implementation of the synchronous "acomp_do_req_chain()" API?
> > > I would also like to get inputs from the zswap maintainers on using
> > > request chaining for a batching implementation for software compressors.
> >
> > Is there a functional change in doing so, or are we just using different
> > interfaces to accomplish the same thing we do today?
>
> The code paths for software compressors are considerably different between
> these two scenarios:
>
> 1) Given a batch of 8 pages: for each page, call zswap_compress(), which
>    does this:
>
>        comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
>                                   &acomp_ctx->wait);
>
> 2) Given a batch of 8 pages:
>
>    a) Create a request chain of 8 acomp_reqs, starting with reqs[0], as
>       described earlier.
>    b) Process the request chain by calling:
>
>        err = crypto_wait_req(acomp_do_req_chain(reqs[0], crypto_acomp_compress),
>                              &acomp_ctx->wait);
>
>        /* Get each req's error status. */
>        for (i = 0; i < nr_pages; ++i) {
>                errors[i] = acomp_request_err(reqs[i]);
>                if (errors[i]) {
>                        pr_debug("Request chaining req %d compress error %d\n",
>                                 i, errors[i]);
>                } else {
>                        dlens[i] = reqs[i]->dlen;
>                }
>        }
>
> What I mean by considerably different code paths is that request chaining
> internally overwrites the req's base.complete and base.data (after saving the
> original values) to implement the algorithm described earlier. Basically, the
> chain is processed in series by getting the next req in the chain and setting
> its completion function to "acomp_reqchain_done()", which gets called when
> the "op" (crypto_acomp_compress()) completes for that req.
> acomp_reqchain_done() then causes the next req to be processed in the same
> manner. When the last req in the chain completes, it notifies the original
> completion function of reqs[0], with the crypto_wait that zswap sets up in
> zswap_cpu_comp_prepare():
>
>     acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                crypto_req_done, &acomp_ctx->wait);
>
> Patch [1] in v5 of this series has the full implementation of
> acomp_do_req_chain() in case you want to understand this in more detail.
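To make the comparison above easier to follow, here is a consolidated, purely
illustrative sketch of the two call paths. It is not code from zswap or from
the v5 series: the function names are made up, the struct crypto_acomp_ctx
fields (reqs[], wait) are assumed to match the per-CPU context this series
uses, and the chaining helpers (acomp_reqchain_init(), acomp_request_chain(),
acomp_do_req_chain(), acomp_request_err()) are the ones proposed in [1].

    /*
     * Illustrative sketch only -- not the actual zswap or v5 patch code.
     * Scenario 1: today's path, one synchronous request per page.
     */
    static int compress_batch_one_by_one(struct crypto_acomp_ctx *acomp_ctx,
                                         int nr_pages, unsigned int *dlens)
    {
            int i, err;

            for (i = 0; i < nr_pages; ++i) {
                    /* set up src/dst scatterlists of reqs[0] for page i ... */
                    err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
                                          &acomp_ctx->wait);
                    if (err)
                            return err;
                    dlens[i] = acomp_ctx->reqs[0]->dlen;
            }
            return 0;
    }

    /*
     * Scenario 2: build a chain of nr_pages requests, process it serially
     * with the helpers proposed in [1], then collect per-request status.
     */
    static int compress_batch_chained(struct crypto_acomp_ctx *acomp_ctx,
                                      int nr_pages, unsigned int *dlens,
                                      int *errors)
    {
            int i, err;

            for (i = 0; i < nr_pages; ++i) {
                    /* set up src/dst scatterlists of reqs[i] for page i ... */
                    if (i)
                            acomp_request_chain(acomp_ctx->reqs[i],
                                                acomp_ctx->reqs[0]);
                    else
                            acomp_reqchain_init(acomp_ctx->reqs[0], 0,
                                                crypto_req_done,
                                                &acomp_ctx->wait);
            }

            /* One wait for the whole chain. */
            err = crypto_wait_req(acomp_do_req_chain(acomp_ctx->reqs[0],
                                                     crypto_acomp_compress),
                                  &acomp_ctx->wait);

            /* Per-request status, as in scenario 2b above. */
            for (i = 0; i < nr_pages; ++i) {
                    errors[i] = acomp_request_err(acomp_ctx->reqs[i]);
                    if (!errors[i])
                            dlens[i] = acomp_ctx->reqs[i]->dlen;
            }
            return err;
    }

In both sketches each page is still compressed by one synchronous
crypto_acomp_compress() call under crypto_wait_req(); whether the chained form
amortises any cost for software compressors is exactly the question raised
above.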
> > The "functional change" wrt request chaining is limited to the above. For software compressors, the batch size should be 1. In that scenario, from a zswap perspective (without going into the acomp implementation details please), is there a functional difference? If not, we can just use the request chaining API regardless of batching if that is what Herbert means. > > [1]: https://patchwork.kernel.org/project/linux-mm/patch/20241221063119.29140-2-kanchana.p.sridhar@xxxxxxxxx/ > > Thanks, > Kanchana >