On Mon, Jan 6, 2025 at 5:38 PM Sridhar, Kanchana P
<kanchana.p.sridhar@xxxxxxxxx> wrote:
>
> Hi Yosry,
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Sent: Monday, January 6, 2025 3:24 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > Cc: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx;
> > chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx;
> > 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> > compress/decompress batching.
> >
> > On Mon, Jan 6, 2025 at 9:37 AM Sridhar, Kanchana P
> > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > >
> > > Hi Herbert,
> > >
> > > > -----Original Message-----
> > > > From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > > Sent: Saturday, December 28, 2024 3:46 AM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> > > > yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> > > > usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> > > > akpm@linux-foundation.org; linux-crypto@xxxxxxxxxxxxxxx;
> > > > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > > > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > > Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> > > > compress/decompress batching.
> > > >
> > > > On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> > > > > This commit adds get_batch_size(), batch_compress() and
> > > > > batch_decompress() interfaces to:
> > > >
> > > > First of all, we don't need a batch compress/decompress interface,
> > > > because the whole point of request chaining is to supply the data
> > > > in batches.
> > > >
> > > > I'm also against having a get_batch_size(), because the user should
> > > > be supplying as much data as they're comfortable with. In other
> > > > words, if the user is happy to give us 8 requests for iaa, then they
> > > > should be happy to give us 8 requests for every implementation.
> > > >
> > > > The request chaining interface should be such that processing
> > > > 8 requests is always better than doing 1 request at a time, as
> > > > the cost is amortised.
> > >
> > > Thanks for your comments. Can you please elaborate on how
> > > request chaining would enable cost amortization for software
> > > compressors? With the current implementation, a module like
> > > zswap would need to do the following to invoke request chaining
> > > for software compressors (in addition to pushing the chaining
> > > to the user layer for IAA, as per your suggestion on not needing a
> > > batch compress/decompress interface):
> > >
> > > zswap_batch_compress():
> > >
> > >     for (i = 0; i < nr_pages_in_batch; ++i) {
> > >             /* set up the acomp_req "reqs[i]". */
> > >             [ ... ]
> > >             if (i)
> > >                     acomp_request_chain(reqs[i], reqs[0]);
> > >             else
> > >                     acomp_reqchain_init(reqs[0], 0, crypto_req_done, crypto_wait);
> > >     }
> > >
> > >     /* Process the request chain in series. */
> > >     err = crypto_wait_req(acomp_do_req_chain(reqs[0], crypto_acomp_compress),
> > >                           crypto_wait);
> > >
> > > Internally, acomp_do_req_chain() would sequentially process the
> > > request chain by:
> > >
> > > 1) adding all requests to a list "state"
> > > 2) calling crypto_acomp_compress() for the next list element
> > > 3) when this request completes, dequeuing it from the list "state"
> > > 4) repeating for all requests in "state"
> > > 5) when the last request in "state" completes, calling
> > >    "reqs[0]->base.complete()", which notifies crypto_wait.
> > >
> > > From what I can understand, the latency cost should be the same for
> > > processing a request chain in series vs. processing each request as it is
> > > done today in zswap, by calling:
> > >
> > >     comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
> > >                                &acomp_ctx->wait);
> > >
> > > It is not clear to me if there is a cost amortization benefit for software
> > > compressors. One of the requirements from Yosry was that there should
> > > be no change for the software compressors, which is what I have
> > > attempted to do in v5.
> > >
> > > Can you please help us understand if there is room for optimizing
> > > the implementation of the synchronous "acomp_do_req_chain()" API?
> > > I would also like to get inputs from the zswap maintainers on using
> > > request chaining for a batching implementation for software compressors.
> >
> > Is there a functional change in doing so, or are we just using different
> > interfaces to accomplish the same thing we do today?
>
> The code paths for software compressors are considerably different between
> these two scenarios:
>
> 1) Given a batch of 8 pages: for each page, call zswap_compress(), which
>    does this:
>
>        comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
>                                   &acomp_ctx->wait);
>
> 2) Given a batch of 8 pages:
>
>    a) Create a request chain of 8 acomp_reqs, starting with reqs[0], as
>       described earlier.
>    b) Process the request chain by calling:
>
>        err = crypto_wait_req(acomp_do_req_chain(reqs[0], crypto_acomp_compress),
>                              &acomp_ctx->wait);
>
>        /* Get each req's error status. */
>        for (i = 0; i < nr_pages; ++i) {
>                errors[i] = acomp_request_err(reqs[i]);
>                if (errors[i]) {
>                        pr_debug("Request chaining req %d compress error %d\n",
>                                 i, errors[i]);
>                } else {
>                        dlens[i] = reqs[i]->dlen;
>                }
>        }
>
> What I mean by considerably different code paths is that request chaining
> internally overwrites the req's base.complete and base.data (after saving the
> original values) to implement the algorithm described earlier. Basically, the
> chain is processed in series by getting the next req in the chain and setting
> its completion function to "acomp_reqchain_done()", which gets called when
> the "op" (crypto_acomp_compress()) completes for that req.
> acomp_reqchain_done() then causes the next req to be processed in the same
> manner. When the last req in the chain completes, it notifies the original
> completion function of reqs[0], with the crypto_wait that zswap sets up in
> zswap_cpu_comp_prepare():
>
>     acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                crypto_req_done, &acomp_ctx->wait);
>
> Patch [1] in v5 of this series has the full implementation of
> acomp_do_req_chain() in case you want to understand this in more detail.
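To make the comparison above easier to follow, here is a consolidated, purely
illustrative sketch of the two call paths. It is not code from zswap or from
the v5 series: the function names are made up, the struct crypto_acomp_ctx
fields (reqs[], wait) are assumed to match the per-CPU context this series
uses, and the chaining helpers (acomp_reqchain_init(), acomp_request_chain(),
acomp_do_req_chain(), acomp_request_err()) are the ones proposed in [1].

    /*
     * Illustrative sketch only -- not the actual zswap or v5 patch code.
     * Scenario 1: today's path, one synchronous request per page.
     */
    static int compress_batch_one_by_one(struct crypto_acomp_ctx *acomp_ctx,
                                         int nr_pages, unsigned int *dlens)
    {
            int i, err;

            for (i = 0; i < nr_pages; ++i) {
                    /* set up src/dst scatterlists of reqs[0] for page i ... */
                    err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
                                          &acomp_ctx->wait);
                    if (err)
                            return err;
                    dlens[i] = acomp_ctx->reqs[0]->dlen;
            }
            return 0;
    }

    /*
     * Scenario 2: build a chain of nr_pages requests, process it serially
     * with the helpers proposed in [1], then collect per-request status.
     */
    static int compress_batch_chained(struct crypto_acomp_ctx *acomp_ctx,
                                      int nr_pages, unsigned int *dlens,
                                      int *errors)
    {
            int i, err;

            for (i = 0; i < nr_pages; ++i) {
                    /* set up src/dst scatterlists of reqs[i] for page i ... */
                    if (i)
                            acomp_request_chain(acomp_ctx->reqs[i],
                                                acomp_ctx->reqs[0]);
                    else
                            acomp_reqchain_init(acomp_ctx->reqs[0], 0,
                                                crypto_req_done,
                                                &acomp_ctx->wait);
            }

            /* One wait for the whole chain. */
            err = crypto_wait_req(acomp_do_req_chain(acomp_ctx->reqs[0],
                                                     crypto_acomp_compress),
                                  &acomp_ctx->wait);

            /* Per-request status, as in scenario 2b above. */
            for (i = 0; i < nr_pages; ++i) {
                    errors[i] = acomp_request_err(acomp_ctx->reqs[i]);
                    if (!errors[i])
                            dlens[i] = acomp_ctx->reqs[i]->dlen;
            }
            return err;
    }

In both sketches each page is still compressed by one synchronous
crypto_acomp_compress() call under crypto_wait_req(); whether the chained form
amortises any cost for software compressors is exactly the question raised
above.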
> > The "functional change" wrt request chaining is limited to the above. For software compressors, the batch size should be 1. In that scenario, from a zswap perspective (without going into the acomp implementation details please), is there a functional difference? If not, we can just use the request chaining API regardless of batching if that is what Herbert means. > > [1]: https://patchwork.kernel.org/project/linux-mm/patch/20241221063119.29140-2-kanchana.p.sridhar@xxxxxxxxx/ > > Thanks, > Kanchana >