> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Sent: Tuesday, December 3, 2024 2:18 PM
> To: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Cc: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>; Nhat Pham
> <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>; Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx>
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> if the crypto_alg supports batching.
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Sent: Tuesday, December 3, 2024 1:44 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > Cc: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>; Nhat Pham
> > <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > hannes@xxxxxxxxxxx; chengming.zhou@xxxxxxxxx;
> > usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; ying.huang@xxxxxxxxx;
> > 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> > linux-crypto@xxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> > ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi,
> > Kristen C <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K
> > <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> > if the crypto_alg supports batching.
> >
> > On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > > Sent: Tuesday, December 3, 2024 12:01 AM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > > Cc: Nhat Pham <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > > > linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx;
> > > > chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> > > > ryan.roberts@xxxxxxx; ying.huang@xxxxxxxxx; 21cnbao@xxxxxxxxx;
> > > > akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> > > > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > > > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching
> > > > resources if the crypto_alg supports batching.
> > > >
> > > > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > > > >
> > > > > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > > > > from the compressor internal fields itself, no?
> > > > > >
> > > > > > acomp_has_async_batching(acomp);
> > > > > >
> > > > > > Is this just for convenience, or is this actually an expensive
> > > > > > thing to compute?
> > > > >
> > > > > Thanks for your comments. This is a good question. I tried not to
> > > > > imply that batching resources have been allocated for the cpu based
> > > > > only on what acomp_has_async_batching() returns.
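For illustration, a minimal sketch of the distinction being drawn here, in
the spirit of this patch series: acomp_has_async_batching() reports only
what the algorithm advertises, while a flag like pool->can_batch should
reflect whether the per-cpu allocations actually succeeded. The helper name
zswap_alloc_batch_reqs() is hypothetical, and SWAP_CRYPTO_BATCH_SIZE and
acomp_has_async_batching() come from this series, not from upstream:

#include <crypto/acompress.h>

/* Hypothetical helper, sketch only: "can_batch" is derived from the
 * allocation outcome, not from the capability check alone.
 */
static bool zswap_alloc_batch_reqs(struct crypto_acomp *acomp,
				   struct acomp_req *reqs[])
{
	int i;

	/* Capability alone does not guarantee allocation success. */
	if (!acomp_has_async_batching(acomp))
		return false;

	for (i = 0; i < SWAP_CRYPTO_BATCH_SIZE; ++i) {
		reqs[i] = acomp_request_alloc(acomp);
		if (!reqs[i])
			goto err;	/* -ENOMEM during cpu onlining */
	}
	return true;	/* only now would pool->can_batch = true be justified */

err:
	while (--i >= 0)
		acomp_request_free(reqs[i]);
	return false;
}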
> > > > > It is possible that the cpu onlining code ran into an -ENOMEM
> > > > > error on any particular cpu. In this case, I set pool->can_batch
> > > > > to "false", mainly for convenience, so that zswap can be somewhat
> > > > > insulated from migration. I agree that this may not be the best
> > > > > solution; whether or not batching is enabled can be directly
> > > > > determined just before the call to crypto_acomp_batch_compress(),
> > > > > based on:
> > > > >
> > > > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> > > >
> > > > With ahash request chaining, the idea is to accumulate as much
> > > > data as you can before you provide it to the Crypto API. The
> > > > API is responsible for dividing it up if the underlying driver
> > > > is only able to handle one request at a time.
> > > >
> > > > So that would be the ideal model to use for compression as well.
> > > > Provide as much data as you can and let the API handle the case
> > > > where the data needs to be divided up.
> > >
> > > Thanks for this suggestion! This sounds like a clean way to handle
> > > batching/sequential compress/decompress within the crypto API, as
> > > long as it can be contained in the crypto acompress layer.
> > > If the zswap maintainers don't have any objections, I can look into
> > > the feasibility of doing this.
> >
> > Does this mean that instead of zswap breaking down the folio into
> > SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the
> > crypto layer and let it do the batching as it pleases?
>
> If I understand Herbert's suggestion correctly, I think what he meant was
> that we always allocate only SWAP_CRYPTO_BATCH_SIZE # of buffers (say, 8)
> in zswap during cpu onlining. The acomp_has_async_batching() API can be
> used to determine whether to allocate more than one acomp_req and
> crypto_wait (fyi, I am creating SWAP_CRYPTO_BATCH_SIZE # of crypto_waits
> for the request chaining, with the goal of understanding performance wrt
> the existing implementation of crypto_acomp_batch_compress()).
> In zswap_store_folio(), we process the large folio in batches of 8 pages
> and call crypto_acomp_batch_compress() for each batch. Based on earlier
> discussions in this thread, it might make sense to add a bool option to
> crypto_acomp_batch_compress() as follows:
>
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
>                                                struct crypto_wait *waits[],
>                                                struct page *pages[],
>                                                u8 *dsts[],
>                                                unsigned int dlens[],
>                                                int errors[],
>                                                int nr_pages,
>                                                bool parallel);
>
> zswap would acquire the per-cpu acomp_ctx->mutex, and pass
> (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) as the "parallel"
> parameter. This indicates to crypto_acomp_batch_compress() whether or
> not SWAP_CRYPTO_BATCH_SIZE # of elements are available in "reqs" and
> "waits".
>
> If we have multiple "reqs" (parallel == true), we use request chaining
> (or the existing asynchronous poll implementation) for IAA batching.
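For reference, a sketch of what the parallel == true path could look like
using the kernel's standard submit-then-wait completion pattern, rather
than request chaining. The helper name
crypto_acomp_batch_compress_parallel() is hypothetical, nr_pages is assumed
to be <= SWAP_CRYPTO_BATCH_SIZE, and each waits[i] is assumed to have been
initialized with crypto_init_wait() at cpu onlining; this is an
illustration, not the series' actual implementation:

static inline bool crypto_acomp_batch_compress_parallel(
					struct acomp_req *reqs[],
					struct crypto_wait *waits[],
					struct page *pages[],
					u8 *dsts[],
					unsigned int dlens[],
					int errors[],
					int nr_pages)
{
	struct scatterlist inputs[SWAP_CRYPTO_BATCH_SIZE];
	struct scatterlist outputs[SWAP_CRYPTO_BATCH_SIZE];
	bool ok = true;
	int i;

	/* Submit every request without waiting; an async driver (e.g. IAA)
	 * returns -EINPROGRESS from crypto_acomp_compress() and can work
	 * on the whole batch concurrently.
	 */
	for (i = 0; i < nr_pages; ++i) {
		sg_init_table(&inputs[i], 1);
		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
		sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
		acomp_request_set_params(reqs[i], &inputs[i], &outputs[i],
					 PAGE_SIZE, dlens[i]);
		acomp_request_set_callback(reqs[i],
					   CRYPTO_TFM_REQ_MAY_BACKLOG,
					   crypto_req_done, waits[i]);
		errors[i] = crypto_acomp_compress(reqs[i]);
	}

	/* Reap completions: crypto_wait_req() turns -EINPROGRESS/-EBUSY
	 * into the final status once the completion callback fires.
	 */
	for (i = 0; i < nr_pages; ++i) {
		errors[i] = crypto_wait_req(errors[i], waits[i]);
		if (errors[i])
			ok = false;
		else
			dlens[i] = reqs[i]->dlen;
	}

	return ok;
}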
> If (parallel == false), crypto_acomp_batch_compress() will look
> something like this:
>
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
>                                                struct crypto_wait *waits[],
>                                                struct page *pages[],
>                                                u8 *dsts[],
>                                                unsigned int dlens[],
>                                                int errors[],
>                                                int nr_pages,
>                                                bool parallel)
> {
>         if (!parallel) {
>                 struct scatterlist input, output;
>                 int i;
>
>                 for (i = 0; i < nr_pages; ++i) {
>                         /* For pages[i], dsts[i], dlens[i]: borrow the
>                          * first half of zswap_compress() functionality,
>                          * reusing reqs[0]/waits[0] to compress the batch
>                          * one page at a time:
>                          */
>                         sg_init_table(&input, 1);
>                         sg_set_page(&input, pages[i], PAGE_SIZE, 0);
>
>                         sg_init_one(&output, dsts[i], PAGE_SIZE * 2);
>                         acomp_request_set_params(reqs[0], &input, &output,
>                                                  PAGE_SIZE, dlens[i]);
>
>                         errors[i] = crypto_wait_req(
>                                         crypto_acomp_compress(reqs[0]),
>                                         waits[0]);
>                         dlens[i] = reqs[0]->dlen;
>                 }
>         }
>
>         /*
>          * At this point we would have sequentially compressed the batch.
>          * zswap_store_folio() can process the buffers and dlens using
>          * common code for batching and non-batching compressors.
>          */
>         return true;
> }
>
> IIUC, this suggestion appears to be along the lines of using common
> code in zswap as far as possible, for compressors that do and do not
> support batching. Herbert can correct me if I am wrong.
>
> If this is indeed the case, the memory penalty for software compressors
> would be:
> 1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
>    zswap_cpu_comp_prepare().
> 2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
>    zswap_store_folio().
>
> This would be an additional memory penalty for what we gain by
> having the common code paths in zswap for compressors that do
> and do not support batching.

Alternately, we could use request chaining always, even for software
compressors, at a larger per-cpu memory penalty, by allocating
SWAP_CRYPTO_BATCH_SIZE # of reqs/waits by default. I don't know if this
would have functional issues, because the chain of requests would be
processed sequentially (basically, all requests are added to a list),
but maybe this is what Herbert is suggesting (not sure).

Thanks,
Kanchana

> Thanks,
> Kanchana
>
> > It sounds nice on the surface, but this implies that we have to
> > allocate folio_nr_pages() buffers in zswap, essentially an allocation
> > the same size as the folio itself. While the allocation does not need
> > to be contiguous, making a large number of allocations in the reclaim
> > path is definitely not something we want. For a 2M THP, we'd need to
> > allocate 2M in zswap_store().
> >
> > If we choose to keep preallocating, assuming the maximum THP size is
> > 2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
> > memory.
> >
> > I feel like I am missing something.
> >
> > > Thanks,
> > > Kanchana
> > >
> > > > Cheers,
> > > > --
> > > > Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > > Home Page: http://gondor.apana.org.au/~herbert/
> > > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
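As a rough worked comparison of the two memory footprints discussed in this
thread, assuming 4 KiB pages and the 2 * PAGE_SIZE destination buffers used
in the sketches above (both assumptions carried over from the thread, with
SWAP_CRYPTO_BATCH_SIZE == 8):

  per-folio buffers for a 2M THP:  512 pages * 4 KiB = 2 MiB per cpu (if preallocated)
  batched preallocation:           8 buffers * 8 KiB = 64 KiB per cpu

This is the gap behind Yosry's concern: preallocating folio-sized buffers
costs roughly 32x more per cpu than preallocating one 8-entry batch.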