> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Sent: Tuesday, December 3, 2024 2:18 PM
> To: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Cc: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>; Nhat Pham
> <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>; Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx>
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> if the crypto_alg supports batching.
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Sent: Tuesday, December 3, 2024 1:44 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > Cc: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>; Nhat Pham
> > <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > hannes@xxxxxxxxxxx; chengming.zhou@xxxxxxxxx;
> > usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; ying.huang@xxxxxxxxx;
> > 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> > linux-crypto@xxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> > ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi,
> > Kristen C <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K
> > <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> > if the crypto_alg supports batching.
> >
> > On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > > Sent: Tuesday, December 3, 2024 12:01 AM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > > Cc: Nhat Pham <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > > > linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx;
> > > > chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> > > > ryan.roberts@xxxxxxx; ying.huang@xxxxxxxxx; 21cnbao@xxxxxxxxx;
> > > > akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> > > > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > > > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching
> > > > resources if the crypto_alg supports batching.
> > > >
> > > > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > > > >
> > > > > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > > > > from the compressor internal fields itself, no?
> > > > > >
> > > > > > acomp_has_async_batching(acomp);
> > > > > >
> > > > > > Is this just for convenience, or is this actually an expensive
> > > > > > thing to compute?
> > > > >
> > > > > Thanks for your comments. This is a good question. I tried not to
> > > > > imply that batching resources have been allocated for the cpu based
> > > > > only on what acomp_has_async_batching() returns.
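For illustration, a minimal sketch of the distinction being drawn here, in
the spirit of this patch series: acomp_has_async_batching() reports only
what the algorithm advertises, while a flag like pool->can_batch should
reflect whether the per-cpu allocations actually succeeded. The helper name
zswap_alloc_batch_reqs() is hypothetical, and SWAP_CRYPTO_BATCH_SIZE and
acomp_has_async_batching() come from this series, not from upstream:

#include <crypto/acompress.h>

/* Hypothetical helper, sketch only: "can_batch" is derived from the
 * allocation outcome, not from the capability check alone.
 */
static bool zswap_alloc_batch_reqs(struct crypto_acomp *acomp,
				   struct acomp_req *reqs[])
{
	int i;

	/* Capability alone does not guarantee allocation success. */
	if (!acomp_has_async_batching(acomp))
		return false;

	for (i = 0; i < SWAP_CRYPTO_BATCH_SIZE; ++i) {
		reqs[i] = acomp_request_alloc(acomp);
		if (!reqs[i])
			goto err;	/* -ENOMEM during cpu onlining */
	}
	return true;	/* only now would pool->can_batch = true be justified */

err:
	while (--i >= 0)
		acomp_request_free(reqs[i]);
	return false;
}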
> > > > > It is possible that the cpu onlining code ran into an -ENOMEM
> > > > > error on any particular cpu. In this case, I set pool->can_batch
> > > > > to "false", mainly for convenience, so that zswap can be somewhat
> > > > > insulated from migration. I agree that this may not be the best
> > > > > solution; whether or not batching is enabled can be directly
> > > > > determined just before the call to crypto_acomp_batch_compress(),
> > > > > based on:
> > > > >
> > > > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> > > >
> > > > With ahash request chaining, the idea is to accumulate as much
> > > > data as you can before you provide it to the Crypto API. The
> > > > API is responsible for dividing it up if the underlying driver
> > > > is only able to handle one request at a time.
> > > >
> > > > So that would be the ideal model to use for compression as well.
> > > > Provide as much data as you can and let the API handle the case
> > > > where the data needs to be divided up.
> > >
> > > Thanks for this suggestion! This sounds like a clean way to handle
> > > batching/sequential compress/decompress within the crypto API, as
> > > long as it can be contained in the crypto acompress layer.
> > > If the zswap maintainers don't have any objections, I can look into
> > > the feasibility of doing this.
> >
> > Does this mean that instead of zswap breaking down the folio into
> > SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the
> > crypto layer and let it do the batching as it pleases?
>
> If I understand Herbert's suggestion correctly, I think what he meant was
> that we always allocate only SWAP_CRYPTO_BATCH_SIZE # of buffers (say, 8)
> in zswap during cpu onlining. The acomp_has_async_batching() API can be
> used to determine whether to allocate more than one acomp_req and
> crypto_wait (fyi, I am creating SWAP_CRYPTO_BATCH_SIZE # of crypto_waits
> for the request chaining, with the goal of understanding performance wrt
> the existing implementation of crypto_acomp_batch_compress()).
> In zswap_store_folio(), we process the large folio in batches of 8 pages
> and call crypto_acomp_batch_compress() for each batch. Based on earlier
> discussions in this thread, it might make sense to add a bool option to
> crypto_acomp_batch_compress() as follows:
>
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
>                                                struct crypto_wait *waits[],
>                                                struct page *pages[],
>                                                u8 *dsts[],
>                                                unsigned int dlens[],
>                                                int errors[],
>                                                int nr_pages,
>                                                bool parallel);
>
> zswap would acquire the per-cpu acomp_ctx->mutex, and pass
> (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) as the "parallel"
> parameter. This indicates to crypto_acomp_batch_compress() whether or
> not SWAP_CRYPTO_BATCH_SIZE # of elements are available in "reqs" and
> "waits".
>
> If we have multiple "reqs" (parallel == true), we use request chaining
> (or the existing asynchronous poll implementation) for IAA batching.
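For reference, a sketch of what the parallel == true path could look like
using the kernel's standard submit-then-wait completion pattern, rather
than request chaining. The helper name
crypto_acomp_batch_compress_parallel() is hypothetical, nr_pages is assumed
to be <= SWAP_CRYPTO_BATCH_SIZE, and each waits[i] is assumed to have been
initialized with crypto_init_wait() at cpu onlining; this is an
illustration, not the series' actual implementation:

static inline bool crypto_acomp_batch_compress_parallel(
					struct acomp_req *reqs[],
					struct crypto_wait *waits[],
					struct page *pages[],
					u8 *dsts[],
					unsigned int dlens[],
					int errors[],
					int nr_pages)
{
	struct scatterlist inputs[SWAP_CRYPTO_BATCH_SIZE];
	struct scatterlist outputs[SWAP_CRYPTO_BATCH_SIZE];
	bool ok = true;
	int i;

	/* Submit every request without waiting; an async driver (e.g. IAA)
	 * returns -EINPROGRESS from crypto_acomp_compress() and can work
	 * on the whole batch concurrently.
	 */
	for (i = 0; i < nr_pages; ++i) {
		sg_init_table(&inputs[i], 1);
		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
		sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
		acomp_request_set_params(reqs[i], &inputs[i], &outputs[i],
					 PAGE_SIZE, dlens[i]);
		acomp_request_set_callback(reqs[i],
					   CRYPTO_TFM_REQ_MAY_BACKLOG,
					   crypto_req_done, waits[i]);
		errors[i] = crypto_acomp_compress(reqs[i]);
	}

	/* Reap completions: crypto_wait_req() turns -EINPROGRESS/-EBUSY
	 * into the final status once the completion callback fires.
	 */
	for (i = 0; i < nr_pages; ++i) {
		errors[i] = crypto_wait_req(errors[i], waits[i]);
		if (errors[i])
			ok = false;
		else
			dlens[i] = reqs[i]->dlen;
	}

	return ok;
}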
> If (parallel == false), crypto_acomp_batch_compress() will look
> something like this:
>
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
>                                                struct crypto_wait *waits[],
>                                                struct page *pages[],
>                                                u8 *dsts[],
>                                                unsigned int dlens[],
>                                                int errors[],
>                                                int nr_pages,
>                                                bool parallel)
> {
>         if (!parallel) {
>                 struct scatterlist input, output;
>                 int i;
>
>                 for (i = 0; i < nr_pages; ++i) {
>                         /* For pages[i], dsts[i], dlens[i]: borrow the
>                          * first half of zswap_compress() functionality,
>                          * reusing reqs[0]/waits[0] to compress the batch
>                          * one page at a time:
>                          */
>                         sg_init_table(&input, 1);
>                         sg_set_page(&input, pages[i], PAGE_SIZE, 0);
>
>                         sg_init_one(&output, dsts[i], PAGE_SIZE * 2);
>                         acomp_request_set_params(reqs[0], &input, &output,
>                                                  PAGE_SIZE, dlens[i]);
>
>                         errors[i] = crypto_wait_req(
>                                         crypto_acomp_compress(reqs[0]),
>                                         waits[0]);
>                         dlens[i] = reqs[0]->dlen;
>                 }
>         }
>
>         /*
>          * At this point we would have sequentially compressed the batch.
>          * zswap_store_folio() can process the buffers and dlens using
>          * common code for batching and non-batching compressors.
>          */
>         return true;
> }
>
> IIUC, this suggestion appears to be along the lines of using common
> code in zswap as far as possible, for compressors that do and do not
> support batching. Herbert can correct me if I am wrong.
>
> If this is indeed the case, the memory penalty for software compressors
> would be:
> 1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
>    zswap_cpu_comp_prepare().
> 2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
>    zswap_store_folio().
>
> This would be an additional memory penalty for what we gain by
> having the common code paths in zswap for compressors that do
> and do not support batching.

Alternately, we could use request chaining always, even for software
compressors, at a larger per-cpu memory penalty, by allocating
SWAP_CRYPTO_BATCH_SIZE # of reqs/waits by default. I don't know if this
would have functional issues, because the chain of requests would be
processed sequentially (basically, all requests are added to a list),
but maybe this is what Herbert is suggesting (not sure).

Thanks,
Kanchana

> Thanks,
> Kanchana
>
> > It sounds nice on the surface, but this implies that we have to
> > allocate folio_nr_pages() buffers in zswap, essentially an allocation
> > the same size as the folio itself. While the allocation does not need
> > to be contiguous, making a large number of allocations in the reclaim
> > path is definitely not something we want. For a 2M THP, we'd need to
> > allocate 2M in zswap_store().
> >
> > If we choose to keep preallocating, assuming the maximum THP size is
> > 2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
> > memory.
> >
> > I feel like I am missing something.
> >
> > > Thanks,
> > > Kanchana
> > >
> > > > Cheers,
> > > > --
> > > > Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> > > > Home Page: http://gondor.apana.org.au/~herbert/
> > > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
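As a rough worked comparison of the two memory footprints discussed in this
thread, assuming 4 KiB pages and the 2 * PAGE_SIZE destination buffers used
in the sketches above (both assumptions carried over from the thread, with
SWAP_CRYPTO_BATCH_SIZE == 8):

  per-folio buffers for a 2M THP:  512 pages * 4 KiB = 2 MiB per cpu (if preallocated)
  batched preallocation:           8 buffers * 8 KiB = 64 KiB per cpu

This is the gap behind Yosry's concern: preallocating folio-sized buffers
costs roughly 32x more per cpu than preallocating one 8-entry batch.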