On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> wrote: > > > IAA Compression Batching with acomp Request Chaining: > ===================================================== > > This patch-series introduces the use of the Intel Analytics Accelerator > (IAA) for parallel batch compression of pages in large folios to improve > zswap swapout latency, resulting in sys time reduction by 22% (usemem30) > and by 27% (kernel compilation); as well as a 30% increase in usemem30 > throughput with IAA batching as compared to zstd. > > The patch-series is organized as follows: > > 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant > patches are tagged with "crypto:" in the subject: > > Patch 1) Adds new acomp request chaining framework and interface based > on Herbert Xu's ahash reference implementation in "[PATCH 2/6] > crypto: hash - Add request chaining API" [1]. acomp algorithms > can use request chaining through these interfaces: > > Setup the request chain: > acomp_reqchain_init() > acomp_request_chain() > > Process the request chain: > acomp_do_req_chain(): synchronously (sequentially) > acomp_do_async_req_chain(): asynchronously using submit/poll > ops (in parallel) > > Patch 2) Adds acomp_alg/crypto_acomp interfaces for batch_compress(), > batch_decompress() and get_batch_size(), that swap modules can > invoke using the new batching API crypto_acomp_batch_compress(), > crypto_acomp_batch_decompress() and crypto_acomp_batch_size(). > Additionally, crypto acomp provides a new > acomp_has_async_batching() interface to query for these API > before allocating batching resources for a given compressor in > zswap/zram. > > Patch 3) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for > async poll mode in iaa_crypto. > > Patch 4) iaa-crypto driver implementations for sync/async > crypto_acomp_batch_compress() and > crypto_acomp_batch_decompress() developed using request > chaining. If the iaa_crypto driver is set up for 'async' > sync_mode, these batching implementations deploy the > asynchronous request chaining implementation. 'async' is the > recommended mode for realizing the benefits of IAA parallelism. > If iaa_crypto is set up for 'sync' sync_mode, the synchronous > version of the request chaining API is used. > > The "iaa_acomp_fixed_deflate" algorithm registers these > implementations for its "batch_compress" and "batch_decompress" > interfaces respectively and opts in with CRYPTO_ALG_REQ_CHAIN. > Further, iaa_crypto provides an implementation for the > "get_batch_size" interface: this returns the > IAA_CRYPTO_MAX_BATCH_SIZE constant that iaa_crypto defines > currently as 8U for IAA compression algorithms (iaa_crypto can > change this if needed as we optimize our batching algorithm). > > Patch 5) Modifies the default iaa_crypto driver mode to async, now that > iaa_crypto provides a truly async mode that gives > significantly better latency than sync mode for the batching > use case. > > Patch 6) Disables verify_compress by default, to facilitate users to > run IAA easily for comparison with software compressors. > > Patch 7) Reorganizes the iaa_crypto driver code into logically related > sections and avoids forward declarations, in order to facilitate > Patch 8. This patch makes no functional changes. > > Patch 8) Makes a major infrastructure change in the iaa_crypto driver, > to map IAA devices/work-queues to cores based on packages > instead of NUMA nodes. This doesn't impact performance on > the Sapphire Rapids system used for performance > testing. However, this change fixes functional problems we > found on Granite Rapids in internal validation, where the > number of NUMA nodes is greater than the number of packages, > which was resulting in over-utilization of some IAA devices > and non-usage of other IAA devices as per the current NUMA > based mapping infrastructure. > This patch also eliminates duplication of device wqs in > per-cpu wq_tables, thereby saving 140MiB on a 384 cores > Granite Rapids server with 8 IAAs. Submitting this change now > so that it can go through code reviews before it can be merged. > > Patch 9) Builds upon the new infrastructure for mapping IAAs to cores > based on packages, and enables configuring a "global_wq" per > IAA, which can be used as a global resource for compress jobs > for the package. If the user configures 2WQs per IAA device, > the driver will distribute compress jobs from all cores on the > package to the "global_wqs" of all the IAA devices on that > package, in a round-robin manner. This can be used to improve > compression throughput for workloads that see a lot of swapout > activity. > > 2) zswap modifications to enable compress batching in zswap_store() > of large folios (including pmd-mappable folios): > > Patch 10) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set > as 8U) to denote the maximum number of acomp_ctx batching > resources. Further, the "struct crypto_acomp_ctx" is modified > to contain a configurable number of acomp_reqs and buffers. > The cpu hotplug onlining code will query > acomp_has_async_batching() and if this returns "true", will > further get the compressor defined maximum batch size, and > will use the minimum of zswap's upper limit and the > compressor's maximum batch size to allocate > acomp_reqs/buffers if the acomp supports batching, and 1 > acomp_req/buffer if not. > > Patch 11) Restructures & simplifies zswap_store() to make it amenable > for batching. Moves the loop over the folio's pages to a new > zswap_store_folio(), which in turn allocates zswap entries > for all folio pages upfront, before proceeding to call a > newly added zswap_compress_folio(), which simply calls > zswap_compress() for each folio page. > > Patch 12) Finally, this patch modifies zswap_compress_folio() to detect > if the pool's acomp_ctx has batching resources. If so, the > "acomp_ctx->nr_reqs" becomes the batch size to use to call > crypto_acomp_batch_compress() for every "acomp_ctx->nr_reqs" > pages in the large folio. The crypto API calls into the new > iaa_crypto "iaa_comp_acompress_batch()" that does batching > with request chaining. Upon successful compression of a > batch, the compressed buffers are stored in zpool. > > With v5 of this patch series, the IAA compress batching feature will be > enabled seamlessly on Intel platforms that have IAA by selecting > 'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async' > sync_mode driver attribute. > > [1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@xxxxxxxxxxxxxxxxxxx/ > > > System setup for testing: > ========================= > Testing of this patch-series was done with mm-unstable as of 12-20-2024, > commit 5555a83c82d6, without and with this patch-series. > Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores > per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk > partition swap. Core frequency was fixed at 2500MHz. > > Other kernel configuration parameters: > > zswap compressor : zstd, deflate-iaa > zswap allocator : zsmalloc > vm.page-cluster : 0, 2 > > IAA "compression verification" is disabled and IAA is run in the async > mode (the defaults with this series). 2WQs are configured per IAA > device. Compress jobs from all cores on a socket are distributed among all > 4 IAA devices on the same socket. > > I ran experiments with these workloads: > > 1) usemem 30 processes with these large folios enabled to "always": > - 16k/32k/64k > - 2048k > > 2) Kernel compilation allmodconfig with 2G max memory, 32 threads, run in > tmpfs with these large folios enabled to "always": > - 16k/32k/64k > > > IAA compress batching performance: sync vs. async request chaining: > =================================================================== > The vm-scalability "usemem" test was run in a cgroup whose memory.high > was fixed at 150G. The is no swap limit set for the cgroup. 30 usemem > processes were run, each allocating and writing 10G of memory, and sleeping > for 10 sec before exiting: > > usemem --init-time -w -O -s 10 -n 30 10g > > "async polling" here refers to the v4 implementation of batch compression > without request chaining, which is used as baseline to compare the request > chaining implementations in v5. > > These are the latencies measured using bcc profiling with bpftrace for the > various iaa_crypto modes: > > ------------------------------------------------------------------------------- > usemem30: 16k/32k/64k Folios crypto_acomp_batch_compress() latency > > iaa_crypto batching count mean p50 p99 > implementation (ns) (ns) (ns) > ------------------------------------------------------------------------------- > > async polling 5,210,702 10,083 9,675 17,488 > > sync request chaining 5,396,532 33,391 32,977 39,426 > > async request chaining 5,509,777 9,959 9,611 16,590 > > ------------------------------------------------------------------------------- > > This demonstrates that async request chaining doesn't cause IAA compress > batching performance regression wrt the v4 implementation without request > chaining. > > > Performance testing (usemem30): > =============================== > The vm-scalability "usemem" test was run in a cgroup whose memory.high > was fixed at 150G. The is no swap limit set for the cgroup. 30 usemem > processes were run, each allocating and writing 10G of memory, and sleeping > for 10 sec before exiting: > > usemem --init-time -w -O -s 10 -n 30 10g > > > 16k/32/64k folios: usemem30: zstd: > ================================== > > ------------------------------------------------------------------------------- > mm-unstable-12-20-2024 v5 of this patch-series > ------------------------------------------------------------------------------- > zswap compressor zstd zstd > vm.page-cluster 2 2 > > ------------------------------------------------------------------------------- > Total throughput (KB/s) 6,143,774 6,180,657 > Avg throughput (KB/s) 204,792 206,021 > elapsed time (sec) 110.45 112.02 > sys time (sec) 2,628.55 2,684.53 > > ------------------------------------------------------------------------------- > memcg_high 469,269 481,665 > memcg_swap_fail 1,198 910 > zswpout 48,932,319 48,931,447 > zswpin 384 398 > pswpout 0 0 > pswpin 0 0 > thp_swpout 0 0 > thp_swpout_fallback 0 0 > 16kB-swpout_fallback 0 0 > 32kB_swpout_fallback 0 0 > 64kB_swpout_fallback 1,198 910 > pgmajfault 3,459 3,090 > swap_ra 96 100 > swap_ra_hit 48 54 > ZSWPOUT-16kB 2 2 > ZSWPOUT-32kB 2 0 > ZSWPOUT-64kB 3,057,060 3,057,286 > SWPOUT-16kB 0 0 > SWPOUT-32kB 0 0 > SWPOUT-64kB 0 0 > ------------------------------------------------------------------------------- > > > 16k/32/64k folios: usemem30: deflate-iaa: > ========================================= > > ------------------------------------------------------------------------------- > mm-unstable-12-20-2024 v5 of this patch-series > ------------------------------------------------------------------------------- > zswap compressor deflate-iaa deflate-iaa IAA Batching > vm.page-cluster 2 2 vs. vs. > Seq zstd > ------------------------------------------------------------------------------- > Total throughput (KB/s) 7,679,064 8,027,314 5% 30% > Avg throughput (KB/s) 255,968 267,577 5% 30% > elapsed time (sec) 90.82 87.53 -4% -22% > sys time (sec) 2,205.73 2,099.80 -5% -22% > > ------------------------------------------------------------------------------- > memcg_high 716,670 722,693 > memcg_swap_fail 1,187 1,251 > zswpout 64,511,695 64,510,499 > zswpin 483 477 > pswpout 0 0 > pswpin 0 0 > thp_swpout 0 0 > thp_swpout_fallback 0 0 > 16kB-swpout_fallback 0 0 > 32kB_swpout_fallback 0 0 > 64kB_swpout_fallback 1,187 1,251 > pgmajfault 3,180 3,187 > swap_ra 175 155 > swap_ra_hit 114 76 > ZSWPOUT-16kB 5 3 > ZSWPOUT-32kB 1 2 > ZSWPOUT-64kB 4,030,709 4,030,573 > SWPOUT-16kB 0 0 > SWPOUT-32kB 0 0 > SWPOUT-64kB 0 0 > ------------------------------------------------------------------------------- > > > 2M folios: usemem30: zstd: > ========================== > > ------------------------------------------------------------------------------- > mm-unstable-12-20-2024 v5 of this patch-series > ------------------------------------------------------------------------------- > zswap compressor zstd zstd > vm.page-cluster 2 2 > > ------------------------------------------------------------------------------- > Total throughput (KB/s) 6,643,427 6,534,525 > Avg throughput (KB/s) 221,447 217,817 > elapsed time (sec) 102.92 104.44 > sys time (sec) 2,332.67 2,415.00 > Seems like there's a regression for zstd here? > ------------------------------------------------------------------------------- > memcg_high 61,999 60,770 > memcg_swap_fail 37 47 > zswpout 48,934,491 48,934,952 > zswpin 386 404 > pswpout 0 0 > pswpin 0 0 > thp_swpout 0 0 > thp_swpout_fallback 37 47 > pgmajfault 5,010 4,646 > swap_ra 5,836 4,692 > swap_ra_hit 5,790 4,640 > ZSWPOUT-2048kB 95,529 95,520 > SWPOUT-2048kB 0 0 > ------------------------------------------------------------------------------- > > > 2M folios: usemem30: deflate-iaa: > ================================= > > ------------------------------------------------------------------------------- > mm-unstable-12-20-2024 v5 of this patch-series > ------------------------------------------------------------------------------- > zswap compressor deflate-iaa deflate-iaa IAA Batching > vm.page-cluster 2 2 vs. vs. > Seq zstd > ------------------------------------------------------------------------------- > Total throughput (KB/s) 8,197,457 8,427,981 3% 29% > Avg throughput (KB/s) 273,248 280,932 3% 29% > elapsed time (sec) 86.79 83.45 -4% -20% > sys time (sec) 2,044.02 1,925.84 -6% -20% > > ------------------------------------------------------------------------------- > memcg_high 94,008 88,809 > memcg_swap_fail 50 57 > zswpout 64,521,910 64,520,405 > zswpin 421 452 > pswpout 0 0 > pswpin 0 0 > thp_swpout 0 0 > thp_swpout_fallback 50 57 > pgmajfault 9,658 8,958 > swap_ra 19,633 17,341 > swap_ra_hit 19,579 17,278 > ZSWPOUT-2048kB 125,916 125,913 > SWPOUT-2048kB 0 0 > ------------------------------------------------------------------------------- > > > Performance testing (Kernel compilation, allmodconfig): > ======================================================= > > The experiments with kernel compilation test, 32 threads, in tmpfs use the > "allmodconfig" that takes ~12 minutes, and has considerable swapout > activity. The cgroup's memory.max is set to 2G. > > > 16k/32k/64k folios: Kernel compilation/allmodconfig: > ==================================================== > w/o: mm-unstable-12-20-2024 > > ------------------------------------------------------------------------------- > w/o v5 w/o v5 > ------------------------------------------------------------------------------- > zswap compressor zstd zstd deflate-iaa deflate-iaa > vm.page-cluster 0 0 0 0 > > ------------------------------------------------------------------------------- > real_sec 792.04 793.92 783.43 766.93 > user_sec 15,781.73 15,772.48 15,753.22 15,766.53 > sys_sec 5,302.83 5,308.05 3,982.30 3,853.21 > ------------------------------------------------------------------------------- > Max_Res_Set_Size_KB 1,871,908 1,873,368 1,871,836 1,873,168 > ------------------------------------------------------------------------------- > memcg_high 0 0 0 0 > memcg_swap_fail 0 0 0 0 > zswpout 90,775,917 91,653,816 106,964,482 110,380,500 > zswpin 26,099,486 26,611,908 31,598,420 32,618,221 > pswpout 48 96 331 331 > pswpin 48 89 320 310 > thp_swpout 0 0 0 0 > thp_swpout_fallback 0 0 0 0 > 16kB_swpout_fallback 0 0 0 0 > 32kB_swpout_fallback 0 0 0 0 > 64kB_swpout_fallback 0 2,337 7,943 5,512 > pgmajfault 27,858,798 28,438,518 33,970,455 34,999,918 > swap_ra 0 0 0 0 > swap_ra_hit 2,173 2,913 2,192 5,248 > ZSWPOUT-16kB 1,292,865 1,306,214 1,463,397 1,483,056 > ZSWPOUT-32kB 695,446 705,451 830,676 829,992 > ZSWPOUT-64kB 2,938,716 2,958,250 3,520,199 3,634,972 > SWPOUT-16kB 0 0 0 0 > SWPOUT-32kB 0 0 0 0 > SWPOUT-64kB 3 6 20 19 > ------------------------------------------------------------------------------- > > > > Summary: > ======== > The performance testing data with usemem 30 processes and kernel > compilation test show 30% throughput gains and 22% sys time reduction > (usemem30) and 27% sys time reduction (kernel compilation) with > zswap_store() large folios using IAA compress batching as compared to > zstd. These improvements seem to mostly come from the change from zstd to deflate-iaa, not from batching (which is the main purpose of the patch series). The gains from the batching seem to be rather low and definitely below what I expected. We should be compressing 8 pages in parallel instead of in series, so why are the gains from the batching marginal? I think the comparison with zstd is rather confusing. Testing with zstd before and after the series to check for regressions is good, but comparing zstd vs IAA with batching is rather confusing/misleading. The impact of this series should be measured by IAA with and without batching. > > The iaa_crypto wq stats will show almost the same number of compress calls > for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively. > We see a latency reduction of 2.5% by distributing compress jobs among all > IAA devices on the socket (based on v1 data). > > We can expect to see even more significant performance and throughput > improvements if we use the parallelism offered by IAA to do reclaim > batching of 4K/large folios (really any-order folios), and using the > zswap_store() high throughput compression to batch-compress pages > comprising these folios, not just batching within large folios. This is the > reclaim batching patch 13 in v1, which will be submitted in a separate > patch-series. > > Our internal validation of IAA compress/decompress batching in highly > contended Sapphire Rapids server setups with workloads running on 72 cores > for ~25 minutes under stringent memory limit constraints have shown up to > 50% reduction in sys time and 3.5% reduction in workload run time as > compared to software compressors. > > > Changes since v4: > ================= > 1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6. > 2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert! > 3) Implemented IAA compress batching using request chaining. > 4) zswap_store() batching simplifications suggested by Chengming, Yosry and > Nhat, thanks to all! > - New zswap_compress_folio() that is called by zswap_store(). > - Move the loop over folio's pages out of zswap_store() and into a > zswap_store_folio() that stores all pages. > - Allocate all zswap entries for the folio upfront. > - Added zswap_batch_compress(). > - Branch to call zswap_compress() or zswap_batch_compress() inside > zswap_compress_folio(). > - All iterations over pages kept in same function level. > - No helpers other than the newly added zswap_store_folio() and > zswap_compress_folio(). > > > Changes since v3: > ================= > 1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6. > 2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores, > based on packages instead of NUMA nodes. > 3) Added acomp_has_async_batching() API to crypto acomp, that allows > zswap/zram to query if a crypto_acomp has registered batch_compress and > batch_decompress interfaces. > 4) Clear the poll bits on the acomp_reqs passed to > iaa_comp_a[de]compress_batch() so that a module like zswap can be > confident about the acomp_reqs[0] not having the poll bit set before > calling the fully synchronous API crypto_acomp_[de]compress(). > Herbert, I would appreciate it if you can review changes 2-4; in patches > 1-8 in v4. I did not want to introduce too many iaa_crypto changes in > v4, given that patch 7 is already making a major change. I plan to work > on incorporating the request chaining using the ahash interface in v5 > (I need to understand the basic crypto ahash better). Thanks Herbert! > 5) Incorporated Johannes' suggestion to not have a sysctl to enable > compress batching. > 6) Incorporated Yosry's suggestion to allocate batching resources in the > cpu hotplug onlining code, since there is no longer a sysctl to control > batching. Thanks Yosry! > 7) Incorporated Johannes' suggestions related to making the overall > sequence of events between zswap_store() and zswap_batch_store() similar > as much as possible for readability and control flow, better naming of > procedures, avoiding forward declarations, not inlining error path > procedures, deleting zswap internal details from zswap.h, etc. Thanks > Johannes, really appreciate the direction! > I have tried to explain the minimal future-proofing in terms of the > zswap_batch_store() signature and the definition of "struct > zswap_batch_store_sub_batch" in the comments for this struct. I hope the > new code explains the control flow a bit better. > > > Changes since v2: > ================= > 1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8. > 2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL > returned by kmalloc_node() for acomp_ctx->buffers and for > acomp_ctx->reqs. > 3) Fixed a bug in zswap_pool_can_batch() for returning true if > pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if > the per-cpu acomp_batch_ctx tests true for batching resources having > been allocated on this cpu. Also, changed from per_cpu_ptr() to > raw_cpu_ptr(). > 4) Incorporated the zswap_store_propagate_errors() compilation warning fix > suggested by Dan Carpenter. Thanks Dan! > 5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in > zswap.h, with SWAP_CRYPTO_BATCH_SIZE. > > Changes since v1: > ================= > 1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702. > 2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate > async/poll mode, and to encapsulate the polling functionality in the > iaa_crypto driver. Thanks Herbert! > 3) Incorporated Herbert's and Yosry's suggestions to implement the batching > API in iaa_crypto and to make its use seamless from zswap's > perspective. Thanks Herbert and Yosry! > 4) Incorporated Yosry's suggestion to make it more convenient for the user > to enable compress batching, while minimizing the memory footprint > cost. Thanks Yosry! > 5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list() > reclaim batching patch from this series, since it requires a broader > discussion. > > > I would greatly appreciate code review comments for the iaa_crypto driver > and mm patches included in this series! > > Thanks, > Kanchana > > > > Kanchana P Sridhar (12): > crypto: acomp - Add synchronous/asynchronous acomp request chaining. > crypto: acomp - Define new interfaces for compress/decompress > batching. > crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable > async mode. > crypto: iaa - Implement batch_compress(), batch_decompress() API in > iaa_crypto. > crypto: iaa - Make async mode the default. > crypto: iaa - Disable iaa_verify_compress by default. > crypto: iaa - Re-organize the iaa_crypto driver code. > crypto: iaa - Map IAA devices/wqs to cores based on packages instead > of NUMA. > crypto: iaa - Distribute compress jobs from all cores to all IAAs on a > package. > mm: zswap: Allocate pool batching resources if the crypto_alg supports > batching. > mm: zswap: Restructure & simplify zswap_store() to make it amenable > for batching. > mm: zswap: Compress batching with Intel IAA in zswap_store() of large > folios. > > crypto/acompress.c | 287 ++++ > drivers/crypto/intel/iaa/iaa_crypto.h | 27 +- > drivers/crypto/intel/iaa/iaa_crypto_main.c | 1697 +++++++++++++++----- > include/crypto/acompress.h | 157 ++ > include/crypto/algapi.h | 10 + > include/crypto/internal/acompress.h | 29 + > include/linux/crypto.h | 31 + > mm/zswap.c | 406 +++-- > 8 files changed, 2103 insertions(+), 541 deletions(-) > > > base-commit: 5555a83c82d66729e4abaf16ae28d6bd81f9a64a > -- > 2.27.0 >