From: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx> Subject: zram: switch to crypto compress API We don't have an idle zstreams list anymore and our write path now works absolutely differently, preventing preemption during compression. This removes possibilities of read paths preempting writes at wrong places (which could badly affect the performance of both paths) and at the same time opens the door for a move from custom LZO/LZ4 compression backends implementation to a more generic one, using crypto compress API. Joonsoo Kim [1] attempted to do this a while ago, but faced with the need of introducing a new crypto API interface. The root cause was the fact that crypto API compression algorithms require a compression stream structure (in zram terminology) for both compression and decompression ops, while in reality only several of compression algorithms really need it. This resulted in a concept of context-less crypto API compression backends [2]. Both write and read paths, though, would have been executed with the preemption enabled, which in the worst case could have resulted in a decreased worst-case performance, e.g. consider the following case: CPU0 zram_write() spin_lock() take the last idle stream spin_unlock() << preempted >> zram_read() spin_lock() no idle streams spin_unlock() schedule() resuming zram_write compression() but it took me some time to realize that, and it took even longer to evolve zram and to make it ready for crypto API. The key turned out to be -- drop the idle streams list entirely. Without the idle streams list we are free to use compression algorithms that require compression stream for decompression (read), because streams are now placed in per-cpu data and each write path has to disable preemption for compression op, almost completely eliminating the aforementioned case (technically, we still have a small chance, because write path has a fast and a slow paths and the slow path is executed with the preemption enabled; but the frequency of failed fast path is too low). TEST ==== - 4 CPUs, x86_64 system - 3G zram, lzo - fio tests: read, randread, write, randwrite, rw, randrw test script [3] command: ZRAM_SIZE=3G LOG_SUFFIX=XXXX FIO_LOOPS=5 ./zram-fio-test.sh BASE PATCHED jobs1 READ: 2527.2MB/s 2482.7MB/s READ: 2102.7MB/s 2045.0MB/s WRITE: 1284.3MB/s 1324.3MB/s WRITE: 1080.7MB/s 1101.9MB/s READ: 430125KB/s 437498KB/s WRITE: 430538KB/s 437919KB/s READ: 399593KB/s 403987KB/s WRITE: 399910KB/s 404308KB/s jobs2 READ: 8133.5MB/s 7854.8MB/s READ: 7086.6MB/s 6912.8MB/s WRITE: 3177.2MB/s 3298.3MB/s WRITE: 2810.2MB/s 2871.4MB/s READ: 1017.6MB/s 1023.4MB/s WRITE: 1018.2MB/s 1023.1MB/s READ: 977836KB/s 984205KB/s WRITE: 979435KB/s 985814KB/s jobs3 READ: 13557MB/s 13391MB/s READ: 11876MB/s 11752MB/s WRITE: 4641.5MB/s 4682.1MB/s WRITE: 4164.9MB/s 4179.3MB/s READ: 1453.8MB/s 1455.1MB/s WRITE: 1455.1MB/s 1458.2MB/s READ: 1387.7MB/s 1395.7MB/s WRITE: 1386.1MB/s 1394.9MB/s jobs4 READ: 20271MB/s 20078MB/s READ: 18033MB/s 17928MB/s WRITE: 6176.8MB/s 6180.5MB/s WRITE: 5686.3MB/s 5705.3MB/s READ: 2009.4MB/s 2006.7MB/s WRITE: 2007.5MB/s 2004.9MB/s READ: 1929.7MB/s 1935.6MB/s WRITE: 1926.8MB/s 1932.6MB/s jobs5 READ: 18823MB/s 19024MB/s READ: 18968MB/s 19071MB/s WRITE: 6191.6MB/s 6372.1MB/s WRITE: 5818.7MB/s 5787.1MB/s READ: 2011.7MB/s 1981.3MB/s WRITE: 2011.4MB/s 1980.1MB/s READ: 1949.3MB/s 1935.7MB/s WRITE: 1940.4MB/s 1926.1MB/s jobs6 READ: 21870MB/s 21715MB/s READ: 19957MB/s 19879MB/s WRITE: 6528.4MB/s 6537.6MB/s WRITE: 6098.9MB/s 6073.6MB/s READ: 2048.6MB/s 2049.9MB/s WRITE: 2041.7MB/s 2042.9MB/s READ: 2013.4MB/s 1990.4MB/s WRITE: 2009.4MB/s 1986.5MB/s jobs7 READ: 21359MB/s 21124MB/s READ: 19746MB/s 19293MB/s WRITE: 6660.4MB/s 6518.8MB/s WRITE: 6211.6MB/s 6193.1MB/s READ: 2089.7MB/s 2080.6MB/s WRITE: 2085.8MB/s 2076.5MB/s READ: 2041.2MB/s 2052.5MB/s WRITE: 2037.5MB/s 2048.8MB/s jobs8 READ: 20477MB/s 19974MB/s READ: 18922MB/s 18576MB/s WRITE: 6851.9MB/s 6788.3MB/s WRITE: 6407.7MB/s 6347.5MB/s READ: 2134.8MB/s 2136.1MB/s WRITE: 2132.8MB/s 2134.4MB/s READ: 2074.2MB/s 2069.6MB/s WRITE: 2087.3MB/s 2082.4MB/s jobs9 READ: 19797MB/s 19994MB/s READ: 18806MB/s 18581MB/s WRITE: 6878.7MB/s 6822.7MB/s WRITE: 6456.8MB/s 6447.2MB/s READ: 2141.1MB/s 2154.7MB/s WRITE: 2144.4MB/s 2157.3MB/s READ: 2084.1MB/s 2085.1MB/s WRITE: 2091.5MB/s 2092.5MB/s jobs10 READ: 19794MB/s 19784MB/s READ: 18794MB/s 18745MB/s WRITE: 6984.4MB/s 6676.3MB/s WRITE: 6532.3MB/s 6342.7MB/s READ: 2150.6MB/s 2155.4MB/s WRITE: 2156.8MB/s 2161.5MB/s READ: 2106.4MB/s 2095.6MB/s WRITE: 2109.7MB/s 2098.4MB/s BASE PATCHED jobs1 perfstat stalled-cycles-frontend 102,480,595,419 ( 41.53%) 114,508,864,804 ( 46.92%) stalled-cycles-backend 51,941,417,832 ( 21.05%) 46,836,112,388 ( 19.19%) instructions 283,612,054,215 ( 1.15) 283,918,134,959 ( 1.16) branches 56,372,560,385 ( 724.923) 56,449,814,753 ( 733.766) branch-misses 374,826,000 ( 0.66%) 326,935,859 ( 0.58%) jobs2 perfstat stalled-cycles-frontend 155,142,745,777 ( 40.99%) 164,170,979,198 ( 43.82%) stalled-cycles-backend 70,813,866,387 ( 18.71%) 66,456,858,165 ( 17.74%) instructions 463,436,648,173 ( 1.22) 464,221,890,191 ( 1.24) branches 91,088,733,902 ( 760.088) 91,278,144,546 ( 769.133) branch-misses 504,460,363 ( 0.55%) 394,033,842 ( 0.43%) jobs3 perfstat stalled-cycles-frontend 201,300,397,212 ( 39.84%) 223,969,902,257 ( 44.44%) stalled-cycles-backend 87,712,593,974 ( 17.36%) 81,618,888,712 ( 16.19%) instructions 642,869,545,023 ( 1.27) 644,677,354,132 ( 1.28) branches 125,724,560,594 ( 690.682) 126,133,159,521 ( 694.542) branch-misses 527,941,798 ( 0.42%) 444,782,220 ( 0.35%) jobs4 perfstat stalled-cycles-frontend 246,701,197,429 ( 38.12%) 280,076,030,886 ( 43.29%) stalled-cycles-backend 119,050,341,112 ( 18.40%) 110,955,641,671 ( 17.15%) instructions 822,716,962,127 ( 1.27) 825,536,969,320 ( 1.28) branches 160,590,028,545 ( 688.614) 161,152,996,915 ( 691.068) branch-misses 650,295,287 ( 0.40%) 550,229,113 ( 0.34%) jobs5 perfstat stalled-cycles-frontend 298,958,462,516 ( 38.30%) 344,852,200,358 ( 44.16%) stalled-cycles-backend 137,558,742,122 ( 17.62%) 129,465,067,102 ( 16.58%) instructions 1,005,714,688,752 ( 1.29) 1,007,657,999,432 ( 1.29) branches 195,988,773,962 ( 697.730) 196,446,873,984 ( 700.319) branch-misses 695,818,940 ( 0.36%) 624,823,263 ( 0.32%) jobs6 perfstat stalled-cycles-frontend 334,497,602,856 ( 36.71%) 387,590,419,779 ( 42.38%) stalled-cycles-backend 163,539,365,335 ( 17.95%) 152,640,193,639 ( 16.69%) instructions 1,184,738,177,851 ( 1.30) 1,187,396,281,677 ( 1.30) branches 230,592,915,640 ( 702.902) 231,253,802,882 ( 702.356) branch-misses 747,934,786 ( 0.32%) 643,902,424 ( 0.28%) jobs7 perfstat stalled-cycles-frontend 396,724,684,187 ( 37.71%) 460,705,858,952 ( 43.84%) stalled-cycles-backend 188,096,616,496 ( 17.88%) 175,785,787,036 ( 16.73%) instructions 1,364,041,136,608 ( 1.30) 1,366,689,075,112 ( 1.30) branches 265,253,096,936 ( 700.078) 265,890,524,883 ( 702.839) branch-misses 784,991,589 ( 0.30%) 729,196,689 ( 0.27%) jobs8 perfstat stalled-cycles-frontend 440,248,299,870 ( 36.92%) 509,554,793,816 ( 42.46%) stalled-cycles-backend 222,575,930,616 ( 18.67%) 213,401,248,432 ( 17.78%) instructions 1,542,262,045,114 ( 1.29) 1,545,233,932,257 ( 1.29) branches 299,775,178,439 ( 697.666) 300,528,458,505 ( 694.769) branch-misses 847,496,084 ( 0.28%) 748,794,308 ( 0.25%) jobs9 perfstat stalled-cycles-frontend 506,269,882,480 ( 37.86%) 592,798,032,820 ( 44.43%) stalled-cycles-backend 253,192,498,861 ( 18.93%) 233,727,666,185 ( 17.52%) instructions 1,721,985,080,913 ( 1.29) 1,724,666,236,005 ( 1.29) branches 334,517,360,255 ( 694.134) 335,199,758,164 ( 697.131) branch-misses 873,496,730 ( 0.26%) 815,379,236 ( 0.24%) jobs10 perfstat stalled-cycles-frontend 549,063,363,749 ( 37.18%) 651,302,376,662 ( 43.61%) stalled-cycles-backend 281,680,986,810 ( 19.07%) 277,005,235,582 ( 18.55%) instructions 1,901,859,271,180 ( 1.29) 1,906,311,064,230 ( 1.28) branches 369,398,536,153 ( 694.004) 370,527,696,358 ( 688.409) branch-misses 967,929,335 ( 0.26%) 890,125,056 ( 0.24%) BASE PATCHED seconds elapsed 79.421641008 78.735285546 seconds elapsed 61.471246133 60.869085949 seconds elapsed 62.317058173 62.224188495 seconds elapsed 60.030739363 60.081102518 seconds elapsed 74.070398362 74.317582865 seconds elapsed 84.985953007 85.414364176 seconds elapsed 97.724553255 98.173311344 seconds elapsed 109.488066758 110.268399318 seconds elapsed 122.768189405 122.967164498 seconds elapsed 135.130035105 136.934770801 On my other system (8 x86_64 CPUs, short version of test results): BASE PATCHED seconds elapsed 19.518065994 19.806320662 seconds elapsed 15.172772749 15.594718291 seconds elapsed 13.820925970 13.821708564 seconds elapsed 13.293097816 14.585206405 seconds elapsed 16.207284118 16.064431606 seconds elapsed 17.958376158 17.771825767 seconds elapsed 19.478009164 19.602961508 seconds elapsed 21.347152811 21.352318709 seconds elapsed 24.478121126 24.171088735 seconds elapsed 26.865057442 26.767327618 So performance-wise the numbers are quite similar. Also update zcomp interface to be more aligned with the crypto API. [1] http://marc.info/?l=linux-kernel&m=144480832108927&w=2 [2] http://marc.info/?l=linux-kernel&m=145379613507518&w=2 [3] https://github.com/sergey-senozhatsky/zram-perf-test Link: http://lkml.kernel.org/r/20160531122017.2878-3-sergey.senozhatsky@xxxxxxxxx Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx> Suggested-by: Minchan Kim <minchan@xxxxxxxxxx> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> Acked-by: Minchan Kim <minchan@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- drivers/block/zram/Kconfig | 10 +--- drivers/block/zram/zcomp.c | 76 +++++++++++++++++++------------- drivers/block/zram/zcomp.h | 17 ++----- drivers/block/zram/zram_drv.c | 18 ++++--- 4 files changed, 69 insertions(+), 52 deletions(-) diff -puN drivers/block/zram/Kconfig~zram-switch-to-crypto-compress-api drivers/block/zram/Kconfig --- a/drivers/block/zram/Kconfig~zram-switch-to-crypto-compress-api +++ a/drivers/block/zram/Kconfig @@ -1,8 +1,7 @@ config ZRAM tristate "Compressed RAM block device support" - depends on BLOCK && SYSFS && ZSMALLOC - select LZO_COMPRESS - select LZO_DECOMPRESS + depends on BLOCK && SYSFS && ZSMALLOC && CRYPTO + select CRYPTO_LZO default n help Creates virtual block devices called /dev/zramX (X = 0, 1, ...). @@ -18,9 +17,8 @@ config ZRAM config ZRAM_LZ4_COMPRESS bool "Enable LZ4 algorithm support" depends on ZRAM - select LZ4_COMPRESS - select LZ4_DECOMPRESS + select CRYPTO_LZ4 default n help This option enables LZ4 compression algorithm support. Compression - algorithm can be changed using `comp_algorithm' device attribute. \ No newline at end of file + algorithm can be changed using `comp_algorithm' device attribute. diff -puN drivers/block/zram/zcomp.c~zram-switch-to-crypto-compress-api drivers/block/zram/zcomp.c --- a/drivers/block/zram/zcomp.c~zram-switch-to-crypto-compress-api +++ a/drivers/block/zram/zcomp.c @@ -14,42 +14,39 @@ #include <linux/wait.h> #include <linux/sched.h> #include <linux/cpu.h> +#include <linux/crypto.h> #include "zcomp.h" -#include "zcomp_lzo.h" -#ifdef CONFIG_ZRAM_LZ4_COMPRESS -#include "zcomp_lz4.h" -#endif -static struct zcomp_backend *backends[] = { - &zcomp_lzo, +static const char * const backends[] = { + "lzo", #ifdef CONFIG_ZRAM_LZ4_COMPRESS - &zcomp_lz4, + "lz4", #endif NULL }; -static struct zcomp_backend *find_backend(const char *compress) +static const char *find_backend(const char *compress) { int i = 0; while (backends[i]) { - if (sysfs_streq(compress, backends[i]->name)) + if (sysfs_streq(compress, backends[i])) break; i++; } return backends[i]; } -static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm) +static void zcomp_strm_free(struct zcomp_strm *zstrm) { - if (zstrm->private) - comp->backend->destroy(zstrm->private); + if (!IS_ERR_OR_NULL(zstrm->tfm)) + crypto_free_comp(zstrm->tfm); free_pages((unsigned long)zstrm->buffer, 1); kfree(zstrm); } /* - * allocate new zcomp_strm structure with ->private initialized by + * allocate new zcomp_strm structure with ->tfm initialized by * backend, return NULL on error */ static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp, gfp_t flags) @@ -58,14 +55,14 @@ static struct zcomp_strm *zcomp_strm_all if (!zstrm) return NULL; - zstrm->private = comp->backend->create(flags); + zstrm->tfm = crypto_alloc_comp(comp->name, 0, 0); /* * allocate 2 pages. 1 for compressed data, plus 1 extra for the * case when compressed size is larger than the original one */ zstrm->buffer = (void *)__get_free_pages(flags | __GFP_ZERO, 1); - if (!zstrm->private || !zstrm->buffer) { - zcomp_strm_free(comp, zstrm); + if (IS_ERR_OR_NULL(zstrm->tfm) || !zstrm->buffer) { + zcomp_strm_free(zstrm); zstrm = NULL; } return zstrm; @@ -78,12 +75,12 @@ ssize_t zcomp_available_show(const char int i = 0; while (backends[i]) { - if (!strcmp(comp, backends[i]->name)) + if (!strcmp(comp, backends[i])) sz += scnprintf(buf + sz, PAGE_SIZE - sz - 2, - "[%s] ", backends[i]->name); + "[%s] ", backends[i]); else sz += scnprintf(buf + sz, PAGE_SIZE - sz - 2, - "%s ", backends[i]->name); + "%s ", backends[i]); i++; } sz += scnprintf(buf + sz, PAGE_SIZE - sz, "\n"); @@ -105,17 +102,38 @@ void zcomp_stream_put(struct zcomp *comp put_cpu_ptr(comp->stream); } -int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm, - const unsigned char *src, size_t *dst_len) +int zcomp_compress(struct zcomp_strm *zstrm, + const void *src, unsigned int *dst_len) { - return comp->backend->compress(src, zstrm->buffer, dst_len, - zstrm->private); + /* + * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized + * because sometimes we can endup having a bigger compressed data + * due to various reasons: for example compression algorithms tend + * to add some padding to the compressed buffer. Speaking of padding, + * comp algorithm `842' pads the compressed length to multiple of 8 + * and returns -ENOSP when the dst memory is not big enough, which + * is not something that ZRAM wants to see. We can handle the + * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we + * receive -ERRNO from the compressing backend we can't help it + * anymore. To make `842' happy we need to tell the exact size of + * the dst buffer, zram_drv will take care of the fact that + * compressed buffer is too big. + */ + *dst_len = PAGE_SIZE * 2; + + return crypto_comp_compress(zstrm->tfm, + src, PAGE_SIZE, + zstrm->buffer, dst_len); } -int zcomp_decompress(struct zcomp *comp, const unsigned char *src, - size_t src_len, unsigned char *dst) +int zcomp_decompress(struct zcomp_strm *zstrm, + const void *src, unsigned int src_len, void *dst) { - return comp->backend->decompress(src, src_len, dst); + unsigned int dst_len = PAGE_SIZE; + + return crypto_comp_decompress(zstrm->tfm, + src, src_len, + dst, &dst_len); } static int __zcomp_cpu_notifier(struct zcomp *comp, @@ -138,7 +156,7 @@ static int __zcomp_cpu_notifier(struct z case CPU_UP_CANCELED: zstrm = *per_cpu_ptr(comp->stream, cpu); if (!IS_ERR_OR_NULL(zstrm)) - zcomp_strm_free(comp, zstrm); + zcomp_strm_free(zstrm); *per_cpu_ptr(comp->stream, cpu) = NULL; break; default: @@ -209,7 +227,7 @@ void zcomp_destroy(struct zcomp *comp) struct zcomp *zcomp_create(const char *compress) { struct zcomp *comp; - struct zcomp_backend *backend; + const char *backend; int error; backend = find_backend(compress); @@ -220,7 +238,7 @@ struct zcomp *zcomp_create(const char *c if (!comp) return ERR_PTR(-ENOMEM); - comp->backend = backend; + comp->name = backend; error = zcomp_init(comp); if (error) { kfree(comp); diff -puN drivers/block/zram/zcomp.h~zram-switch-to-crypto-compress-api drivers/block/zram/zcomp.h --- a/drivers/block/zram/zcomp.h~zram-switch-to-crypto-compress-api +++ a/drivers/block/zram/zcomp.h @@ -13,12 +13,7 @@ struct zcomp_strm { /* compression/decompression buffer */ void *buffer; - /* - * The private data of the compression stream, only compression - * stream backend can touch this (e.g. compression algorithm - * working memory) - */ - void *private; + struct crypto_comp *tfm; }; /* static compression backend */ @@ -40,6 +35,8 @@ struct zcomp { struct zcomp_strm * __percpu *stream; struct zcomp_backend *backend; struct notifier_block notifier; + + const char *name; }; ssize_t zcomp_available_show(const char *comp, char *buf); @@ -51,11 +48,11 @@ void zcomp_destroy(struct zcomp *comp); struct zcomp_strm *zcomp_stream_get(struct zcomp *comp); void zcomp_stream_put(struct zcomp *comp); -int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm, - const unsigned char *src, size_t *dst_len); +int zcomp_compress(struct zcomp_strm *zstrm, + const void *src, unsigned int *dst_len); -int zcomp_decompress(struct zcomp *comp, const unsigned char *src, - size_t src_len, unsigned char *dst); +int zcomp_decompress(struct zcomp_strm *zstrm, + const void *src, unsigned int src_len, void *dst); bool zcomp_set_max_streams(struct zcomp *comp, int num_strm); #endif /* _ZCOMP_H_ */ diff -puN drivers/block/zram/zram_drv.c~zram-switch-to-crypto-compress-api drivers/block/zram/zram_drv.c --- a/drivers/block/zram/zram_drv.c~zram-switch-to-crypto-compress-api +++ a/drivers/block/zram/zram_drv.c @@ -563,7 +563,7 @@ static int zram_decompress_page(struct z unsigned char *cmem; struct zram_meta *meta = zram->meta; unsigned long handle; - size_t size; + unsigned int size; bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value); handle = meta->table[index].handle; @@ -576,10 +576,14 @@ static int zram_decompress_page(struct z } cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_RO); - if (size == PAGE_SIZE) + if (size == PAGE_SIZE) { copy_page(mem, cmem); - else - ret = zcomp_decompress(zram->comp, cmem, size, mem); + } else { + struct zcomp_strm *zstrm = zcomp_stream_get(zram->comp); + + ret = zcomp_decompress(zstrm, cmem, size, mem); + zcomp_stream_put(zram->comp); + } zs_unmap_object(meta->mem_pool, handle); bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value); @@ -646,7 +650,7 @@ static int zram_bvec_write(struct zram * int offset) { int ret = 0; - size_t clen; + unsigned int clen; unsigned long handle = 0; struct page *page; unsigned char *user_mem, *cmem, *src, *uncmem = NULL; @@ -696,7 +700,7 @@ compress_again: } zstrm = zcomp_stream_get(zram->comp); - ret = zcomp_compress(zram->comp, zstrm, uncmem, &clen); + ret = zcomp_compress(zstrm, uncmem, &clen); if (!is_partial_io(bvec)) { kunmap_atomic(user_mem); user_mem = NULL; @@ -744,7 +748,7 @@ compress_again: if (handle) goto compress_again; - pr_err("Error allocating memory for compressed page: %u, size=%zu\n", + pr_err("Error allocating memory for compressed page: %u, size=%u\n", index, clen); ret = -ENOMEM; goto out; _ -- To unsubscribe from this list: send the line "unsubscribe mm-commits" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html