On Sun, 2019-02-24 at 17:13 +0000, Markus Stockhausen wrote:
> > From: Hristo Venev [hristo@xxxxxxxxxx]
> > Sent: Saturday, 23 February 2019 22:21
> > To: Jens Axboe
> > Cc: NeilBrown; Markus Stockhausen; linux-raid@xxxxxxxxxxxxxxx
> > Subject: [PATCH] lib/raid6: possibly use different gen and xor
> > algorithms.
> >
> > The performance measurement of xor() was introduced in
> > fe5cbc6e06c7d8b3, but its result was unused. Given that all
> > implementations should give the same output, it makes sense to use
> > the best function for each operation.
>
> Hi Hristo,
>
> I just want to remind you that the xor function speed test might be a
> bit tricky. While the generation always must calculate the full
> stripe, the xor pages vary from request to request.
>
> You can play around with the start/stop values to get a better idea
> of whether the xor function will be comparable for different test
> setups, e.g.
>
> int start = 0, stop = (disks>>1)-1;

Hi Markus,

I just saw that I broke the userspace test because I didn't include
<stdbool.h>. I will fix that in the next version of the patch.

That being said, I ran some tests (results below). It probably makes
sense to run xor_syndrome() multiple times per iteration with
different start and stop values. Given that xor_syndrome() is usually
called for short writes, I'd probably run it on both halves of the
disks and then on the 4 quarters; see the sketch after the first
batch of results. What do you think?

Test results for `stop = disks-3` (ratios are relative to the x4
variant):

AMD Ryzen 2700X:

  start = disks - 4
    avx2x1: 16713 MB/s - 0.84x
    avx2x2: 23702 MB/s - 1.19x
    avx2x4: 19837 MB/s - 1.00x

  start = (disks>>1) - 1
    avx2x1: 17116 MB/s - 1.03x
    avx2x2: 18787 MB/s - 1.14x
    avx2x4: 16547 MB/s - 1.00x

  start = (disks>>2) - 1
    avx2x1: 12244 MB/s - 0.84x
    avx2x2: 16257 MB/s - 1.11x
    avx2x4: 14647 MB/s - 1.00x

  start = 1
    avx2x1: 11824 MB/s - 0.96x
    avx2x2: 15225 MB/s - 1.23x
    avx2x4: 12367 MB/s - 1.00x

  start = 0
    avx2x1: 11153 MB/s - 0.85x
    avx2x2: 14868 MB/s - 1.14x
    avx2x4: 13050 MB/s - 1.00x

Intel Core i7-7500U:

  start = disks - 4
    avx2x1: 21692 MB/s - 0.76x
    avx2x2: 27447 MB/s - 0.96x
    avx2x4: 28553 MB/s - 1.00x

  start = (disks>>1) - 1
    avx2x1: 18453 MB/s - 0.79x
    avx2x2: 20117 MB/s - 0.86x
    avx2x4: 23304 MB/s - 1.00x

  start = (disks>>2) - 1
    avx2x1: 15703 MB/s - 0.85x
    avx2x2: 16850 MB/s - 0.92x
    avx2x4: 18390 MB/s - 1.00x

  start = 1
    avx2x1: 14777 MB/s - 0.87x
    avx2x2: 15835 MB/s - 0.94x
    avx2x4: 16921 MB/s - 1.00x

  start = 0
    avx2x1: 14206 MB/s - 0.89x
    avx2x2: 15409 MB/s - 0.96x
    avx2x4: 16012 MB/s - 1.00x

Intel Atom C3955:

  start = disks - 4
    sse2x1: 4004 MB/s - 1.11x
    sse2x2: 5823 MB/s - 1.62x
    sse2x4: 3599 MB/s - 1.00x

  start = (disks>>1) - 1
    sse2x1: 3114 MB/s - 1.20x
    sse2x2: 3722 MB/s - 1.44x
    sse2x4: 2587 MB/s - 1.00x

  start = (disks>>2) - 1
    sse2x1: 2121 MB/s - 1.05x
    sse2x2: 2565 MB/s - 1.27x
    sse2x4: 2022 MB/s - 1.00x

  start = 1
    sse2x1: 1978 MB/s - 1.01x
    sse2x2: 2429 MB/s - 1.24x
    sse2x4: 1966 MB/s - 1.00x

  start = 0
    sse2x1: 1937 MB/s - 1.04x
    sse2x2: 2349 MB/s - 1.26x
    sse2x4: 1860 MB/s - 1.00x

For smaller `stop`, x2 becomes faster than x4 on all machines I
tested.
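To make the multi-range idea concrete, one benchmark iteration could
do something like the following. This is an untested sketch: the
xor_syndrome(disks, start, stop, bytes, ptrs) signature is the one
from lib/raid6 (start/stop are inclusive data-disk indices), while
bench_xor_iteration() and the disks/bytes/ptrs setup are placeholders
for whatever the timing loop already has.

#include <linux/raid/pq.h>

/*
 * Untested sketch: exercise xor_syndrome() over both halves of the
 * data disks and then over the four quarters, instead of one fixed
 * (start, stop) range.  Data disks are indexed 0 .. disks-3.
 * Assumes enough data disks that no quarter is empty (d >= 4).
 */
static void bench_xor_iteration(const struct raid6_calls *alg,
                                int disks, size_t bytes, void **ptrs)
{
        int d = disks - 2;      /* number of data disks */

        /* the two halves */
        alg->xor_syndrome(disks, 0, d / 2 - 1, bytes, ptrs);
        alg->xor_syndrome(disks, d / 2, d - 1, bytes, ptrs);

        /* the four quarters (rounding is rough on purpose) */
        alg->xor_syndrome(disks, 0, d / 4 - 1, bytes, ptrs);
        alg->xor_syndrome(disks, d / 4, d / 2 - 1, bytes, ptrs);
        alg->xor_syndrome(disks, d / 2, 3 * d / 4 - 1, bytes, ptrs);
        alg->xor_syndrome(disks, 3 * d / 4, d - 1, bytes, ptrs);
}

That way each candidate gets timed on a mix of range widths, which
should be closer to how md actually calls xor_syndrome() for short
writes than any single fixed range.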
Tests with `stop = (disks>>1) - 1`:

AMD Ryzen 2700X:

  start = disks>>2
    avx2x1: 31449 MB/s - 1.44x
    avx2x2: 32975 MB/s - 1.51x
    avx2x4: 21789 MB/s - 1.00x

  start = 0
    avx2x1: 24260 MB/s - 1.07x
    avx2x2: 25347 MB/s - 1.11x
    avx2x4: 22775 MB/s - 1.00x

Intel Core i7-7500U:

  start = disks>>2
    avx2x1: 35639 MB/s - 1.01x
    avx2x2: 42438 MB/s - 1.21x
    avx2x4: 35146 MB/s - 1.00x

  start = 0
    avx2x1: 28471 MB/s - 1.09x
    avx2x2: 28736 MB/s - 1.10x
    avx2x4: 26075 MB/s - 1.00x

Intel Atom C3955:

  start = disks>>2
    sse2x1: 6461 MB/s - 1.88x
    sse2x2: 7548 MB/s - 2.20x
    sse2x4: 3435 MB/s - 1.00x

  start = 0
    sse2x1: 4155 MB/s - 1.59x
    sse2x2: 4522 MB/s - 1.73x
    sse2x4: 2612 MB/s - 1.00x

> Best regards.
>
> Markus
>
> > For example, on my machine more unrolling can benefit gen but not
> > xor:
> >
> > raid6: sse2x1 gen()  9560 MB/s
> > raid6: sse2x1 xor()  7021 MB/s
> > raid6: sse2x2 gen() 11741 MB/s
> > raid6: sse2x2 xor()  8111 MB/s
> > raid6: sse2x4 gen() 13801 MB/s
> > raid6: sse2x4 xor()  8002 MB/s
> > raid6: avx2x1 gen() 19298 MB/s
> > raid6: avx2x1 xor() 13780 MB/s
> > raid6: avx2x2 gen() 23303 MB/s
> > raid6: avx2x2 xor() 15258 MB/s
> > raid6: avx2x4 gen() 27255 MB/s
> > raid6: avx2x4 xor() 14617 MB/s
> > raid6: using algorithm avx2x4 gen() 27255 MB/s
> > raid6: and algorithm avx2x2 xor() 15258 MB/s, rmw enabled
> >
> > Signed-off-by: Hristo Venev <hristo@xxxxxxxxxx>
> > ...
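For anyone skimming the thread, the gist of the quoted patch is
roughly the following. This is a simplified sketch, not the actual
diff: bench_gen()/bench_xor() are hypothetical stand-ins for the
existing jiffies-based timing loops in lib/raid6/algos.c, and the real
code handles more cases (the "prefer" flag, fallback, etc.).

#include <linux/raid/pq.h>

/* hypothetical stand-ins for the existing timing loops (MB/s) */
unsigned long bench_gen(const struct raid6_calls *alg);
unsigned long bench_xor(const struct raid6_calls *alg);

/*
 * Benchmark every usable algorithm and remember the fastest
 * gen_syndrome() and the fastest xor_syndrome() independently,
 * instead of picking one algorithm for both operations.
 */
static void select_best(const struct raid6_calls **best_gen,
                        const struct raid6_calls **best_xor)
{
        const struct raid6_calls *const *algo;
        unsigned long gen_top = 0, xor_top = 0;

        *best_gen = *best_xor = NULL;

        for (algo = raid6_algos; *algo; algo++) {
                unsigned long gen_speed, xor_speed;

                /* skip algorithms the CPU doesn't support */
                if ((*algo)->valid && !(*algo)->valid())
                        continue;

                gen_speed = bench_gen(*algo);
                xor_speed = bench_xor(*algo);

                if (gen_speed > gen_top) {
                        gen_top = gen_speed;
                        *best_gen = *algo;
                }
                if (xor_speed > xor_top) {
                        xor_top = xor_speed;
                        *best_xor = *algo;
                }
        }
}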