On Sun, 2019-02-24 at 17:13 +0000, Markus Stockhausen wrote:
> > From: Hristo Venev [hristo@xxxxxxxxxx]
> > Sent: Saturday, 23 February 2019 22:21
> > To: Jens Axboe
> > Cc: NeilBrown; Markus Stockhausen; linux-raid@xxxxxxxxxxxxxxx
> > Subject: [PATCH] lib/raid6: possibly use different gen and xor
> > algorithms.
> >
> > The performance measurement of xor() was introduced in
> > fe5cbc6e06c7d8b3, but its result was unused. Given that all
> > implementations should give the same output, it makes sense to use
> > the best function for each operation.
>
> Hi Hristo,
>
> I just want to remind you that the xor function speed test might be a
> bit tricky. While the generation always must calculate the full
> stripe, the xor pages vary from request to request.
>
> You can play around with the start/stop values to get a better idea
> of whether the xor function will be comparable for different test
> setups, e.g.
>
> int start = 0, stop = (disks>>1)-1;

Hi Markus,

I just saw that I broke the userspace test because I didn't include
<stdbool.h>. I will fix that in the next version of the patch.

That being said, I ran some tests (results below). It probably makes
sense to run xor_syndrome() multiple times per iteration with
different start and stop values. Given that xor_syndrome() is usually
called for short writes, I'd probably run it on both halves of the
disks and then on the 4 quarters; see the sketch after the first
batch of results. What do you think?

Test results for `stop = disks-3` (ratios are relative to the x4
variant):

AMD Ryzen 2700X:

  start = disks - 4
    avx2x1: 16713 MB/s - 0.84x
    avx2x2: 23702 MB/s - 1.19x
    avx2x4: 19837 MB/s - 1.00x

  start = (disks>>1) - 1
    avx2x1: 17116 MB/s - 1.03x
    avx2x2: 18787 MB/s - 1.14x
    avx2x4: 16547 MB/s - 1.00x

  start = (disks>>2) - 1
    avx2x1: 12244 MB/s - 0.84x
    avx2x2: 16257 MB/s - 1.11x
    avx2x4: 14647 MB/s - 1.00x

  start = 1
    avx2x1: 11824 MB/s - 0.96x
    avx2x2: 15225 MB/s - 1.23x
    avx2x4: 12367 MB/s - 1.00x

  start = 0
    avx2x1: 11153 MB/s - 0.85x
    avx2x2: 14868 MB/s - 1.14x
    avx2x4: 13050 MB/s - 1.00x

Intel Core i7-7500U:

  start = disks - 4
    avx2x1: 21692 MB/s - 0.76x
    avx2x2: 27447 MB/s - 0.96x
    avx2x4: 28553 MB/s - 1.00x

  start = (disks>>1) - 1
    avx2x1: 18453 MB/s - 0.79x
    avx2x2: 20117 MB/s - 0.86x
    avx2x4: 23304 MB/s - 1.00x

  start = (disks>>2) - 1
    avx2x1: 15703 MB/s - 0.85x
    avx2x2: 16850 MB/s - 0.92x
    avx2x4: 18390 MB/s - 1.00x

  start = 1
    avx2x1: 14777 MB/s - 0.87x
    avx2x2: 15835 MB/s - 0.94x
    avx2x4: 16921 MB/s - 1.00x

  start = 0
    avx2x1: 14206 MB/s - 0.89x
    avx2x2: 15409 MB/s - 0.96x
    avx2x4: 16012 MB/s - 1.00x

Intel Atom C3955:

  start = disks - 4
    sse2x1: 4004 MB/s - 1.11x
    sse2x2: 5823 MB/s - 1.62x
    sse2x4: 3599 MB/s - 1.00x

  start = (disks>>1) - 1
    sse2x1: 3114 MB/s - 1.20x
    sse2x2: 3722 MB/s - 1.44x
    sse2x4: 2587 MB/s - 1.00x

  start = (disks>>2) - 1
    sse2x1: 2121 MB/s - 1.05x
    sse2x2: 2565 MB/s - 1.27x
    sse2x4: 2022 MB/s - 1.00x

  start = 1
    sse2x1: 1978 MB/s - 1.01x
    sse2x2: 2429 MB/s - 1.24x
    sse2x4: 1966 MB/s - 1.00x

  start = 0
    sse2x1: 1937 MB/s - 1.04x
    sse2x2: 2349 MB/s - 1.26x
    sse2x4: 1860 MB/s - 1.00x

For smaller `stop`, x2 becomes faster than x4 on all machines I
tested.
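To make the multi-range idea concrete, one benchmark iteration could
do something like the following. This is an untested sketch: the
xor_syndrome(disks, start, stop, bytes, ptrs) signature is the one
from lib/raid6 (start/stop are inclusive data-disk indices), while
bench_xor_iteration() and the disks/bytes/ptrs setup are placeholders
for whatever the timing loop already has.

#include <linux/raid/pq.h>

/*
 * Untested sketch: exercise xor_syndrome() over both halves of the
 * data disks and then over the four quarters, instead of one fixed
 * (start, stop) range.  Data disks are indexed 0 .. disks-3.
 * Assumes enough data disks that no quarter is empty (d >= 4).
 */
static void bench_xor_iteration(const struct raid6_calls *alg,
                                int disks, size_t bytes, void **ptrs)
{
        int d = disks - 2;      /* number of data disks */

        /* the two halves */
        alg->xor_syndrome(disks, 0, d / 2 - 1, bytes, ptrs);
        alg->xor_syndrome(disks, d / 2, d - 1, bytes, ptrs);

        /* the four quarters (rounding is rough on purpose) */
        alg->xor_syndrome(disks, 0, d / 4 - 1, bytes, ptrs);
        alg->xor_syndrome(disks, d / 4, d / 2 - 1, bytes, ptrs);
        alg->xor_syndrome(disks, d / 2, 3 * d / 4 - 1, bytes, ptrs);
        alg->xor_syndrome(disks, 3 * d / 4, d - 1, bytes, ptrs);
}

That way each candidate gets timed on a mix of range widths, which
should be closer to how md actually calls xor_syndrome() for short
writes than any single fixed range.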
Tests with `stop = (disks>>1) - 1`:

AMD Ryzen 2700X:

  start = disks>>2
    avx2x1: 31449 MB/s - 1.44x
    avx2x2: 32975 MB/s - 1.51x
    avx2x4: 21789 MB/s - 1.00x

  start = 0
    avx2x1: 24260 MB/s - 1.07x
    avx2x2: 25347 MB/s - 1.11x
    avx2x4: 22775 MB/s - 1.00x

Intel Core i7-7500U:

  start = disks>>2
    avx2x1: 35639 MB/s - 1.01x
    avx2x2: 42438 MB/s - 1.21x
    avx2x4: 35146 MB/s - 1.00x

  start = 0
    avx2x1: 28471 MB/s - 1.09x
    avx2x2: 28736 MB/s - 1.10x
    avx2x4: 26075 MB/s - 1.00x

Intel Atom C3955:

  start = disks>>2
    sse2x1: 6461 MB/s - 1.88x
    sse2x2: 7548 MB/s - 2.20x
    sse2x4: 3435 MB/s - 1.00x

  start = 0
    sse2x1: 4155 MB/s - 1.59x
    sse2x2: 4522 MB/s - 1.73x
    sse2x4: 2612 MB/s - 1.00x

> Best regards.
>
> Markus
>
> > For example, on my machine more unrolling can benefit gen but not
> > xor:
> >
> > raid6: sse2x1 gen()  9560 MB/s
> > raid6: sse2x1 xor()  7021 MB/s
> > raid6: sse2x2 gen() 11741 MB/s
> > raid6: sse2x2 xor()  8111 MB/s
> > raid6: sse2x4 gen() 13801 MB/s
> > raid6: sse2x4 xor()  8002 MB/s
> > raid6: avx2x1 gen() 19298 MB/s
> > raid6: avx2x1 xor() 13780 MB/s
> > raid6: avx2x2 gen() 23303 MB/s
> > raid6: avx2x2 xor() 15258 MB/s
> > raid6: avx2x4 gen() 27255 MB/s
> > raid6: avx2x4 xor() 14617 MB/s
> > raid6: using algorithm avx2x4 gen() 27255 MB/s
> > raid6: and algorithm avx2x2 xor() 15258 MB/s, rmw enabled
> >
> > Signed-off-by: Hristo Venev <hristo@xxxxxxxxxx>
> > ...
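For anyone skimming the thread, the gist of the quoted patch is
roughly the following. This is a simplified sketch, not the actual
diff: bench_gen()/bench_xor() are hypothetical stand-ins for the
existing jiffies-based timing loops in lib/raid6/algos.c, and the real
code handles more cases (the "prefer" flag, fallback, etc.).

#include <linux/raid/pq.h>

/* hypothetical stand-ins for the existing timing loops (MB/s) */
unsigned long bench_gen(const struct raid6_calls *alg);
unsigned long bench_xor(const struct raid6_calls *alg);

/*
 * Benchmark every usable algorithm and remember the fastest
 * gen_syndrome() and the fastest xor_syndrome() independently,
 * instead of picking one algorithm for both operations.
 */
static void select_best(const struct raid6_calls **best_gen,
                        const struct raid6_calls **best_xor)
{
        const struct raid6_calls *const *algo;
        unsigned long gen_top = 0, xor_top = 0;

        *best_gen = *best_xor = NULL;

        for (algo = raid6_algos; *algo; algo++) {
                unsigned long gen_speed, xor_speed;

                /* skip algorithms the CPU doesn't support */
                if ((*algo)->valid && !(*algo)->valid())
                        continue;

                gen_speed = bench_gen(*algo);
                xor_speed = bench_xor(*algo);

                if (gen_speed > gen_top) {
                        gen_top = gen_speed;
                        *best_gen = *algo;
                }
                if (xor_speed > xor_top) {
                        xor_top = xor_speed;
                        *best_xor = *algo;
                }
        }
}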