Re: [PATCH] Use strict priority ranking for pq gen() benchmarking

Dear Dirk,


Thank you for the detailed reply.


On 31.12.21 09:52, Dirk Müller wrote:
> On 2021-12-30 14:46, Paul Menzel wrote:

>> Can the AVX2 wins over AVX512 be explained, or does it point to
>> some implementation problem?

> I've not yet analyzed this deeply enough to have a defendable
> explanation ready, sorry. My patch does not change the situation
> regarding AVX512 vs AVX2 (both are ranked equal, same as before). The
> only change I make is that SSE2 is ranked lower than AVX2, so CPU
> generations that have AVX2 will stop benchmarking at AVX2 rather than
> also including SSE2 benchmark runs.

> The current benchmark routine is likely too naive when you look at
> the last 20+ years of CPU design improvements (prefetching,
> out-of-order execution, turbo modes, efficiency cores, AVX512
> license-based turbo, and many other aspects). This is not my current
> focus; my current focus is on lowering the tax of the benchmark.

Thank you. Sorry for hijacking this thread with the question.
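
Just to check my understanding of the strict ranking: I imagine it amounts to something like the sketch below. The `priority` field and the `benchmark()` helper are my guesses for illustration, not necessarily the exact code of your patch.

   /* Illustrative sketch only, loosely modeled on lib/raid6/algos.c. */
   #include <stddef.h>

   struct raid6_calls {
           void (*gen_syndrome)(int disks, size_t bytes, void **ptrs);
           int (*valid)(void);   /* non-zero if usable on this CPU */
           const char *name;
           int priority;         /* assumed: AVX512/AVX2 equal, SSE2 lower */
   };

   /* Hypothetical stand-in for the real benchmark, which times gen()
    * over a fixed interval and returns a throughput figure. */
   static unsigned long benchmark(const struct raid6_calls *algo)
   {
           (void)algo;
           return 0;
   }

   static const struct raid6_calls *
   select_gen(const struct raid6_calls *const *algos)
   {
           const struct raid6_calls *best = NULL;
           unsigned long best_perf = 0, perf;

           for (; *algos; algos++) {
                   /* Strict ranking: once e.g. an avx2x* candidate is
                    * in, the lower-priority sse2x* entries are never
                    * benchmarked at all. */
                   if (best && (*algos)->priority < best->priority)
                           continue;
                   if ((*algos)->valid && !(*algos)->valid())
                           continue;
                   perf = benchmark(*algos);
                   if (!best || perf > best_perf) {
                           best = *algos;
                           best_perf = perf;
                   }
           }
           return best;
   }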

>> By the way, Borislav did not give much credit to the benchmark
>> results [1].

> I have seen that as well. There are two remarks on this (neither
> invalidating what Borislav wrote):
>
> * the comment was about xor(), while this patch is about gen()
> * the benchmark logic does a relative ranking of the approaches, so
>   fluctuations in the absolute numbers do not matter as long as they
>   still rank the same

Indeed.

>>> By giving AVXx variants higher priority over SSE, we can generally
>>> skip 3 benchmarks which speeds this up by 33% - 50%, depending on
>>> whether AVX512 is available.
>>
>> Please give concrete timing numbers for one system you tested this
>> on.

> I have given an explanation of how this patch affects the number of
> benchmarks that are run. How long they take depends on other factors.
> This is the list of benchmarks configured (the raid6_algos[] array in
> lib/raid6/algos.c):


>    #if defined(__x86_64__) && !defined(__arch_um__)
>    #ifdef CONFIG_AS_AVX512
>            &raid6_avx512x4,
>            &raid6_avx512x2,
>            &raid6_avx512x1,
>    #endif
>            &raid6_avx2x4,
>            &raid6_avx2x2,
>            &raid6_avx2x1,
>            &raid6_sse2x4,
>            &raid6_sse2x2,
>            &raid6_sse2x1,
>    #endif

> Without this patch, all of them are benchmarked. With this patch, the
> last 3 (sse2x*) are skipped, saving 3 out of 9 runs with AVX512
> enabled (33%) or 3 out of 6 runs without it (50%), which is the
> 33% - 50% written above.
>
> I'm open to any suggestion of a wording change that makes this
> clearer.

As in the other patch, having an additional statement like the one below would help me.

With a 250 Hz kernel on an Intel Xeon(?) …, according to `initcall_debug` the former load time is X ms, and now only Y ms.
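(For example, booting with `initcall_debug` on the kernel command line should make the kernel log a line like `initcall raid6_select_algo+0x0/... returned 0 after <N> usecs` to dmesg; I am assuming `raid6_select_algo` is the relevant initcall in lib/raid6/algos.c, and the exact format may differ between kernel versions.)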


Kind regards,

Paul


