On December 12, 2019 8:08:36 AM PST, Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx> wrote:
>
>On 2019/12/12 3:26 AM, Song Liu wrote:
>> On Wed, Dec 4, 2019 at 7:13 PM Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx> wrote:
>>>
>>> There are several algorithms available for raid6 to generate xor and syndrome
>>> parity, including the basic int1, int2 ... int32 and SIMD optimized implementations
>>> like sse and neon. To test and choose the best algorithm at the initial
>>> stage, we need to provide enough disk data to feed the algorithms. However, the
>>> number of disks we provide depends on the page size and the gfmul table, as seen below:
>>>
>>> int __init raid6_select_algo(void)
>>> {
>>>         const int disks = (65536/PAGE_SIZE) + 2;
>>>         ...
>>> }
>>>
>>> So with a 64K PAGE_SIZE, there is only one data disk plus 2 parity disks,
>>> and as a result the chosen algorithm is not reliable. For example, on my arm64
>>> machine with 64K pages enabled, it chooses intx32 as the best one, although
>>> the NEON implementation is better.
>>
>> I think we can fix this by simply changing raid6_select_algo()? We still have
>
>Actually I fixed this by only changing raid6_select_algo() in patch V1,
>but I found that lib/raid6/test has also defined its own block size, also named
>PAGE_SIZE, in pq.h:
>
>    #ifndef PAGE_SIZE
>    # define PAGE_SIZE 4096
>    #endif
>
>There is no need to separately define two block sizes for testing; just
>unify them to use one.
>
>>
>> #define STRIPE_SIZE PAGE_SIZE
>>
>> So testing with PAGE_SIZE represents real performance.
>>
>I originally preferred to choose PAGE_SIZE as the block size, but there
>is no suitable data source since the gfmul table is only 64K. It's too
>expensive to use a random number generator to fill all the data.
>
>My test results show no obvious difference between a 4K block
>size and a 64K block size under a 64K PAGE_SIZE.
>
>Thanks,
>Zhengyuan
>> Thanks,
>> Song
>>

The test directory tests mainly for correctness, not comparative performance.

The main reason for using a full page, as the actual RAID code will, would be for architectures which may have high setup overhead, such as off-core accelerators. In that case, using a smaller buffer may give massively wrong results and we lose out on the use of the accelerator unless we have hardcoded priorities; the other thing is that you might find that your data set fits in a lower level cache than would be realistic during actual operation, which again would give very wrong results for cache-bypassing vs non-bypassing algorithms.

The use of the RAID-6 coefficient table is just a matter of convenience; it was already there, and at that time the highest page size in use was 32K; this is also why zisofs uses that block size by default. I wanted to avoid multiple dynamic allocations as much as possible in order to avoid cache- or TLB-related nondeterminism.

The more I think about it, the more I think that the best thing to do is a single order-3 dynamic allocation: fill the first six pages with some arbitrary content (either copying the table as many times as needed, but it would be better to use a deterministic PRNG which is *not* Galois field based; something as simple as x1 = A*x0 + B where A is a prime should be plenty fine), use the remaining two pages as the target buffer, and then do the benchmarking with 6 data disks/8 total disks for all architectures; this then also models the differences in RAID stripe size.
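To make that concrete, here is a rough sketch of what the buffer setup could look like; raid6_alloc_test_pages(), the LCG constants, and the fill pattern are illustrative choices of mine, not existing kernel code, and this is untested:

#include <linux/gfp.h>
#include <linux/types.h>

/*
 * Hypothetical sketch, not a tested patch: a single order-3 allocation
 * gives eight contiguous pages; the first six serve as data disks and
 * the last two receive the P/Q syndromes.  The data pages are filled
 * with a simple non-Galois-field LCG (x1 = A*x0 + B, A prime), so the
 * pattern is deterministic but unrelated to the GF math being timed.
 */
static void *raid6_alloc_test_pages(void **dptrs, int disks)
{
	unsigned long addr;
	u32 x = 1;			/* arbitrary fixed seed */
	int i;

	addr = __get_free_pages(GFP_KERNEL, 3);	/* 8 pages */
	if (!addr)
		return NULL;

	for (i = 0; i < disks; i++)
		dptrs[i] = (void *)(addr + i * PAGE_SIZE);

	/* Fill only the data pages; the last two are parity outputs. */
	for (i = 0; i < disks - 2; i++) {
		u32 *p = dptrs[i];
		size_t j;

		for (j = 0; j < PAGE_SIZE / sizeof(u32); j++) {
			x = x * 2654435761U + 12345;	/* A prime, B odd */
			p[j] = x;
		}
	}

	return (void *)addr;
}

raid6_select_algo() would then pass an 8-entry dptrs array, run the timing loop with disks = 8, and release the buffer afterwards with free_pages((unsigned long)addr, 3).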
We are running in a very benign environment at this point, so if we end up sleeping a bit to be able to do a high-order allocation, it is not a problem (and these days, the mm can do compaction if needed.)

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.