Re: [PATCH v2 2/2] md/raid6: fix algorithm choice under larger PAGE_SIZE

On December 13, 2019 9:12:47 AM PST, Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx> wrote:
>
>
>On 2019/12/13 12:29 AM, hpa@xxxxxxxxx wrote:
>> On December 12, 2019 8:08:36 AM PST, Zhengyuan Liu
>> <liuzhengyuan@xxxxxxxxxx> wrote:
>>>
>>>
>>> On 2019/12/12 3:26 AM, Song Liu wrote:
>>>> On Wed, Dec 4, 2019 at 7:13 PM Zhengyuan Liu
>>>> <liuzhengyuan@xxxxxxxxxx> wrote:
>>>>>
>>>>> There are several algorithms available for raid6 to generate xor and
>>>>> syndrome parity, including the basic int1, int2 ... int32 and
>>>>> SIMD-optimized implementations like sse and neon.  To test and choose
>>>>> the best algorithm at the initial stage, we need to provide enough disk
>>>>> data to feed the algorithms.  However, the number of disks we provide
>>>>> depends on the page size and the gfmul table, as seen below:
>>>>>
>>>>>           int __init raid6_select_algo(void)
>>>>>           {
>>>>>                   const int disks = (65536/PAGE_SIZE) + 2;
>>>>>                   ...
>>>>>           }
>>>>>
>>>>> So when it comes to a 64K PAGE_SIZE, there is only one data disk plus
>>>>> two parity disks; as a result the chosen algorithm is not reliable.
>>>>> For example, on my arm64 machine with 64K pages enabled, it will choose
>>>>> intx32 as the best one, although the NEON implementation is better.
>>>>
>>>> I think we can fix this by simply changing raid6_select_algo()?  We
>>>> still have
>>>
>>> Actually I fixed this by only changing raid6_select_algo() in patch V1,
>>> but I found that lib/raid6/test also defines a block size named PAGE_SIZE
>>> for itself in pq.h:
>>>
>>> 	#ifndef PAGE_SIZE
>>> 	# define PAGE_SIZE 4096
>>> 	#endif
>>>
>>> There is no need to define two separate block sizes for testing; just
>>> unify them to use one.
>>>
>>>>
>>>> #define STRIPE_SIZE             PAGE_SIZE
>>>>
>>>> So testing with PAGE_SIZE represents real performance.
>>>>
>>> I originally preferred to choose PAGE_SIZE as the block size, but there
>>> is no suitable data source since the gfmul table is only 64K.  It's too
>>> expensive to use a random number generator to fill all the data.
>>>
>>> My test results show no obvious difference between a 4K block size and a
>>> 64K block size under a 64K PAGE_SIZE.
>>>
>>> Thanks,
>>> Zhengyuan
>>>> Thanks,
>>>> Song
>>>>
>> 
>> The test directory tests mainly for correctness, not comparative
>> performance.
>> 
>> The main reason for using a full page, as the actual RAID code will, is
>> for architectures which may have high setup overhead, such as off-core
>> accelerators.  In that case using a smaller buffer may give massively
>> wrong results and we lose out on the use of the accelerator unless we
>> have hardcoded priorities; the other issue is that you might find that
>> your data set fits in a lower-level cache than would be realistic during
>> actual operation, which again would give very wrong results for
>> cache-bypassing vs. non-bypassing algorithms.
>> 
>> The use of the RAID-6 coefficient table is just a matter of convenience;
>> it was already there, and at that time the highest page size in use was
>> 32K (this is also why zisofs uses that block size by default).  I wanted
>> to avoid multiple dynamic allocations as much as possible in order to
>> avoid cache- or TLB-related nondeterminism.
>
>Thanks for the detailed explanations.
>
>> 
>> The more I think about it, the more I think that the best thing to do is
>> a single order-3 dynamic allocation: fill the first six pages with some
>> arbitrary content (copying the table as many times as needed would work,
>> but it would be better to use a deterministic PRNG which is *not* Galois
>> field based; something as simple as x1 = A*x0 + B, where A is a prime,
>> should be plenty fine), use the last two pages as the target buffer, and
>> then do the benchmarking with 6 data disks/8 total disks for all
>> architectures; this then also models the differences in RAID stripe
>> size.  We are running in a very benign environment at this point, so if
>> we end up sleeping a bit to be able to do a high-order allocation, it is
>> not a problem (and these days, the mm can do compaction if needed).
>> 
>
>Actually I had tried to fix this problem, as you suggested about three
>years ago, by simply doing an order-3 allocation and filling it with a
>PRNG (patch V1 used a simple linear congruential PRNG, which was not
>suitable as you pointed out, and patch V2 uses a non-linear PRNG which
>seems to be a bit overkill); see the discussion in the link below:
>
>https://groups.google.com/forum/#!msg/fa.linux.kernel/5uoxjvg0_gg/mk_ekZrOBgAJ
>
>To make things as simple as possible, I think we can just copy the table.

A PRNG would actually be better; the table is highly suboptimal precisely because it is Galois-field derived. The reason copying it works at all is that the data pattern really doesn't matter: no sane implementation is going to have data dependencies here.

A better choice than multiplication is circular multiplication, which also has the advantage of being nearly as cheap as a plain multiply on most systems:

static inline u32 circular_mult(u32 x, u32 y)
{
    u64 z = (u64)x * y;    /* widen first so the high half isn't lost */

    return (u32)(z >> 32) + (u32)z;    /* fold the high half back in */
}
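Because the high half of the 64-bit product is folded back in rather than discarded, every bit of the full product influences the result; that is what buys the extra diffusion over a plain truncating multiply.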

If people want more nonlinearity I suggest simply making it quadratic:

x = circular_mult(circular_mult(A, x) + B, x) + C;

... with A, B, and C being large prime numbers just to improve diffusion.
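To make that concrete, here is a minimal sketch of what the allocation and fill could look like; fill_prng(), alloc_bench_pages(), and the specific constants are placeholders for illustration, not anything that exists in lib/raid6 today:

#include <linux/gfp.h>    /* __get_free_pages(), GFP_KERNEL */
#include <linux/types.h>  /* u32, u64, size_t */

/* Sketch only: fill a buffer using the quadratic recurrence above.
 * A, B, C here are large primes borrowed from common hash constants;
 * any large primes should serve. */
static void fill_prng(u32 *p, size_t words)
{
    u32 x = 1;    /* arbitrary nonzero seed; determinism is the point */

    while (words--) {
        x = circular_mult(circular_mult(0x9e3779b1, x) + 0x85ebca77, x)
            + 0xc2b2ae3d;
        *p++ = x;
    }
}

/* One order-3 allocation = 8 pages: the first six act as the data
 * "disks" and the last two as the P/Q targets, whatever PAGE_SIZE is. */
static void *alloc_bench_pages(void)
{
    void *buf = (void *)__get_free_pages(GFP_KERNEL, 3);

    if (buf)
        fill_prng(buf, 6 * PAGE_SIZE / sizeof(u32));
    return buf;
}

The benchmark loop would then point six data pointers and two parity pointers into that one buffer, and free it afterwards with free_pages((unsigned long)buf, 3).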

The kernel *text* section would also work, but I don't know if any architectures do X-only text.

If you post something like that I'll back you up on it.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



