On December 12, 2019 8:08:36 AM PST, Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx> wrote:
>
>On 2019/12/12 3:26 AM, Song Liu wrote:
>> On Wed, Dec 4, 2019 at 7:13 PM Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx> wrote:
>>>
>>> There are several algorithms available for raid6 to generate xor and syndrome
>>> parity, including the basic int1, int2 ... int32 and SIMD optimized implementations
>>> like sse and neon. To test and choose the best algorithm at the initial
>>> stage, we need to provide enough disk data to feed the algorithms. However, the
>>> number of disks we provide depends on the page size and the gfmul table, as seen below:
>>>
>>> int __init raid6_select_algo(void)
>>> {
>>>         const int disks = (65536/PAGE_SIZE) + 2;
>>>         ...
>>> }
>>>
>>> So with a 64K PAGE_SIZE, there is only one data disk plus 2 parity disks,
>>> and as a result the chosen algorithm is not reliable. For example, on my arm64
>>> machine with 64K pages enabled, it chooses intx32 as the best one, although
>>> the NEON implementation is better.
>>
>> I think we can fix this by simply changing raid6_select_algo()? We still have
>
>Actually I fixed this by only changing raid6_select_algo() in patch V1,
>but I found that lib/raid6/test has also defined its own block size, also named
>PAGE_SIZE, in pq.h:
>
>    #ifndef PAGE_SIZE
>    # define PAGE_SIZE 4096
>    #endif
>
>There is no need to separately define two block sizes for testing; just
>unify them to use one.
>
>>
>> #define STRIPE_SIZE PAGE_SIZE
>>
>> So testing with PAGE_SIZE represents real performance.
>>
>I originally preferred to choose PAGE_SIZE as the block size, but there
>is no suitable data source since the gfmul table is only 64K. It's too
>expensive to use a random number generator to fill all the data.
>
>My test results show no obvious difference between a 4K block
>size and a 64K block size under a 64K PAGE_SIZE.
>
>Thanks,
>Zhengyuan
>> Thanks,
>> Song
>>

The test directory tests mainly for correctness, not comparative performance.

The main reason for using a full page, as the actual RAID code will, would be for architectures which may have high setup overhead, such as off-core accelerators. In that case, using a smaller buffer may give massively wrong results and we lose out on the use of the accelerator unless we have hardcoded priorities; the other thing is that you might find that your data set fits in a lower level cache than would be realistic during actual operation, which again would give very wrong results for cache-bypassing vs non-bypassing algorithms.

The use of the RAID-6 coefficient table is just a matter of convenience; it was already there, and at that time the highest page size in use was 32K; this is also why zisofs uses that block size by default. I wanted to avoid multiple dynamic allocations as much as possible in order to avoid cache- or TLB-related nondeterminism.

The more I think about it, the more I think that the best thing to do is a single order-3 dynamic allocation: fill the first six pages with some arbitrary content (either copying the table as many times as needed, but it would be better to use a deterministic PRNG which is *not* Galois field based; something as simple as x1 = A*x0 + B where A is a prime should be plenty fine), use the remaining two pages as the target buffer, and then do the benchmarking with 6 data disks/8 total disks for all architectures; this then also models the differences in RAID stripe size.
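To make that concrete, here is a rough sketch of what the buffer setup could look like; raid6_alloc_test_pages(), the LCG constants, and the fill pattern are illustrative choices of mine, not existing kernel code, and this is untested:

#include <linux/gfp.h>
#include <linux/types.h>

/*
 * Hypothetical sketch, not a tested patch: a single order-3 allocation
 * gives eight contiguous pages; the first six serve as data disks and
 * the last two receive the P/Q syndromes.  The data pages are filled
 * with a simple non-Galois-field LCG (x1 = A*x0 + B, A prime), so the
 * pattern is deterministic but unrelated to the GF math being timed.
 */
static void *raid6_alloc_test_pages(void **dptrs, int disks)
{
	unsigned long addr;
	u32 x = 1;			/* arbitrary fixed seed */
	int i;

	addr = __get_free_pages(GFP_KERNEL, 3);	/* 8 pages */
	if (!addr)
		return NULL;

	for (i = 0; i < disks; i++)
		dptrs[i] = (void *)(addr + i * PAGE_SIZE);

	/* Fill only the data pages; the last two are parity outputs. */
	for (i = 0; i < disks - 2; i++) {
		u32 *p = dptrs[i];
		size_t j;

		for (j = 0; j < PAGE_SIZE / sizeof(u32); j++) {
			x = x * 2654435761U + 12345;	/* A prime, B odd */
			p[j] = x;
		}
	}

	return (void *)addr;
}

raid6_select_algo() would then pass an 8-entry dptrs array, run the timing loop with disks = 8, and release the buffer afterwards with free_pages((unsigned long)addr, 3).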
We are running in a very benign environment at this point, so if we end up sleeping a bit to be able to do a high-order allocation, it is not a problem (and these days, the mm can do compaction if needed.)

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.