On December 13, 2019 9:12:47 AM PST, Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx> wrote:
>
>
>On 2019/12/13 12:29 AM, hpa@xxxxxxxxx wrote:
>> On December 12, 2019 8:08:36 AM PST, Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx> wrote:
>>>
>>>
>>> On 2019/12/12 3:26 AM, Song Liu wrote:
>>>> On Wed, Dec 4, 2019 at 7:13 PM Zhengyuan Liu <liuzhengyuan@xxxxxxxxxx> wrote:
>>>>>
>>>>> There are several algorithms available for raid6 to generate xor and
>>>>> syndrome parity, including the basic int1, int2 ... int32 and
>>>>> SIMD-optimized implementations like sse and neon. To test and choose
>>>>> the best algorithm at the initial stage, we need to provide enough
>>>>> disk data to feed the algorithms. However, the number of disks we
>>>>> provide depends on the page size and the gfmul table; see below:
>>>>>
>>>>> int __init raid6_select_algo(void)
>>>>> {
>>>>>         const int disks = (65536/PAGE_SIZE) + 2;
>>>>>         ...
>>>>> }
>>>>>
>>>>> So with a 64K PAGE_SIZE, there is only one data disk plus 2 parity
>>>>> disks, and as a result the chosen algorithm is not reliable. For
>>>>> example, on my arm64 machine with 64K pages enabled, it chooses
>>>>> intx32 as the best one, although the NEON implementation is better.
>>>>
>>>> I think we can fix this by simply changing raid6_select_algo()? We
>>>> still have
>>>
>>> Actually I fixed this by only changing raid6_select_algo() in patch V1,
>>> but I found lib/raid6/test has also defined a block size named
>>> PAGE_SIZE for itself in pq.h:
>>>
>>> #ifndef PAGE_SIZE
>>> # define PAGE_SIZE 4096
>>> #endif
>>>
>>> There is no need to separately define two block sizes for testing;
>>> just unify them to use one.
>>>
>>>>
>>>> #define STRIPE_SIZE PAGE_SIZE
>>>>
>>>> So testing with PAGE_SIZE represents real performance.
>>>>
>>> I originally preferred to choose PAGE_SIZE as the block size, but
>>> there is no suitable data source since the gfmul table has only 64K.
>>> It's too expensive to use a random number generator to fill all the
>>> data.
>>>
>>> My test results show there is no obvious difference between a 4K
>>> block size and a 64K block size under a 64K PAGE_SIZE.
>>>
>>> Thanks,
>>> Zhengyuan
>>>
>>>> Thanks,
>>>> Song
>>>>
>>
>> The test directory tests mainly for correctness, not comparative
>> performance.
>>
>> The main reason for using a full page, as the actual RAID code does,
>> would be for architectures which may have high setup overhead, such as
>> off-core accelerators. In that case using a smaller buffer may give
>> massively wrong results, and we lose out on the use of the accelerator
>> unless we have hardcoded priorities; the other thing is that you might
>> find that your data set fits in a lower-level cache than would be
>> realistic during actual operation, which again would give very wrong
>> results for cache-bypassing vs non-bypassing algorithms.
>>
>> The use of the RAID-6 coefficient table is just a matter of
>> convenience; it was already there, and at that time the highest page
>> size in use was 32K; this is also why zisofs uses that block size by
>> default. I wanted to avoid multiple dynamic allocations as much as
>> possible in order to avoid cache- or TLB-related nondeterminism.
>
>Thanks for the detailed explanations.
>
>>
>> The more I think about it, the more I think that the best thing to do
>> is to do a single order-3 dynamic allocation, and just fill the first
>> six pages with some arbitrary content (either copying the table as
>> many times as needed, but it would be better to use a deterministic
>> PRNG which is *not* Galois-field based; something as simple as
>> x1 = A*x0 + B where A is a prime should be plenty fine) and use the
>> remaining two as the target buffers, then do the benchmarking with
>> 6 data disks/8 total disks for all architectures; this then also
>> models the differences in RAID stripe size.
>> We are running in a very benign environment at this point, so if we
>> end up sleeping a bit to be able to do a high-order allocation, it is
>> not a problem (and these days, the mm can do compaction if needed.)
>>
>
>Actually I had tried to fix this problem, as you suggested, about three
>years ago, by simply doing an order-3 allocation and filling it with a
>PRNG (patch V1 uses a simple linear congruential PRNG, which was not
>suitable as you pointed out, and patch V2 uses a non-linear PRNG, which
>seems to be a bit overkill); see the discussion at the link below:
>
>https://groups.google.com/forum/#!msg/fa.linux.kernel/5uoxjvg0_gg/mk_ekZrOBgAJ
>
>To make things as simple as possible, I think we can just copy the table.

You're right, we have more or less come full circle... I still think it
is the best choice. Maybe use plain kmalloc() instead, though, and I
strongly suggest using __GFP_NOFAIL instead of failing to initialize.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.