On Tue, Nov 10, 2020 at 09:20:46PM +0800, Li Qiang wrote: > > > 在 2020/11/10 18:46, Dave Martin 写道: > > On Mon, Nov 09, 2020 at 11:43:35AM +0800, Li Qiang wrote: > >> Hi Dave, > >> > >> I carefully read the ideas you provided and the sample code you gave me.:) > >> > >> 在 2020/11/6 0:53, Dave Martin 写道: > >>> On Tue, Nov 03, 2020 at 08:15:05PM +0800, l00374334 wrote: > >>>> From: liqiang <liqiang64@xxxxxxxxxx> > >>>> > >>>> Dear all, > >>>> > >>>> Thank you for taking the precious time to read this email! > >>>> > >>>> Let me introduce the implementation ideas of my code here. > >>>> > >>>> In the process of using the compression library libz, I found that the adler32 > >>>> checksum always occupies a higher hot spot, so I paid attention to this algorithm. > >>>> After getting in touch with the SVE instruction set of armv8, I realized that > >>>> SVE can effectively accelerate adler32, so I made some attempts and got correct > >>>> and better performance results. I very much hope that this modification can be > >>>> applied to the kernel. > >>>> > >>>> Below is my analysis process: > >>>> > >>>> Adler32 algorithm > >>>> ================= > >>>> > >>>> Reference: https://en.wikipedia.org/wiki/Adler-32 > >>>> > >>>> Assume that the buf of the Adler32 checksum to be calculated is D and the length is n: > >>>> > >>>> A = 1 + D1 + D2 + ... + Dn (mod 65521) > >>>> > >>>> B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521) > >>>> = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521) > >>>> > >>>> Adler-32(D) = B × 65536 + A > >>>> > >>>> In C, an inefficient but straightforward implementation is: > >>>> > >>>> const uint32_t MOD_ADLER = 65521; > >>>> > >>>> uint32_t adler32(unsigned char *data, size_t len) > >>>> { > >>>> uint32_t a = 1, b = 0; > >>>> size_t index; > >>>> > >>>> // Process each byte of the data in order > >>>> for (index = 0; index < len; ++index) > >>>> { > >>>> a = (a + data[index]) % MOD_ADLER; > >>>> b = (b + a) % MOD_ADLER; > >>>> } > >>>> > >>>> return (b << 16) | a; > >>>> } > >>>> > >>>> SVE vector method > >>>> ================= > >>>> > >>>> Step 1. Determine the block size: > >>>> Use addvl instruction to get SVE bit width. > >>>> Assuming the SVE bit width is x here. > >>>> > >>>> Step 2. Start to calculate the first block: > >>>> The calculation formula is: > >>>> A1 = 1 + D1 + D2 + ... + Dx (mod 65521) > >>>> B1 = x*D1 + (x-1)*D2 + ... + Dx + x (mod 65521) > >>>> > >>>> Step 3. Calculate the follow block: > >>>> The calculation formula of A2 is very simple, just add up: > >>>> A2 = A1 + Dx+1 + Dx+2 + ... + D2x (mod 65521) > >>>> > >>>> The calculation formula of B2 is more complicated, because > >>>> the result is related to the length of buf. When calculating > >>>> the B1 block, it is actually assumed that the length is the > >>>> block length x. Now when calculating B2, the length is expanded > >>>> to 2x, so B2 becomes: > >>>> B2 = 2x*D1 + (2x-1)*D2 + ... + (x+1)*Dx + x*D(x+1) + ... + D2x + 2x > >>>> = x*D1 + x*D1 + x*D2 + (x-1)*D2 + ... + x*Dx + Dx + x*1 + x + [x*D(x+1) + (x-1)*D(x+2) + ... + D2x] > >>>> ^^^^ ~~~~ ^^^^ ~~~~~~~~ ^^^^ ~~ ^^^ ~ +++++++++++++++++++++++++++++++++++++ > >>>> Through the above polynomial transformation: > >>>> Symbol "^" represents the <x * A1>; > >>>> Symbol "~" represents the <B1>; > >>>> Symbol "+" represents the next block. > >>>> > >>>> So we can get the method of calculating the next block from > >>>> the previous block(Assume that the first byte number of the > >>>> new block starts from 1): > >>>> An+1 = An + D1 + D2 + ... + Dx (mod 65521) > >>>> Bn+1 = Bn + x*An + x*D1 + (x-1)*D2 + ... + Dx (mod 65521) > >>> > >>> Putting aside people's concerns for the moment, I think this may be > >>> formulated in a slightly more convenient way: > >>> > >>> If > >>> X0, X1, ... are the data bytes > >>> An is 1 + Sum [i=0 .. n-1] Xi > >>> Bn is n + Sum [i=0 .. n-1] (n-i)Xi > >>> = Sum [i=1 .. n] Ai > >> > >> Yes, this can be calculated from the original expression. > >> > >>> > >>> (i.e., An, Bn are the accumulations for the first n bytes here, not the > >>> first n blocks) > >>> > >>> then > >>> > >>> A[n+v] - An = Sum[i=n .. n+v-1] Xi > >>> > >>> B[n+v] - Bn = v + (Sum [i=0 .. n+v-1] (n+v-i) Xi) > >>> - Sum [i=0 .. n-1] (n-i)Xi > >>> > >>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi) > >>> + (Sum [i=0 .. n-1] (n+v-i) Xi) > >>> - Sum [i=0 .. n-1] (n-i)Xi > >>> > >>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi) > >>> + Sum [i=0 .. n-1] ((n+v-i) - (n-i)) Xi > >>> > >>> = v + (Sum [i=n .. n+v-1] (n+v-i) Xi) > >>> + vSum [i=0 .. n-1] Xi > >>> > >>> = v + v(An - 1) + Sum [i=n .. n+v-1] (n+v-i) Xi > >>> > >>> = vAn + Sum [i=n .. n+v-1] (n+v-i) Xi > >>> > >>> = vAn + vSum [i=n .. n+v-1] Xi > >>> + Sum [i=n .. n+v-1] (n-i) Xi > >>> > >>> = vAn + vSum [i=n .. n+v-1] Xi > >>> + Sum [i=n .. n+v-1] (n-i) Xi > >>> > >>> = vA[n+v] + Sum [i=n .. n+v-1] (n-i) Xi > >>> > >>> Let j = i - n; then: > >>> > >>> B[n+v] - Bn = vA[n+v] - Sum [j=0 .. v-1] j X[j+n] > >>> > >>> Which gives us a multiplier j that increases with the X[] index. > >> > >> My original approach is to multiply the byte sequence with the **decreasing** > >> sequence. I think the increasing sequence here is more friendly to the trailing > >> bytes of the loop.:-) > >> > >>> > >>> > >>> I think this gives a core loop along the following lines. I don't know > >>> whether this is correct, or whether it works -- but feel free to take > >>> ideas from it if it helps. > >>> > >>> Accumulators are 32 bits. This provides for a fair number of iterations > >>> without overflow, but large input data will still require splitting into > >>> chunks, with modulo reduction steps in between. There are rather a lot > >>> of serial dependencies in the core loop, but since the operations > >>> involved are relatively cheap, this is probably not a significant issue > >>> in practice: the load-to-use dependency is probably the bigger concern. > >>> Pipelined loop unrolling could address these if necessary. > >>> > >>> The horizontal reductions (UADDV) still probably don't need to be done > >>> until after the last chunk. > >>> > >>> > >>> Beware: I wasn't careful with the initial values for Bn / An, so some > >>> additional adjustments might be needed... > >>> > >>> --8<-- > >>> > >>> ptrue pP.s > >>> ptrue pA.s > >>> > >>> mov zA.s, #0 // accumulator for An > >>> mov zB.s, #0 // accumulator for Bn > >>> index zJ.s, #0, #1 // zJ.s = [0, 1, .. V-1] > >>> > >>> mov zV.s, #0 > >>> incw zV.s // zV.s = [V, V, .. V] > >> > >> When I actually tested this code, I found that the byte length of the > >> test must be equal to the vector length divided by 4 (that is, an integer > >> multiple of the number of words in the SVE) and the result is correct. > >> > >> And I think one of the reasons is that the value stored in zV.s needs to be > >> changed in the last cycle. It should be changed to the actual number of > >> bytes remaining in the last cycle. :) > > > > Ah yes, you're quite right about this. In my derivation above, v will > > need to be smaller than the vector length when processing the final, > > partial block (if any). > > > >> > >>> > >>> // where V is number of elements per block > >>> // = the number of 32-bit elements that fit in a Z-register > >>> > >>> add xLimit, xX, xLen > >>> b start > >>> > >>> loop: > >>> ld1b zX.s, pP/z, [xX] > >>> incw xX > >> > >> In order to verify my conjecture, I added a bit of code to here, which is to > >> subtract the corresponding value from zV in the last loop. But it is correct > >> only when the number of bytes is less than one cycle. Test cases that exceed > >> one complete cycle will also fail. > >> > >> So I guess before calculating the last cycle, zA should be summed first, > >> because before the end of the cycle, zA and zB are scattered in the elements > >> of the vector, if the last cycle calculates zB, only part of zA is summed > >> ( Because pP/m does not count inactive elements), it should be incomplete. > >> > >> This is just my guess and has not yet been verified.:) > > > > I think zA shouldn't be wrong, since that only accumulates active > > elements anyway. I think it's the bogus zV multiplier involved in the > > update to zB that is the problem. > > > >>> > >>> add zA.s, pP/m, zA.s, zX.s // zA.s += zX.s > >>> > >>> msb zX.s, pP/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s > >>> > >>> movprfx zB, zA > >>> mad zB.s, pP/m, zV.s, zX.s // zB.s := zX.s + zA.s * V > > I found the bug I encountered earlier, that is, the calculation of zB here > needs to use pA with all elements activated. The reason is the same as my > previous guess, because all elements of zA should be involved when calculating zB. > Because the original calculation formula is like this. > > For example: > In the last loop: > left byte is: 3 | 4 | \0 | > zA.s is: 100 | 200 | 100 | 200 (sum = 600) > pP.s is: 1 | 1 | 0 | 0 (Only activate the first 2 channels) > > At this time, if the calculation of zB only takes the first 2 active elements, the data > is incomplete, because according to the description of the original algorithm, zB is always > based on the sum of all the accumulated bytes. Yes, you're quite right here: zX is partial: only the elements pP are valid; but all elements of zA and zB are always valid. I was focusing too much on the handling of the partial input block. > > Here we can simply change the prediction register used in the two sentences related to > zB to the one that is all true (it is pA in our code), like this: > msb zX.s, pA/m, zJ.s, zB.s // zX.s := zB.s - zX.s * zJ.s Are you sure about this? In a final partial block, the trailing elements of zX.s beyond pP will be leftover junk from the last iteration. This might require a bit more thought in order to get the final block handling correct. > movprfx zB, zA > mad zB.s, pA/m, zV.s, zX.s // zB.s := zX.s + zA.s * V > > >>> start: > >>> whilelo pP.s, xX, xLimit > >>> b.first loop > > > > There are a few options, I guess. > > > > One way is to use CNTP inside the loop to get the number of active > > elements, instead of just setting zV in advance. This approach may add > > a slight overhead, but it is worth experimenting with it. If the > > overhead is neglibible, then this approach has the example of being > > simple to understand. > > > > Alternatively, you could do an adjustment after the loop ends to > > subtract the appropriate amounts from zB. Unfortunately, pP has already > > been overwritten by the time the loop ends. If could be backed up into > > another P-register before overwriting it: this should be pretty low- > > overhead. > > > > A final option would be to change the b.first to a b.last, so that the > > loop ends after the final full block. Then copy-paste the loop body to > > execute once more, but using CNTP to get the element count to multiply > > by, as described above. This makes the code a bit larger, but it > > probably the best-performance option. You may be able to rotate the > > loop so that it breaks out after updating zA (which IIUC doesn't need to > > be done in a special way for the partial block). This would mean you > > only have to specialise the zB update code. > > Regarding the last loop, I tried to wrap another layer before and after the > core loop, and I have verified its correctness. > My code is at the end of this email. > > > > >>> > >>> // Collect the partial sums together: > >>> > >>> uaddv d0, pA, z0.s > >>> uaddv d1, pA, z1.s > >>> > >>> // Finally, add 1 to d0, and xLen to d1, and do modulo reduction. > >>> > >>> -->8-- > >>> > >>> [...] > >>> > >>> Cheers > >>> ---Dave > >>> . > >>> > >> > >> The code you sent me provides a correct way to deal with trailing bytes, > >> I need to spend some more time to debug the problem encountered above. > >> Thank you! > >> > >> Cheers.^_^ > > > > I was cheekily leaving the testing and debugging for you to do :) > > > > Part of my reason for interest in this is that if we enable SVE in the > > kernel, it would be good to have some clean examples for people to > > follow -- even if not this algo. SVE is a bit different to use > > compared with fixed-length vector ISAs, so worked examples are always > > useful. > Yes it is necessary. > > > > > Cheers > > ---Dave > > . > > > > This is the main part of my newly modified code: > --8<-- > ptrue p0.s > ptrue p1.s > > mov zA.s, #0 > mov zB.s, #0 > index zJ.s, #0, #1 > mov zV.s, #0 > incw zV.s > > add xLimit, x1, x2 > b start //Enter the core loop first > > trailloop: // Last cycle entrance > cntp x6, p1, p0.s // Get active element count of last cycle > cpy zV.s, p1/m, w6 // Set zV to the actual value. Note that you can also write "mov" here, but I'm not sure which alias is preferred. > loop: // Core loop entrance > ld1b zX.s, p0/z, [x1] > incw x1 > > add zA.s, p0/m, zA.s, zX.s // The calculation of zA still needs to use p0 > msb zX.s, p1/m, zJ.s, zB.s // Change p register here > movprfx zB, zA > mad zB.s, p1/m, zV.s, zX.s // Change p register here As discussed above, are you sure this is correct now? > start: > whilelo p0.s, x1, xLimit > b.last loop // The condition for the core loop to continue is that b.last is true > b.first trailloop // If b.last is false and b.first is true, it means the last cycle > > uaddv d0, p1, zA.s > uaddv d1, p1, zB.s > > mov x12, v0.2d[0] > mov x13, v1.2d[0] The "2" here seems not to be required by the syntax, although it's harmless. > add x10, x10, x12 > add x11, x11, x13 > add x11, x11, x2 If x10 and x11 are used as accmulators by the caller, I guess this works. > > mod65521 10, 14, 12 > mod65521 11, 14, 12 > lsl x11, x11, #16 > orr x0, x10, x11 > ret > -->8-- > > After this modification, The test results are correct when the data length is less than about 8 Kbyte, > part A will still be correct after 8K, and an overflow error will occur in part B. This is because A > only accumulates all the bytes, and the accumulative acceleration of B expands faster, because the > accumulative formula of B is: > B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521) > = n×D1 + (n−1)×D2 + (n−2)×D3 + ... + Dn + n (mod 65521) > > If we take the average value of Dx to 128 and n to 8192: > B = (1 + 2 + ... + 8129) * 128 + 8192 > = 4,295,499,776 (32bit overflow) > > So I think the 32-bit accumulator is still not enough for part B here. :) > > -- > Best regards, > Li Qiang That makes sense. I hadn't tried to calculate the actual bound. It may be worth trying this with 64-bit accumulators. This will probably slow things down, but it depends on the relative computation / memory throughput exhibited by the hardware. I think the code can't be bulletproof without breaking the input into chunks anyway, though. Cheers ---Dave