On 2020/11/16 23:56, Dave Martin wrote:
>> --8<--
>> ...
>> adler_A .req x10
>> adler_B .req x11
>>
>> .macro adler32_core
>> 	ld1b zX.h, p0/z, [x1]		// load bytes
>> 	inch x1
>>
>> 	uaddv d0, p0, zX.h
>> 	mul zX.h, p0/m, zX.h, zJ.h	// Sum [j=0 .. v-1] j*X[j+n]
>> 	mov x9, v0.d[0]
>> 	uaddv d1, p0, zX.h
>> 	add adler_A, adler_A, x9	// A[n+v] = An + Sum [j=0 .. v-1] X[j]
>> 	mov x9, v1.d[0]
>> 	madd adler_B, x7, adler_A, adler_B	// Bn + v*A[n+v]
>> 	sub adler_B, adler_B, x9	// B[n+v] = Bn + v*A[n+v] - Sum [j=0 .. v-1] j*X[j+n]
>> .endm
> If this has best performance, I find that quite surprising. Those uaddv
> instructions will stop the vector lanes flowing independently inside the
> loop, so if an individual element load is slow arriving then everything
> will have to wait.

I don't know much about this problem. Do you mean that the uaddv
instructions used inside the loop have a significant impact on
performance?

> A decent hardware prefetcher may tend to hide that issue for sequential
> memory access, though: i.e., if the hardware does a decent job of
> fetching data before the actual loads are issued, the data may appear to
> arrive with minimal delay.
>
> The effect might be a lot worse for algorithms that have less
> predictable memory access patterns.
>
> Possibly you do win some additional performance due to processing twice
> as many elements at once, here.

I think so. Compared to loading the bytes into zX.h, loading them
directly into zX.b and then widening them with uunpklo/uunpkhi performs
noticeably better (about 20% faster). This may be the reason.
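For reference, here is a rough C sketch of that zX.b + uunpklo/uunpkhi
variant using the SVE ACLE intrinsics (arm_sve.h). The helper name is
mine, and the periodic mod-65521 reductions and the predicated loop
tail are omitted to keep it short, so treat it as an illustration
rather than the actual patch:

#include <arm_sve.h>
#include <stdint.h>

/* One full-vector step of the Adler-32 recurrence:
 *   A[n+v] = A[n] + Sum [j=0 .. v-1] X[n+j]
 *   B[n+v] = B[n] + v*A[n+v] - Sum [j=0 .. v-1] j*X[n+j]
 * The caller must reduce *A and *B mod 65521 often enough that the
 * 64-bit accumulators cannot overflow.
 */
static void adler32_block_sketch(const uint8_t *p,
				 uint64_t *A, uint64_t *B)
{
	uint64_t v = svcntb();			/* bytes per full vector */

	/* Load a full vector of bytes, then widen to halfwords. */
	svuint8_t  x  = svld1_u8(svptrue_b8(), p);
	svuint16_t lo = svunpklo_u16(x);	/* lanes j = 0 .. v/2-1 */
	svuint16_t hi = svunpkhi_u16(x);	/* lanes j = v/2 .. v-1 */

	svbool_t ph = svptrue_b16();
	svuint16_t jlo = svindex_u16(0, 1);	/* j for the low half */
	svuint16_t jhi = svindex_u16((uint16_t)svcnth(), 1);

	uint64_t sum_x  = svaddv_u16(ph, lo) + svaddv_u16(ph, hi);
	uint64_t sum_jx = svaddv_u16(ph, svmul_u16_x(ph, lo, jlo))
			+ svaddv_u16(ph, svmul_u16_x(ph, hi, jhi));

	*A += sum_x;
	*B += v * *A - sum_jx;
}

The j*X products fit in 16 bits even at the maximum 2048-bit vector
length (j <= 255, X <= 255, and 255*255 = 65025 < 65536), and svaddv
already widens its reduction result to 64 bits, so no extra widening
is needed inside the step.

--
Best regards,
Li Qiang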