On 05.06.2013 11:34, Herbert Xu wrote:
> On Sun, Jun 02, 2013 at 07:51:52PM +0300, Jussi Kivilinna wrote:
>> It appears that the performance of 'vpgatherdd' is suboptimal for this
>> kind of workload (tested on Core i5-4570) and causes blowfish-avx2 to
>> be significantly slower than blowfish-amd64. So disable the AVX2
>> implementation to avoid performance regressions.
>>
>> Signed-off-by: Jussi Kivilinna <jussi.kivilinna@xxxxxx>
>
> Both patches applied to crypto. I presume you're working on
> a more permanent solution on this?

Yes, I've been looking for a solution. The problem is that I assumed
vgather would be quicker than emulating a gather with vpextr/vpinsr
instructions. But it appears that vgather runs at about the same speed
as a group of vpextr/vpinsr instructions doing the gather manually. So
executing

asm volatile(
        "vpgatherdd %%xmm0, (%[ptr], %%xmm8, 4), %%xmm9;\n\t"
        "vpcmpeqd %%xmm0, %%xmm0, %%xmm0; /* reset mask */\n\t"
        "vpgatherdd %%xmm0, (%[ptr], %%xmm9, 4), %%xmm8;\n\t"
        "vpcmpeqd %%xmm0, %%xmm0, %%xmm0;\n\t"
        :: [ptr] "r" (&mem[0])
        : "memory", "xmm0", "xmm8", "xmm9");

in a loop is slightly _slower_ than manually extracting and inserting
the values with

asm volatile(
        "vmovd %%xmm8, %%eax;\n\t"
        "vpextrd $1, %%xmm8, %%edx;\n\t"
        "vmovd (%[ptr], %%rax, 4), %%xmm10;\n\t"
        "vpextrd $2, %%xmm8, %%eax;\n\t"
        "vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10;\n\t"
        "vpextrd $3, %%xmm8, %%edx;\n\t"
        "vpinsrd $2, (%[ptr], %%rax, 4), %%xmm10, %%xmm10;\n\t"
        "vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm10, %%xmm9;\n\t"

        "vmovd %%xmm9, %%eax;\n\t"
        "vpextrd $1, %%xmm9, %%edx;\n\t"
        "vmovd (%[ptr], %%rax, 4), %%xmm10;\n\t"
        "vpextrd $2, %%xmm9, %%eax;\n\t"
        "vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10;\n\t"
        "vpextrd $3, %%xmm9, %%edx;\n\t"
        "vpinsrd $2, (%[ptr], %%rax, 4), %%xmm10, %%xmm10;\n\t"
        "vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm10, %%xmm8;\n\t"
        :: [ptr] "r" (&mem[0])
        : "memory", "eax", "edx", "xmm8", "xmm9", "xmm10");

vpextr/vpinsr cannot be used with the 256-bit wide ymm registers, so
'vinserti128/vextracti128' instructions are needed on top, which makes
the manual gather about the same speed as vpgatherdd.

Now, these block cipher implementations need to use all bytes of a
vector register for table look-ups, and the way this is done in the AVX
implementation of Twofish (move data from the vector register to
general purpose registers, handle the byte extraction and table
look-ups there, and move the processed data back to the vector
register) is about two to three times faster than the current AVX2
implementation's use of vgather.

Blowfish does not do much processing besides the table look-ups, so
there is not much more that can be done there. With Twofish, the table
look-ups are the most computationally heavy part, and I don't think the
wider vector registers are going to give much of a boost in the other
parts. So the permanent solution is likely to be a revert.

-Jussi

>
> Thanks,
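For anyone who wants to reproduce the measurement, here is a minimal
standalone sketch built around the vpgatherdd loop quoted above. The
table size, iteration count and output format are illustrative choices
of mine, not taken from the original benchmark; it assumes x86-64, GCC
and an AVX2-capable CPU. Feeding each gather result back in as the next
set of indices keeps the loop latency-bound, similar to chained S-box
look-ups in a cipher round:

/* gather_bench.c: build with gcc -O2 -mavx2 gather_bench.c */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() */

static uint32_t mem[256];       /* zero-initialized stand-in S-box */

int main(void)
{
    const unsigned long iters = 100000000UL;
    unsigned long n = iters;
    unsigned long long t0, t1;

    t0 = __rdtsc();
    asm volatile(
        "vpxor %%xmm8, %%xmm8, %%xmm8\n\t"    /* indices = 0 */
        "vpcmpeqd %%xmm0, %%xmm0, %%xmm0\n"   /* gather mask = all ones */
        "1:\n\t"
        /* each vpgatherdd clears its mask, so reset it every time */
        "vpgatherdd %%xmm0, (%[ptr], %%xmm8, 4), %%xmm9\n\t"
        "vpcmpeqd %%xmm0, %%xmm0, %%xmm0\n\t"
        "vpgatherdd %%xmm0, (%[ptr], %%xmm9, 4), %%xmm8\n\t"
        "vpcmpeqd %%xmm0, %%xmm0, %%xmm0\n\t"
        "decq %[n]\n\t"
        "jnz 1b"
        : [n] "+r" (n)
        : [ptr] "r" (mem)
        : "memory", "cc", "xmm0", "xmm8", "xmm9");
    t1 = __rdtsc();

    /* mem[] is all zeroes, so the gathered indices stay 0 throughout */
    printf("%.2f cycles per two chained gathers\n",
           (double)(t1 - t0) / iters);
    return 0;
}

Swapping the loop body for the vpextrd/vpinsrd sequence from the second
snippet gives the comparison figure.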
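The 256-bit manual gather mentioned above could look roughly like the
sketch below. The register assignments are assumptions of mine (eight
dword indices arrive in ymm8, the gathered words end up in ymm9), not
code from the actual patches. Since vpextrd/vpinsrd only see the low
128-bit lane, the high lane is split off with vextracti128, gathered
separately, and merged back with vinserti128:

#include <stdint.h>

static inline void gather8_manual(const uint32_t *tab)
{
    asm volatile(
        /* split off the high four indices of ymm8 */
        "vextracti128 $1, %%ymm8, %%xmm11\n\t"
        /* low four elements: xmm8 indices -> xmm10 */
        "vmovd %%xmm8, %%eax\n\t"
        "vpextrd $1, %%xmm8, %%edx\n\t"
        "vmovd (%[ptr], %%rax, 4), %%xmm10\n\t"
        "vpextrd $2, %%xmm8, %%eax\n\t"
        "vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10\n\t"
        "vpextrd $3, %%xmm8, %%edx\n\t"
        "vpinsrd $2, (%[ptr], %%rax, 4), %%xmm10, %%xmm10\n\t"
        "vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm10, %%xmm10\n\t"
        /* high four elements: xmm11 indices -> xmm12 */
        "vmovd %%xmm11, %%eax\n\t"
        "vpextrd $1, %%xmm11, %%edx\n\t"
        "vmovd (%[ptr], %%rax, 4), %%xmm12\n\t"
        "vpextrd $2, %%xmm11, %%eax\n\t"
        "vpinsrd $1, (%[ptr], %%rdx, 4), %%xmm12, %%xmm12\n\t"
        "vpextrd $3, %%xmm11, %%edx\n\t"
        "vpinsrd $2, (%[ptr], %%rax, 4), %%xmm12, %%xmm12\n\t"
        "vpinsrd $3, (%[ptr], %%rdx, 4), %%xmm12, %%xmm12\n\t"
        /* merge: ymm9 = [xmm12 : xmm10] */
        "vinserti128 $1, %%xmm12, %%ymm10, %%ymm9"
        :: [ptr] "r" (tab)
        : "memory", "rax", "rdx", "xmm9", "xmm10", "xmm11", "xmm12");
}

The two cross-lane instructions per eight elements are cheap on their
own, but the full sequence already costs about as much as one 256-bit
vpgatherdd, which is why neither path comes out ahead.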
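For contrast, the "via general purpose registers" pattern that the
Twofish comparison refers to, reduced to a sketch: one 32-bit word and
four consecutive 256-entry dword tables combined with xor. The real
twofish-avx code processes several words at once and has its own
combining logic, so take this purely as an illustration of the pattern,
with the register choices again being assumptions of mine:

#include <stdint.h>

/* tab is laid out as four consecutive 1 KiB dword tables */
static inline void sbox_word_via_gpr(const uint32_t tab[4][256])
{
    asm volatile(
        "vmovd %%xmm8, %%eax\n\t"               /* vector reg -> GPR */
        "movzbl %%al, %%edx\n\t"                /* byte 0 */
        "movl (%[t], %%rdx, 4), %%ecx\n\t"
        "movzbl %%ah, %%edx\n\t"                /* byte 1 */
        "xorl 1024(%[t], %%rdx, 4), %%ecx\n\t"
        "shrl $16, %%eax\n\t"
        "movzbl %%al, %%edx\n\t"                /* byte 2 */
        "xorl 2048(%[t], %%rdx, 4), %%ecx\n\t"
        "movzbl %%ah, %%edx\n\t"                /* byte 3 */
        "xorl 3072(%[t], %%rdx, 4), %%ecx\n\t"
        "vmovd %%ecx, %%xmm8"                   /* GPR -> vector reg */
        :: [t] "r" (tab)
        : "memory", "eax", "ecx", "edx", "xmm8");
}

The byte extraction and look-ups here all run as cheap scalar
operations, which presumably contributes to the two to three times
difference over the vgather version reported above.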