Re: [PATCH 0/2] xor: enable auto-vectorization in Clang

Nathan Chancellor <nathan@xxxxxxxxxx> · Thu, 27 Jan 2022 14:36:57 -0700

Hi Ard,

On Thu, Jan 27, 2022 at 09:12:25AM +0100, Ard Biesheuvel wrote:
> Update the xor_blocks() prototypes so that the compiler understands that
> the inputs always refer to distinct regions of memory. This is implied
> by the existing implementations, as they use different granularities for
> the load/xor/store loops.
> 
> With that, we can fix the ARM/Clang version, which refuses to SIMD
> vectorize otherwise, and throws a spurious warning related to the GCC
> version being incompatible.
> 
> Cc: Nick Desaulniers <ndesaulniers@xxxxxxxxxx>
> Cc: Nathan Chancellor <nathan@xxxxxxxxxx>
> 
> Ard Biesheuvel (2):
>   lib/xor: make xor prototypes more friendely to compiler vectorization
>   crypto: arm/xor - make vectorized C code Clang-friendly

I tested multi_v7_defconfig + CONFIG_BTRFS=y (to get
CONFIG_XOR_BLOCKS=y) in QEMU 6.2.0 (10 boots) and the xor neon code gets
faster according to do_xor_speed():

mainline @ 626b2dda7651:

[    2.591449]    neon            :  1166 MB/sec
[    2.579454]    neon            :  1118 MB/sec
[    2.589061]    neon            :  1163 MB/sec
[    2.581827]    neon            :  1167 MB/sec
[    2.599079]    neon            :  1166 MB/sec
[    2.579252]    neon            :  1147 MB/sec
[    2.582637]    neon            :  1168 MB/sec
[    2.582872]    neon            :  1164 MB/sec
[    2.570671]    neon            :  1167 MB/sec
[    2.571830]    neon            :  1166 MB/sec

mainline @ 626b2dda7651 with series:

[    2.570227]    neon            :  1238 MB/sec
[    2.571642]    neon            :  1237 MB/sec
[    2.580370]    neon            :  1234 MB/sec
[    2.581966]    neon            :  1238 MB/sec
[    2.582313]    neon            :  1236 MB/sec
[    2.572291]    neon            :  1238 MB/sec
[    2.570625]    neon            :  1233 MB/sec
[    2.571897]    neon            :  1234 MB/sec
[    2.589616]    neon            :  1228 MB/sec
[    2.582449]    neon            :  1236 MB/sec

This series is currently broken for powerpc [1], as the functions in
arch/powerpc/lib/xor_vmx.c were not updated.

arch/powerpc/lib/xor_vmx.c:52:6: error: conflicting types for '__xor_altivec_2'
void __xor_altivec_2(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:9:6: note: previous declaration is here
void __xor_altivec_2(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:70:6: error: conflicting types for '__xor_altivec_3'
void __xor_altivec_3(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:11:6: note: previous declaration is here
void __xor_altivec_3(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:92:6: error: conflicting types for '__xor_altivec_4'
void __xor_altivec_4(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:14:6: note: previous declaration is here
void __xor_altivec_4(unsigned long bytes, unsigned long * __restrict p1,
     ^
arch/powerpc/lib/xor_vmx.c:119:6: error: conflicting types for '__xor_altivec_5'
void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in,
     ^
arch/powerpc/lib/xor_vmx.h:18:6: note: previous declaration is here
void __xor_altivec_5(unsigned long bytes, unsigned long * __restrict p1,
     ^
4 errors generated.

If I fix that up [2], it builds and resolves an instance of
-Wframe-larger-than= in the xor altivec code, as seen with
pmac32_defconfig.

Before this series:

arch/powerpc/lib/xor_vmx.c:119:6: error: stack frame size (1232) exceeds limit (1024) in '__xor_altivec_5' [-Werror,-Wframe-larger-than]
void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in,
     ^
1 error generated.

After this patch (with CONFIG_FRAME_WARN=100 and
CONFIG_PPC_DISABLE_WERROR=y):

arch/powerpc/lib/xor_vmx.c:52:6: warning: stack frame size (128) exceeds limit (100) in '__xor_altivec_2' [-Wframe-larger-than]
void __xor_altivec_2(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:71:6: warning: stack frame size (160) exceeds limit (100) in '__xor_altivec_3' [-Wframe-larger-than]
void __xor_altivec_3(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:95:6: warning: stack frame size (144) exceeds limit (100) in '__xor_altivec_4' [-Wframe-larger-than]
void __xor_altivec_4(unsigned long bytes,
     ^
arch/powerpc/lib/xor_vmx.c:124:6: warning: stack frame size (160) exceeds limit (100) in '__xor_altivec_5' [-Wframe-larger-than]
void __xor_altivec_5(unsigned long bytes,
     ^
4 warnings generated.

There is a similar performance gain as ARM according to do_xor_speed():

Before:

   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   219 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec
   altivec         :   222 MB/sec

After:

   altivec         :   278 MB/sec
   altivec         :   276 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec
   altivec         :   278 MB/sec

I did also build test arm64 and x86_64 and saw no errors. I did runtime
test arm64 for improvements and did not see any, which is good, since I
take that as meaning it was working fine before and there is no
regression.

Once the build error is fixed, consider this series:

Tested-by: Nathan Chancellor <nathan@xxxxxxxxxx>

[1]: https://lore.kernel.org/r/202112310646.kuh2pXiG-lkp@xxxxxxxxx/
[2]: https://github.com/ClangBuiltLinux/linux/issues/563#issuecomment-1005175153

Cheers,
Nathan