On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote: > >>>>> + > >>>>> + // One-time XTS preparation > >>>>> + > >>>>> + /* > >>>>> + * Allocate stack space to store 128 bytes worth of tweaks. For > >>>>> + * performance, this space is aligned to a 16-byte boundary so that we > >>>>> + * can use the load/store instructions that declare 16-byte alignment. > >>>>> + */ > >>>>> + sub sp, #128 > >>>>> + bic sp, #0xf > >>>> > >>>> > >>>> This fails here when building with CONFIG_THUMB2_KERNEL=y > >>>> > >>>> AS arch/arm/crypto/speck-neon-core.o > >>>> > >>>> arch/arm/crypto/speck-neon-core.S: Assembler messages: > >>>> > >>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here -- > >>>> `bic sp,#0xf' > >>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here -- > >>>> `bic sp,#0xf' > >>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here -- > >>>> `bic sp,#0xf' > >>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here -- > >>>> `bic sp,#0xf' > >>>> > >>>> In a quick hack this change seems to address it: > >>>> > >>>> > >>>> - sub sp, #128 > >>>> - bic sp, #0xf > >>>> + mov r6, sp > >>>> + sub r6, #128 > >>>> + bic r6, #0xf > >>>> + mov sp, r6 > >>>> > >>>> But there is probably a better solution to address this. > >>>> > >>> > >>> Given that there is no NEON on M class cores, I recommend we put something like > >>> > >>> THUMB(bx pc) > >>> THUMB(nop.w) > >>> THUMB(.arm) > >>> > >>> at the beginning and be done with it. > >> > >> I mean nop.n or just nop, of course, and we may need a '.align 2' at > >> the beginning as well. > > > > Wouldn't it be preferable to have it assemble it in Thumb2 too? It seems > > that bic sp,#0xf is the only issue... > > > > Well, in general, yes. In the case of NEON code, not really, since the > resulting code will not be smaller anyway, because the Thumb2 NEON > opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON > units, so all cores that this code can run on will be able to run in > ARM mode. > > So from a maintainability pov, having code that only assembles in one > way is better than having code that must compile both to ARM and to > Thumb2 opcodes. > > Just my 2 cents, anyway. I don't have too much of a preference, though Stefan's suggested 4 instructions can be reduced to 3, which also matches what aes-neonbs-core.S does: sub r12, sp, #128 bic r12, #0xf mov sp, r12 Ard, is the following what you're suggesting instead? diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S index 3c1e203e53b9..c989ce3dc057 100644 --- a/arch/arm/crypto/speck-neon-core.S +++ b/arch/arm/crypto/speck-neon-core.S @@ -8,6 +8,7 @@ */ #include <linux/linkage.h> +#include <asm/assembler.h> .text .fpu neon @@ -233,6 +234,12 @@ * nonzero multiple of 128. */ .macro _speck_xts_crypt n, decrypting + + .align 2 + THUMB(bx pc) + THUMB(nop) + THUMB(.arm) + push {r4-r7} mov r7, sp @@ -413,6 +420,8 @@ mov sp, r7 pop {r4-r7} bx lr + + THUMB(.thumb) .endm ENTRY(speck128_xts_encrypt_neon)