On 18 June 2018 at 23:56, Eric Biggers <ebiggers@xxxxxxxxxx> wrote: > On Sun, Jun 17, 2018 at 01:10:41PM +0200, Ard Biesheuvel wrote: >> >>>>> + >> >>>>> + // One-time XTS preparation >> >>>>> + >> >>>>> + /* >> >>>>> + * Allocate stack space to store 128 bytes worth of tweaks. For >> >>>>> + * performance, this space is aligned to a 16-byte boundary so that we >> >>>>> + * can use the load/store instructions that declare 16-byte alignment. >> >>>>> + */ >> >>>>> + sub sp, #128 >> >>>>> + bic sp, #0xf >> >>>> >> >>>> >> >>>> This fails here when building with CONFIG_THUMB2_KERNEL=y >> >>>> >> >>>> AS arch/arm/crypto/speck-neon-core.o >> >>>> >> >>>> arch/arm/crypto/speck-neon-core.S: Assembler messages: >> >>>> >> >>>> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here -- >> >>>> `bic sp,#0xf' >> >>>> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here -- >> >>>> `bic sp,#0xf' >> >>>> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here -- >> >>>> `bic sp,#0xf' >> >>>> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here -- >> >>>> `bic sp,#0xf' >> >>>> >> >>>> In a quick hack this change seems to address it: >> >>>> >> >>>> >> >>>> - sub sp, #128 >> >>>> - bic sp, #0xf >> >>>> + mov r6, sp >> >>>> + sub r6, #128 >> >>>> + bic r6, #0xf >> >>>> + mov sp, r6 >> >>>> >> >>>> But there is probably a better solution to address this. >> >>>> >> >>> >> >>> Given that there is no NEON on M class cores, I recommend we put something like >> >>> >> >>> THUMB(bx pc) >> >>> THUMB(nop.w) >> >>> THUMB(.arm) >> >>> >> >>> at the beginning and be done with it. >> >> >> >> I mean nop.n or just nop, of course, and we may need a '.align 2' at >> >> the beginning as well. >> > >> > Wouldn't it be preferable to have it assemble it in Thumb2 too? It seems >> > that bic sp,#0xf is the only issue... >> > >> >> Well, in general, yes. In the case of NEON code, not really, since the >> resulting code will not be smaller anyway, because the Thumb2 NEON >> opcodes are all 4 bytes. Also, Thumb2-only cores don't have NEON >> units, so all cores that this code can run on will be able to run in >> ARM mode. >> >> So from a maintainability pov, having code that only assembles in one >> way is better than having code that must compile both to ARM and to >> Thumb2 opcodes. >> >> Just my 2 cents, anyway. > > I don't have too much of a preference, though Stefan's suggested 4 instructions > can be reduced to 3, which also matches what aes-neonbs-core.S does: > > sub r12, sp, #128 > bic r12, #0xf > mov sp, r12 > > Ard, is the following what you're suggesting instead? > Yes, but after looking at the actual code, I prefer the change above. The access occurs only once, not in the loop so the additional instructions should not affect performance. Apologies for the noise. > diff --git a/arch/arm/crypto/speck-neon-core.S b/arch/arm/crypto/speck-neon-core.S > index 3c1e203e53b9..c989ce3dc057 100644 > --- a/arch/arm/crypto/speck-neon-core.S > +++ b/arch/arm/crypto/speck-neon-core.S > @@ -8,6 +8,7 @@ > */ > > #include <linux/linkage.h> > +#include <asm/assembler.h> > > .text > .fpu neon > @@ -233,6 +234,12 @@ > * nonzero multiple of 128. > */ > .macro _speck_xts_crypt n, decrypting > + > + .align 2 > + THUMB(bx pc) > + THUMB(nop) > + THUMB(.arm) > + > push {r4-r7} > mov r7, sp > > @@ -413,6 +420,8 @@ > mov sp, r7 > pop {r4-r7} > bx lr > + > + THUMB(.thumb) > .endm > > ENTRY(speck128_xts_encrypt_neon)