Hi Ard,

On Fri, Aug 31, 2018 at 05:56:24PM +0200, Ard Biesheuvel wrote:
> Hi Eric,
>
> On 31 August 2018 at 10:01, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> > From: Eric Biggers <ebiggers@xxxxxxxxxx>
> >
> > Optimize ChaCha20 NEON performance by:
> >
> > - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
> > - Streamlining the part that adds the original state and XORs the data.
> > - Making some other small tweaks.
> >
> > On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
> > about 11.9 cycles per byte to 11.3.
> >
> > There is a tradeoff involved with the 'vtbl.8' rotation method since
> > there is at least one CPU where it's not fastest.  But it seems to be a
> > better default; see the added comment.
> >
> > Signed-off-by: Eric Biggers <ebiggers@xxxxxxxxxx>
> > ---
> >  arch/arm/crypto/chacha20-neon-core.S | 289 ++++++++++++++-------------
> >  1 file changed, 147 insertions(+), 142 deletions(-)
> >
> > diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
> > index 3fecb2124c35a..d381cebaba31d 100644
> > --- a/arch/arm/crypto/chacha20-neon-core.S
> > +++ b/arch/arm/crypto/chacha20-neon-core.S
> > @@ -18,6 +18,33 @@
> >   * (at your option) any later version.
> >   */
> >
> > + /*
> > +  * NEON doesn't have a rotate instruction.  The alternatives are, more or less:
> > +  *
> > +  * (a)  vshl.u32 + vsri.u32           (needs temporary register)
> > +  * (b)  vshl.u32 + vshr.u32 + vorr    (needs temporary register)
> > +  * (c)  vrev32.16                     (16-bit rotations only)
> > +  * (d)  vtbl.8 + vtbl.8               (multiple of 8 bits rotations only,
> > +  *                                     needs index vector)
> > +  *
> > +  * ChaCha20 has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit
> > +  * rotations, the only choices are (a) and (b).  We use (a) since it takes
> > +  * two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
> > +  *
> > +  * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
> > +  * and doesn't need a temporary register.
> > +  *
> > +  * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this
> > +  * sequence is twice as fast as (a), even when doing (a) on multiple
> > +  * registers simultaneously to eliminate the stall between vshl and vsri.
> > +  * Also, it parallelizes better when temporary registers are scarce.
> > +  *
> > +  * A disadvantage is that on Cortex-A53, the vtbl sequence is the same
> > +  * speed as (a), so the need to load the rotation table actually makes the
> > +  * vtbl method slightly slower overall on that CPU.  Still, it seems to be
> > +  * a good compromise to get a significant speed boost on some CPUs.
> > +  */
> > +
>
> Thanks for sharing these results. I have been working on 32-bit ARM
> code under the assumption that the A53 pipeline more or less resembles
> the A7 one, but this is obviously not the case looking at your
> results. My contributions to arch/arm/crypto mainly involved Crypto
> Extensions code, which the A7 does not support in the first place, so
> it does not really matter, but I will keep this in mind going forward.
>
> >  #include <linux/linkage.h>
> >
> >  	.text
> > @@ -46,6 +73,9 @@ ENTRY(chacha20_block_xor_neon)
> >  	vmov		q10, q2
> >  	vmov		q11, q3
> >
> > +	ldr		ip, =.Lrol8_table
> > +	vld1.8		{d10}, [ip, :64]
> > +
>
> I usually try to avoid the =literal ldr notation, because it involves
> an additional load via the D-cache. Could you use a 64-bit literal
> instead of a byte array and use vldr instead? Or switch to adr? (and
> move the literal in range, I suppose)

'adr' works if I move rol8_table to between chacha20_block_xor_neon() and
chacha20_4block_xor_neon().
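I.e., something like this (untested sketch; the directives, the table
contents, and the exact placement shown here are just for illustration):

	...
	adr		ip, .Lrol8_table
	vld1.8		{d10}, [ip, :64]
	...
ENDPROC(chacha20_block_xor_neon)

	.align		4
.Lrol8_table:
	.byte		3, 0, 1, 2, 7, 4, 5, 6	// vtbl.8 indices for rol32(x, 8)

ENTRY(chacha20_4block_xor_neon)
	...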
> >  ENTRY(chacha20_4block_xor_neon)
> > -	push		{r4-r6, lr}
> > -	mov		ip, sp			// preserve the stack pointer
> > -	sub		r3, sp, #0x20		// allocate a 32 byte buffer
> > -	bic		r3, r3, #0x1f		// aligned to 32 bytes
> > -	mov		sp, r3
> > +	push		{r4}
>
> The ARM EABI mandates 8 byte stack alignment, and if you take an
> interrupt right at this point, you will enter the interrupt handler
> with a misaligned stack. Whether this could actually cause any
> problems is a different question, but it is better to keep it 8-byte
> aligned to be sure.
>
> I'd suggest you just push r4 and lr, so you can return from the
> function by popping r4 and pc, as is customary on ARM.

Are you sure that's an actual constraint?  That would imply you could never
push/pop an odd number of ARM registers to/from the stack... but if I
disassemble an ARM kernel, such pushes/pops occur frequently in generated
code.  So 8-byte alignment must only be required when a subroutine is called.

If I understand the interrupt handling code correctly, the kernel stack is
being aligned in the svc_entry macro from __irq_svc in
arch/arm/kernel/entry-armv.S:

	.macro	svc_entry, stack_hole=0, trace=1, uaccess=1
 UNWIND(.fnstart		)
 UNWIND(.save {r0 - pc}		)
	sub	sp, sp, #(SVC_REGS_SIZE + \stack_hole - 4)
#ifdef CONFIG_THUMB2_KERNEL
 SPFIX(	str	r0, [sp]	)	@ temporarily saved
 SPFIX(	mov	r0, sp		)
 SPFIX(	tst	r0, #4		)	@ test original stack alignment
 SPFIX(	ldr	r0, [sp]	)	@ restored
#else
 SPFIX(	tst	sp, #4		)
#endif
 SPFIX(	subeq	sp, sp, #4	)

So 'sp' must always be 4-byte aligned, but not necessarily 8-byte aligned.

Anyway, it may be a moot point as I'm planning to use r5 in the v2 patch, so
I'll just be pushing {r4-r5} anyway...

>
> > +	mov		r4, sp			// preserve the stack pointer
> > +	sub		ip, sp, #0x20		// allocate a 32 byte buffer
> > +	bic		ip, ip, #0x1f		// aligned to 32 bytes
> > +	mov		sp, ip
> >
>
> Is there a reason for preferring ip over r3 for preserving the
> original value of sp?

You mean preferring r4 over ip?  It's mostly arbitrary which register is
assigned to what; I just like using 'ip' for shorter-lived temporary stuff.

> > +
> > +	// x8..11[0-3] += s8..11[0-3]	(add orig state to 3rd row of each block)
> > +	vadd.u32	q8, q8, q0
> > +	vadd.u32	q10, q10, q0
> > +	vld1.32		{q14-q15}, [sp, :256]
> > +	vadd.u32	q9, q9, q0
> > +	adr		ip, .Lctrinc2
> > +	vadd.u32	q11, q11, q0
> > +
> > +	// x12..15[0-3] += s12..15[0-3]	(add orig state to 4th row of each block)
> > +	vld1.32		{q0}, [ip, :128]	// load (1, 0, 2, 0)
>
> Would something like
>
> vmov.i32 d0, #1
> vmovl.u32 q0, d0
> vadd.u64 d1, d1
>
> (untested) work as well?
>
> > +	vadd.u32	q15, q15, q1
> > +	vadd.u32	q13, q13, q1
> > +	vadd.u32	q14, q14, q1
> > +	vadd.u32	q12, q12, q1
> > +	vadd.u32	d30, d30, d1		// x12[3] += 2
> > +	vadd.u32	d26, d26, d1		// x12[2] += 2
> > +	vadd.u32	d28, d28, d0		// x12[1] += 1
> > +	vadd.u32	d30, d30, d0		// x12[3] += 1
> > +

Well, or:

	vmov.u64	d0, #0x00000000ffffffff
	vadd.u32	d1, d0, d0

And then subtract instead of add.

But actually there's a better way: load {0, 1, 2, 3} (i.e. the original
'CTRINC') and add it to q12 *before* re-interleaving q12-15.  That avoids
unnecessary additions.  I'll send a new patch with that.

- Eric
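P.S.  To illustrate the "add CTRINC before re-interleaving" idea, it would be
something along these lines (untested sketch; the label and register choices
are only illustrative, not the actual v2 code):

	.align		4
.Lctrinc:
	.word		0, 1, 2, 3

	...
	// q12 still holds the de-interleaved x12 words, i.e. the block
	// counters of blocks 0..3
	adr		ip, .Lctrinc
	vld1.32		{q0}, [ip, :128]	// load (0, 1, 2, 3)
	vadd.u32	q12, q12, q0		// x12[i] += i
	// ...then re-interleave q12-q15 and add the original state rows,
	// without the extra per-lane counter additions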