On Fri, 4 Oct 2019 at 17:15, René van Dorst <opensource@xxxxxxxxxx> wrote:
>
> Hi Jason,
>
> Quoting "Jason A. Donenfeld" <Jason@xxxxxxxxx>:
>
> > On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel
> > <ard.biesheuvel@xxxxxxxxxx> wrote:
> >> The round count is passed via the fifth function parameter, so it is
> >> already on the stack. Reloading it for every block doesn't sound like
> >> a huge deal to me.
> >
> > Please benchmark it to indicate that, if it really isn't a big deal. I
> > recall finding that memory accesses on common mips32r2 commodity
> > router hardware was extremely inefficient. The whole thing is designed
> > to minimize memory accesses, which are the primary bottleneck on that
> > platform.
>
> I also think it isn't a big deal, but I shall benchmark it this weekend.
> If I am correct a memory write will first put in cache. So if you read
> it again and it is in cache it is very fast. 1 or 2 clockcycles.
> Also the value isn't used directly after it is read.
> So cpu don't have to stall on this read.
>

Thanks René.

Note that the round count is not being spilled. I [re]load it from the
stack as a function parameter.

So instead of

li $at, 20

I do

lw $at, 16($sp)

Thanks a lot for taking the time to double check this. I think it
would be nice to be able to expose xchacha12 like we do on other
architectures.

Note that for xchacha, I also added a hchacha_block() routine based on
your code (with the round count as the third argument) [0]. Please let
me know if you see anything wrong with that.

+.globl hchacha_block
+.ent hchacha_block
+hchacha_block:
+	.frame $sp, STACK_SIZE, $ra
+
+	addiu $sp, -STACK_SIZE
+
+	/* Save s0-s7 */
+	sw $s0, 0($sp)
+	sw $s1, 4($sp)
+	sw $s2, 8($sp)
+	sw $s3, 12($sp)
+	sw $s4, 16($sp)
+	sw $s5, 20($sp)
+	sw $s6, 24($sp)
+	sw $s7, 28($sp)
+
+	lw X0, 0(STATE)
+	lw X1, 4(STATE)
+	lw X2, 8(STATE)
+	lw X3, 12(STATE)
+	lw X4, 16(STATE)
+	lw X5, 20(STATE)
+	lw X6, 24(STATE)
+	lw X7, 28(STATE)
+	lw X8, 32(STATE)
+	lw X9, 36(STATE)
+	lw X10, 40(STATE)
+	lw X11, 44(STATE)
+	lw X12, 48(STATE)
+	lw X13, 52(STATE)
+	lw X14, 56(STATE)
+	lw X15, 60(STATE)
+
+.Loop_hchacha_xor_rounds:
+	addiu $a2, -2
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15,  8);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7,  7);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14, 16);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
+	bnez $a2, .Loop_hchacha_xor_rounds
+
+	sw X0, 0(OUT)
+	sw X1, 4(OUT)
+	sw X2, 8(OUT)
+	sw X3, 12(OUT)
+	sw X12, 16(OUT)
+	sw X13, 20(OUT)
+	sw X14, 24(OUT)
+	sw X15, 28(OUT)
+
+	/* Restore used registers */
+	lw $s0, 0($sp)
+	lw $s1, 4($sp)
+	lw $s2, 8($sp)
+	lw $s3, 12($sp)
+	lw $s4, 16($sp)
+	lw $s5, 20($sp)
+	lw $s6, 24($sp)
+	lw $s7, 28($sp)
+
+	addiu $sp, STACK_SIZE
+	jr $ra
+.end hchacha_block
+.set at

[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=wireguard-crypto-library-api-v3&id=cc74a037f8152d52bd17feaf8d9142b61761484f
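
In case it helps with the review, here is a minimal C sketch of what
hchacha_block() is expected to compute, usable as a reference when
comparing outputs. It assumes the argument order (state, out, nrounds)
suggested by the register use in the assembly, and that nrounds is even
and positive, since the loop counts down by two. The names
hchacha_ref(), quarterround() and rotl32() are illustrative only and
are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter round on words a, b, c, d of the state x. */
static void quarterround(uint32_t x[16], int a, int b, int c, int d)
{
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
	x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
	x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}

/*
 * Hypothetical reference for hchacha_block(): run nrounds rounds on a
 * copy of the state and emit words 0..3 and 12..15. Unlike the block
 * function, there is no final addition of the input state.
 */
static void hchacha_ref(const uint32_t state[16], uint32_t out[8], int nrounds)
{
	uint32_t x[16];
	int i;

	/* Assumption: callers only pass even, positive round counts. */
	assert(nrounds > 0 && (nrounds & 1) == 0);

	for (i = 0; i < 16; i++)
		x[i] = state[i];

	for (i = 0; i < nrounds; i += 2) {
		/* Column round */
		quarterround(x, 0, 4, 8, 12);
		quarterround(x, 1, 5, 9, 13);
		quarterround(x, 2, 6, 10, 14);
		quarterround(x, 3, 7, 11, 15);
		/* Diagonal round */
		quarterround(x, 0, 5, 10, 15);
		quarterround(x, 1, 6, 11, 12);
		quarterround(x, 2, 7, 8, 13);
		quarterround(x, 3, 4, 9, 14);
	}

	for (i = 0; i < 4; i++) {
		out[i] = x[i];
		out[4 + i] = x[12 + i];
	}
}
```

Calling hchacha_ref(state, out, 12) and hchacha_ref(state, out, 20) on
the same input as the assembly routine should make any mismatch in the
round handling easy to spot.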