Hi Ard and Jason, Quoting Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>:
On Fri, 4 Oct 2019 at 17:15, René van Dorst <opensource@xxxxxxxxxx> wrote:Hi Jason, Quoting "Jason A. Donenfeld" <Jason@xxxxxxxxx>: > On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel > <ard.biesheuvel@xxxxxxxxxx> wrote: >> The round count is passed via the fifth function parameter, so it is >> already on the stack. Reloading it for every block doesn't sound like >> a huge deal to me. > > Please benchmark it to indicate that, if it really isn't a big deal. I > recall finding that memory accesses on common mips32r2 commodity > router hardware was extremely inefficient. The whole thing is designed > to minimize memory accesses, which are the primary bottleneck on that > platform. I also think it isn't a big deal, but I shall benchmark it this weekend. If I am correct a memory write will first put in cache. So if you read it again and it is in cache it is very fast. 1 or 2 clockcycles. Also the value isn't used directly after it is read. So cpu don't have to stall on this read.Thanks René. Note that the round count is not being spilled. I [re]load it from the stack as a function parameter. So instead of li $at, 20 I do lw $at, 16($sp) Thanks a lot for taking the time to double check this. I think it would be nice to be able to expose xchacha12 like we do on other architectures.
I dust off my old benchmark code and put it on top of latest WireGuard source [0]. It benchmarks the chacha20poly1305_{de,en}crypt functions with different data block sizes (x bytes).It runs two tests, first one is see how many runs we get in 1 second results in
MB/Sec and other one measures the used cpu cycles per loop. The test is preformed on a Mediatek MT7621A SoC running at 880MHz. Baseline [1]: root@OpenWrt:~# insmod wg-speed-baseline.ko [ 2029.866393] wireguard: chacha20 self-tests: pass [ 2029.894301] wireguard: poly1305 self-tests: pass [ 2029.906428] wireguard: chacha20poly1305 self-tests: pass[ 2030.121001] wireguard: chacha20poly1305_encrypt: 1 bytes, 0.253 MB/sec, 1598 cycles [ 2030.340786] wireguard: chacha20poly1305_encrypt: 16 bytes, 4.178 MB/sec, 1554 cycles [ 2030.561434] wireguard: chacha20poly1305_encrypt: 64 bytes, 15.392 MB/sec, 1692 cycles [ 2030.784635] wireguard: chacha20poly1305_encrypt: 128 bytes, 22.106 MB/sec, 2381 cycles [ 2031.081534] wireguard: chacha20poly1305_encrypt: 1420 bytes, 35.480 MB/sec, 16751 cycles [ 2031.371369] wireguard: chacha20poly1305_encrypt: 1440 bytes, 36.117 MB/sec, 16712 cycles [ 2031.589621] wireguard: chacha20poly1305_decrypt: 1 bytes, 0.246 MB/sec, 1648 cycles [ 2031.809392] wireguard: chacha20poly1305_decrypt: 16 bytes, 4.064 MB/sec, 1598 cycles [ 2032.030034] wireguard: chacha20poly1305_decrypt: 64 bytes, 14.990 MB/sec, 1738 cycles [ 2032.253245] wireguard: chacha20poly1305_decrypt: 128 bytes, 21.679 MB/sec, 2428 cycles [ 2032.540150] wireguard: chacha20poly1305_decrypt: 1420 bytes, 35.480 MB/sec, 16793 cycles [ 2032.829954] wireguard: chacha20poly1305_decrypt: 1440 bytes, 35.979 MB/sec, 16756 cycles
[ 2032.850563] wireguard: blake2s self-tests: pass [ 2033.073767] wireguard: curve25519 self-tests: pass [ 2033.083600] wireguard: allowedips self-tests: pass [ 2033.097982] wireguard: nonce counter self-tests: pass [ 2033.535726] wireguard: ratelimiter self-tests: pass[ 2033.545615] wireguard: WireGuard 0.0.20190913-4-g5cca99692496 loaded. See www.wireguard.com for information. [ 2033.565197] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
Modified chacha20-mips.S [2]: root@OpenWrt:~# rmmod wireguard.ko root@OpenWrt:~# insmod wg-speed-nround-stack.ko [ 2045.129910] wireguard: chacha20 self-tests: pass [ 2045.157824] wireguard: poly1305 self-tests: pass [ 2045.169962] wireguard: chacha20poly1305 self-tests: pass[ 2045.381034] wireguard: chacha20poly1305_encrypt: 1 bytes, 0.251 MB/sec, 1607 cycles [ 2045.600801] wireguard: chacha20poly1305_encrypt: 16 bytes, 4.174 MB/sec, 1555 cycles [ 2045.821437] wireguard: chacha20poly1305_encrypt: 64 bytes, 15.392 MB/sec, 1691 cycles [ 2046.044650] wireguard: chacha20poly1305_encrypt: 128 bytes, 22.082 MB/sec, 2379 cycles [ 2046.341509] wireguard: chacha20poly1305_encrypt: 1420 bytes, 35.615 MB/sec, 16739 cycles [ 2046.631333] wireguard: chacha20poly1305_encrypt: 1440 bytes, 36.117 MB/sec, 16705 cycles [ 2046.849614] wireguard: chacha20poly1305_decrypt: 1 bytes, 0.246 MB/sec, 1647 cycles [ 2047.069403] wireguard: chacha20poly1305_decrypt: 16 bytes, 4.056 MB/sec, 1600 cycles [ 2047.290036] wireguard: chacha20poly1305_decrypt: 64 bytes, 15.001 MB/sec, 1736 cycles [ 2047.513253] wireguard: chacha20poly1305_decrypt: 128 bytes, 21.666 MB/sec, 2429 cycles [ 2047.800102] wireguard: chacha20poly1305_decrypt: 1420 bytes, 35.480 MB/sec, 16785 cycles [ 2048.089967] wireguard: chacha20poly1305_decrypt: 1440 bytes, 35.979 MB/sec, 16759 cycles
[ 2048.110580] wireguard: blake2s self-tests: pass [ 2048.333719] wireguard: curve25519 self-tests: pass [ 2048.343547] wireguard: allowedips self-tests: pass [ 2048.357926] wireguard: nonce counter self-tests: pass [ 2048.785837] wireguard: ratelimiter self-tests: pass[ 2048.795781] wireguard: WireGuard 0.0.20190913-5-gee7c7eec8deb loaded. See www.wireguard.com for information. [ 2048.815389] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
I don't see the extra store/load on the stack back in the results. So I think that this test proves enough that the extra nround on the stack is not a problem. Ard, I shall take a look on your hchacha code later this weekend. Greats, René [0]: https://github.com/vDorst/wireguard/commits/mips-bench[1]: https://github.com/vDorst/wireguard/commit/5cca9969249632820cb96548813a65d1f297aa8c [2]: https://github.com/vDorst/wireguard/commit/ee7c7eec8deb3d5d5dae2eec0be0aafca3fddbc2
Note that for xchacha, I also added a hchacha_block() routine based on your code (with the round count as the third argument) [0]. Please let me know if you see anything wrong with that. +.globl hchacha_block +.ent hchacha_block +hchacha_block: + .frame $sp, STACK_SIZE, $ra + + addiu $sp, -STACK_SIZE + + /* Save s0-s7 */ + sw $s0, 0($sp) + sw $s1, 4($sp) + sw $s2, 8($sp) + sw $s3, 12($sp) + sw $s4, 16($sp) + sw $s5, 20($sp) + sw $s6, 24($sp) + sw $s7, 28($sp) + + lw X0, 0(STATE) + lw X1, 4(STATE) + lw X2, 8(STATE) + lw X3, 12(STATE) + lw X4, 16(STATE) + lw X5, 20(STATE) + lw X6, 24(STATE) + lw X7, 28(STATE) + lw X8, 32(STATE) + lw X9, 36(STATE) + lw X10, 40(STATE) + lw X11, 44(STATE) + lw X12, 48(STATE) + lw X13, 52(STATE) + lw X14, 56(STATE) + lw X15, 60(STATE) + +.Loop_hchacha_xor_rounds: + addiu $a2, -2 + AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 16); + AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 12); + AXR( 0, 1, 2, 3, 4, 5, 6, 7, 12,13,14,15, 8); + AXR( 8, 9,10,11, 12,13,14,15, 4, 5, 6, 7, 7); + AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 16); + AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 12); + AXR( 0, 1, 2, 3, 5, 6, 7, 4, 15,12,13,14, 8); + AXR(10,11, 8, 9, 15,12,13,14, 5, 6, 7, 4, 7); + bnez $a2, .Loop_hchacha_xor_rounds + + sw X0, 0(OUT) + sw X1, 4(OUT) + sw X2, 8(OUT) + sw X3, 12(OUT) + sw X12, 16(OUT) + sw X13, 20(OUT) + sw X14, 24(OUT) + sw X15, 28(OUT) + + /* Restore used registers */ + lw $s0, 0($sp) + lw $s1, 4($sp) + lw $s2, 8($sp) + lw $s3, 12($sp) + lw $s4, 16($sp) + lw $s5, 20($sp) + lw $s6, 24($sp) + lw $s7, 28($sp) + + addiu $sp, STACK_SIZE + jr $ra +.end hchacha_block +.set at[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=wireguard-crypto-library-api-v3&id=cc74a037f8152d52bd17feaf8d9142b61761484f