I've finally managed to get gcm(aes) working with the qce crypto engine.

The first patch fixes a bug where the gcm authentication tag was being
overwritten during gcm decryption, because it was passed in the same sgl
buffer as the crypto payload. The qce driver appends a private state
buffer to the request destination sgl, but it was not checking the
length of the sgl being passed.

The second patch works around a problem whose exact cause I frankly
can't pinpoint, but after some help from Ard Biesheuvel, I think it is
related to DMA. When gcm sends a request in crypto_gcm_setkey, it stores
the hash (the crypto payload) and the iv in the same data struct. When
the driver updates the IV, the payload gets overwritten with the
unencrypted data, or with all zeroes; it may be a coincidence.

However, it works if I pass the request down to the fallback driver,
which the driver already uses to accept 192-bit-key requests. All I had
to do was set up the fallback regardless of key size, and then check the
payload length along with the key size to decide whether to pass the
request to the fallback.

This turns out to improve performance as well, because small requests
avoid the latency that comes with using the hardware. I started by
checking for a single 16-byte AES block, and that is enough to make gcm
work. The next thing I did was tune the request size for performance.
What got me started looking at the qce driver was reports of it being
detrimental to VPN speed, by the way.

I've tested this on an Asus RT-AC58U, but the slow VPN reports[1] have
more devices affected. Access to the device was kindly provided by
@simsasss.

I've used the openssl speed util to measure the speed, with an AF_ALG
engine I've written to make use of the kernel driver from userspace[2],
running on 4.19.78. I can't run this on a newer kernel yet.
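
To illustrate the fallback decision described above, here is a minimal
userspace sketch in C. It is not the code from the patch: the names
sketch_use_fallback, sketch_selfcheck and SKETCH_SW_MAX_LEN are
hypothetical, and the 512-byte limit is simply the value from the second
patch's title. It assumes the only key size routed to the fallback on
its own is 192 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 192-bit AES key size, in bytes. */
#define SKETCH_AES_KEYSIZE_192 24

/* Hypothetical name; the value 512 is taken from the patch title. */
#define SKETCH_SW_MAX_LEN 512

/*
 * Return true when a request should be handed to the software
 * fallback instead of the hardware engine:
 *  - 192-bit keys always go to the fallback, and
 *  - so does any payload of at most SKETCH_SW_MAX_LEN bytes.
 */
static bool sketch_use_fallback(size_t keylen, size_t cryptlen)
{
	if (keylen == SKETCH_AES_KEYSIZE_192)
		return true;
	return cryptlen <= SKETCH_SW_MAX_LEN;
}

/* A few sanity checks that double as a usage example. */
void sketch_selfcheck(void)
{
	assert(sketch_use_fallback(24, 4096));  /* 192-bit key: software */
	assert(sketch_use_fallback(16, 16));    /* single 16-byte AES block */
	assert(sketch_use_fallback(32, 512));   /* at the limit: software */
	assert(!sketch_use_fallback(32, 513));  /* above the limit: hardware */
	assert(!sketch_use_fallback(16, 1024)); /* large request: hardware */
}
```

With a check of this shape, the single 16-byte block that gcm sends
during setkey always lands on the software path, which is the case the
workaround needs to cover.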

TLDR: In the worst case (where the hardware is slowest), hardware and
software speed match at around 768 bytes, but I lowered the threshold to
512 to favor CPU offload.

Here's the script I've used:

  #!/bin/sh

  for len in 256 512 768 1024; do
    echo Block-size: ${len} bytes
    for key in 128 256; do
      for mode in cbc ctr ecb; do
        # Software run: unload the qce driver first.
        rmmod qcrypto
        openssl speed -elapsed -evp aes-${key}-${mode} -engine afalg \
          -bytes ${len} 2>&1 \
          | grep ^aes \
          | sed "s/aes-${key}-${mode} /aes-${key}-${mode} soft/"
        # Hardware run: load the qce driver again.
        insmod /tmp/qcrypto.ko
        openssl speed -elapsed -evp aes-${key}-${mode} -engine afalg \
          -bytes ${len} 2>&1 \
          | grep ^aes \
          | sed "s/aes-${key}-${mode} /aes-${key}-${mode} qce /"
      done
    done
  done

Here's a sample run. Numbers vary from run to run, sometimes greatly:

  ./test_speed.sh
  Block-size: 256 bytes
  aes-128-cbc soft   6808.92k
  aes-128-cbc qce    2704.10k
  aes-128-ctr soft   6785.63k
  aes-128-ctr qce    2675.07k
  aes-128-ecb soft   7596.86k
  aes-128-ecb qce    2772.16k
  aes-256-cbc soft   5970.02k
  aes-256-cbc qce    2678.84k
  aes-256-ctr soft   6164.46k
  aes-256-ctr qce    2634.15k
  aes-256-ecb soft   6529.03k
  aes-256-ecb qce    2720.88k
  Block-size: 512 bytes
  aes-128-cbc soft   9402.31k
  aes-128-cbc qce    5345.69k
  aes-128-ctr soft   9766.23k
  aes-128-ctr qce    5179.25k
  aes-128-ecb soft  10638.85k
  aes-128-ecb qce    5437.13k
  aes-256-cbc soft   7742.98k
  aes-256-cbc qce    5230.08k
  aes-256-ctr soft   8174.93k
  aes-256-ctr qce    5115.89k
  aes-256-ecb soft   8772.61k
  aes-256-ecb qce    7282.35k
  Block-size: 768 bytes
  aes-128-cbc soft  10466.38k
  aes-128-cbc qce    7814.59k
  aes-128-ctr soft  11161.69k
  aes-128-ctr qce    7639.93k
  aes-128-ecb soft  12122.37k
  aes-128-ecb qce   10764.84k
  aes-256-cbc soft   8725.50k
  aes-256-cbc qce    9184.41k
  aes-256-ctr soft   9233.15k
  aes-256-ctr qce    7392.32k
  aes-256-ecb soft  10039.30k
  aes-256-ecb qce    9148.45k
  Block-size: 1024 bytes
  aes-128-cbc soft  11418.80k
  aes-128-cbc qce   12314.37k
  aes-128-ctr soft  11940.86k
  aes-128-ctr qce   11982.51k
  aes-128-ecb soft  13350.23k
  aes-128-ecb qce   10375.28k
  aes-256-cbc soft   9003.32k
  aes-256-cbc qce   12017.66k
  aes-256-ctr soft   9898.89k
  aes-256-ctr qce    9672.18k
  aes-256-ecb soft  10679.74k
  aes-256-ecb qce   12314.37k

I imagine that if I were to run the benchmark within the kernel, the
resulting threshold would be even higher, since the context switches
add a pretty much fixed latency. Nonetheless, I think it's better to
let the engine run more, to offload the CPU.

Cheers,

Eneas

[1] https://forum.openwrt.org/t/ipsec-performance-issue/39690
[2] https://github.com/cotequeiroz/afalg_engine

Eneas U de Queiroz (2):
  crypto: qce - use cryptlen when adding extra sgl
  crypto: qce - use AES fallback when len <= 512

 drivers/crypto/qce/dma.c      | 11 ++++++-----
 drivers/crypto/qce/dma.h      |  2 +-
 drivers/crypto/qce/skcipher.c | 17 +++++++----------
 3 files changed, 14 insertions(+), 16 deletions(-)