I've made enough mistakes in this series, I'll just start over. It's been hard for me not to be able to run test this in master, and have to go back and forth between it and 4.19; that's why I have messed up so many times. I apologize for the noise again. If you've read the cover letter from v1 and v2, there's not anything too relevant that I'm changing here. --- I've finally managed to get gcm(aes) working with the qce crypto engine. These first patch fixes a bug where the gcm authentication tag was being overwritten during gcm decryption, because it was passed in the same sgl buffer as the crypto payload. The qce driver appends some private state buffer to the request destination sgl, but it was not checking the length of the sgl being passed. The second patch works around a problem, which I frankly can't pinpoint what exactly is the cause, but after some help from Ard Biesheuvel, I think it is related to DMA. When gcm sends a request in crypto_gcm_setkey, it stores the hash (the crypto payload) and the iv in the same data struct. When the driver updates the IV, then the payload gets overwritten with the unencrypted data, or all zeroes, it may be a coincidence. However, it works if I pass the request down to the fallback driver--it is used by the driver to accept 192-bit-key requests. All I had to do was setup the fallback regardless of key size, and then check the payload length along with the keysize to pass the request to the fallback. This turns out to enhance performance, because of the avoided latency that comes with using the hardware. I've started with checking for a single 16-byte AES block, and that is enough to make gcm work. Next thing I've done was to tune the request size for performance. What got me started into looking at the qce driver was reports of it being detrimental to VPN speed, by the way. I've tested this win an Asus RT-AC58U, but the slow VPN reports[1] have more devices affected. Access to the device was kindly provided by @simsasss. I've added a 768-byte block size to tcrypt to get some measurements to come up with an optimal threshold to transition from software to hardware, and encountered another bug in the qce driver: it apparently cannot handle aes-xts requests that are greater than 512 bytes, but not a multiple of it. It failed with 768, 1280; XTS is usually used with a 512-byte sector (or a multiple of it), so I'm concluding that is the cause of failure. With that fixed, I added a module parameter to set the maximum request size that will be handled by the software fallback cipher and made some speed measurements using tcrypt to come up with an optimum value. I've documented this briefly in the parameter description, pointing out that gcm will not work if you set it to 0, and in better detail in the Kconfig help. TLDR: In the worst (where the hardware is slowest) case, hardware and software speed match at around 768 bytes, but I lowered the threshold to 512 to benefit the CPU offload. Here's a sample comparing three runs, using the proposed driver, varying the aes_sw_max_len parameter: 1st run will always use fallback, second run will use the default fallback for len <= 512, and third run will never use the fallback. testing speed of async cbc(aes) (cbc-aes-qce) encryption ------------------ ---------- ---------- ---------- aes_sw_max_len 32,768 512 0 ------------------ ---------- ---------- ---------- 128 bit 16 bytes 8,081,136 5,614,448 430,416 128 bit 64 bytes 13,152,768 13,205,952 1,745,088 128 bit 256 bytes 16,094,464 16,101,120 6,969,600 128 bit 512 bytes 16,701,440 16,705,024 12,866,048 128 bit 768 bytes 16,883,712 13,192,704 15,186,432 128 bit 1024 bytes 17,036,288 17,149,952 19,716,096 128 bit 2048 bytes 17,108,992 30,842,880 32,868,352 128 bit 4096 bytes 17,203,200 44,929,024 49,655,808 128 bit 8192 bytes 17,219,584 58,966,016 74,186,752 256 bit 16 bytes 6,962,432 1,943,616 419,088 256 bit 64 bytes 10,485,568 10,421,952 1,681,536 256 bit 256 bytes 12,211,712 12,160,000 6,701,312 256 bit 512 bytes 12,499,456 12,584,448 9,882,112 256 bit 768 bytes 12,622,080 12,550,656 14,701,824 256 bit 1024 bytes 12,750,848 16,079,872 19,585,024 256 bit 2048 bytes 12,812,288 28,293,120 27,693,056 256 bit 4096 bytes 12,939,264 34,234,368 44,142,592 256 bit 8192 bytes 12,845,056 50,274,304 63,520,768 The numbers vary from run to run, sometimes greatly. I've tried running the same tests with the arm-neon drivers, but the results don't change with any cipher mode, so I'm assuming the fallback is always aes-generic. I've made the measurements using an Asus RT-AC58U only, so I don't know how other hardware performs, but the user can always override the parameter, or even its default value. [1] https://forum.openwrt.org/t/ipsec-performance-issue/39690 Eneas U de Queiroz (3): crypto: qce - use cryptlen when adding extra sgl crypto: qce - use AES fallback for small requests crypto: qce - handle AES-XTS cases that qce fails drivers/crypto/Kconfig | 23 +++++++++++++++++++++++ drivers/crypto/qce/common.c | 2 -- drivers/crypto/qce/common.h | 3 +++ drivers/crypto/qce/dma.c | 11 ++++++----- drivers/crypto/qce/dma.h | 2 +- drivers/crypto/qce/skcipher.c | 30 ++++++++++++++++++++---------- 6 files changed, 53 insertions(+), 18 deletions(-)