On Fri, Feb 07, 2020 at 12:02:24PM -0300, Eneas U de Queiroz wrote: > I've made enough mistakes in this series, I'll just start over. It's > been hard for me not to be able to run test this in master, and have to > go back and forth between it and 4.19; that's why I have messed up so > many times. I apologize for the noise again. > > If you've read the cover letter from v1 and v2, there's not anything too > relevant that I'm changing here. > > --- > > I've finally managed to get gcm(aes) working with the qce crypto engine. > > These first patch fixes a bug where the gcm authentication tag was being > overwritten during gcm decryption, because it was passed in the same sgl > buffer as the crypto payload. The qce driver appends some private state > buffer to the request destination sgl, but it was not checking the > length of the sgl being passed. > > The second patch works around a problem, which I frankly can't pinpoint > what exactly is the cause, but after some help from Ard Biesheuvel, I > think it is related to DMA. When gcm sends a request in > crypto_gcm_setkey, it stores the hash (the crypto payload) and the iv in > the same data struct. When the driver updates the IV, then the payload > gets overwritten with the unencrypted data, or all zeroes, it may be a > coincidence. > > However, it works if I pass the request down to the fallback driver--it > is used by the driver to accept 192-bit-key requests. All I had to do > was setup the fallback regardless of key size, and then check the > payload length along with the keysize to pass the request to the > fallback. This turns out to enhance performance, because of the > avoided latency that comes with using the hardware. > > I've started with checking for a single 16-byte AES block, and that is > enough to make gcm work. Next thing I've done was to tune the request > size for performance. What got me started into looking at the qce > driver was reports of it being detrimental to VPN speed, by the way. > I've tested this win an Asus RT-AC58U, but the slow VPN reports[1] have > more devices affected. Access to the device was kindly provided by > @simsasss. > > I've added a 768-byte block size to tcrypt to get some measurements to > come up with an optimal threshold to transition from software to > hardware, and encountered another bug in the qce driver: it apparently > cannot handle aes-xts requests that are greater than 512 bytes, but not > a multiple of it. It failed with 768, 1280; XTS is usually used with a > 512-byte sector (or a multiple of it), so I'm concluding that is the > cause of failure. > > With that fixed, I added a module parameter to set the maximum request > size that will be handled by the software fallback cipher and made some > speed measurements using tcrypt to come up with an optimum value. > > I've documented this briefly in the parameter description, pointing out > that gcm will not work if you set it to 0, and in better detail in the > Kconfig help. > > TLDR: In the worst (where the hardware is slowest) case, hardware and > software speed match at around 768 bytes, but I lowered the threshold to > 512 to benefit the CPU offload. > > Here's a sample comparing three runs, using the proposed driver, varying > the aes_sw_max_len parameter: 1st run will always use fallback, second > run will use the default fallback for len <= 512, and third run will > never use the fallback. > > testing speed of async cbc(aes) (cbc-aes-qce) encryption > ------------------ ---------- ---------- ---------- > aes_sw_max_len 32,768 512 0 > ------------------ ---------- ---------- ---------- > 128 bit 16 bytes 8,081,136 5,614,448 430,416 > 128 bit 64 bytes 13,152,768 13,205,952 1,745,088 > 128 bit 256 bytes 16,094,464 16,101,120 6,969,600 > 128 bit 512 bytes 16,701,440 16,705,024 12,866,048 > 128 bit 768 bytes 16,883,712 13,192,704 15,186,432 > 128 bit 1024 bytes 17,036,288 17,149,952 19,716,096 > 128 bit 2048 bytes 17,108,992 30,842,880 32,868,352 > 128 bit 4096 bytes 17,203,200 44,929,024 49,655,808 > 128 bit 8192 bytes 17,219,584 58,966,016 74,186,752 > 256 bit 16 bytes 6,962,432 1,943,616 419,088 > 256 bit 64 bytes 10,485,568 10,421,952 1,681,536 > 256 bit 256 bytes 12,211,712 12,160,000 6,701,312 > 256 bit 512 bytes 12,499,456 12,584,448 9,882,112 > 256 bit 768 bytes 12,622,080 12,550,656 14,701,824 > 256 bit 1024 bytes 12,750,848 16,079,872 19,585,024 > 256 bit 2048 bytes 12,812,288 28,293,120 27,693,056 > 256 bit 4096 bytes 12,939,264 34,234,368 44,142,592 > 256 bit 8192 bytes 12,845,056 50,274,304 63,520,768 > > The numbers vary from run to run, sometimes greatly. > > I've tried running the same tests with the arm-neon drivers, but the > results don't change with any cipher mode, so I'm assuming the fallback > is always aes-generic. > > I've made the measurements using an Asus RT-AC58U only, so I don't know > how other hardware performs, but the user can always override the > parameter, or even its default value. > > [1] https://forum.openwrt.org/t/ipsec-performance-issue/39690 > > Eneas U de Queiroz (3): > crypto: qce - use cryptlen when adding extra sgl > crypto: qce - use AES fallback for small requests > crypto: qce - handle AES-XTS cases that qce fails > > drivers/crypto/Kconfig | 23 +++++++++++++++++++++++ > drivers/crypto/qce/common.c | 2 -- > drivers/crypto/qce/common.h | 3 +++ > drivers/crypto/qce/dma.c | 11 ++++++----- > drivers/crypto/qce/dma.h | 2 +- > drivers/crypto/qce/skcipher.c | 30 ++++++++++++++++++++---------- > 6 files changed, 53 insertions(+), 18 deletions(-) All applied. Thanks. -- Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt