2020-08-10, 12:09:40 -0400, Scott Dial wrote: > On 8/10/2020 9:34 AM, Sabrina Dubroca wrote: > > [adding the linux-crypto list] > > > > 2020-08-06, 23:48:16 -0400, Scott Dial wrote: > >> On 8/6/2020 5:11 PM, Ryan Cox wrote: > >>> With 5.7 I get: > >>> * 9.90 Gb/s with no macsec at all > >>> * 1.80 Gb/s with macsec WITHOUT encryption > >>> * 1.00 Gb/s (sometimes, but often less) with macsec WITH encryption > >>> > >>> With 5.7 but with ab046a5d4be4c90a3952a0eae75617b49c0cb01b reverted, I get: > >>> * 9.90 Gb/s with no macsec at all > >>> * 7.33 Gb/s with macsec WITHOUT encryption > >>> * 9.83 Gb/s with macsec WITH encryption > >>> > >>> On tests where performance is bad (including macsec without encryption), > >>> iperf3 is at 100% CPU usage. I was able to run it under `perf record`on > >>> iperf3 in a number of the tests but, unfortunately, I have had trouble > >>> compiling perf for my own 5.7 compilations (definitely PEBKAC). If it > >>> would be useful I can work on fixing the perf compilation issues. > >> > >> For certain, you are measuring the difference between AES-NI doing > >> gcm(aes) and gcm_base(ctr(aes-aesni),ghash-generic). Specifically, the > >> hotspot is ghash-generic's implementation of ghash_update() function. > >> I appreciate your testing because I was limited in my ability to test > >> beyond 1Gb/s. > >> > >> The aes-aesni driver is smart enough to use the FPU if it's not busy and > >> fallback to the CPU otherwise. Unfortunately, the ghash-clmulni driver > >> does not have that kind of logic in it and only provides an async version, > >> so we are forced to use the ghash-generic implementation, which is a pure > >> CPU implementation. The ideal would be for aesni_intel to provide a > >> synchronous version of gcm(aes) that fell back to the CPU if the FPU is > >> busy. > >> I don't know if the crypto maintainers would be open to such a change, but > >> if the choice was between reverting and patching the crypto code, then I > >> would work on patching the crypto code. > > > > To the crypto folks, a bit of context: Scott wrote commit ab046a5d4be4 > > ("net: macsec: preserve ingress frame ordering"), which made MACsec > > use gcm(aes) with CRYPTO_ALG_ASYNC. This prevents out of order > > decryption, but reduces performance. We'd like to restore performance > > on systems where the FPU is available without breaking MACsec for > > systems where the FPU is often busy. > > > > A quick and dirty alternative might be to let the administrator decide > > if they're ok with some out of order. Maybe they know that their FPU > > will be mostly idle so it won't even be an issue (or maybe the > > opposite, ie keep the fast default and let admins fix their setups > > with an extra flag). > > I can appreciate favoring performance over correctness as practical > concern, but I'd suggest that the out-of-order decryption *is* a > performance concern as well. We can debate realness of my workload, but > even in Ryan's tests on an otherwise idle server, he showed 0.07% of the > frames needed to be dispatched to cryptd, and that for whatever reason > it's more often with encryption disabled, which correlates to his > decrease in throughput (9.83 Gb/s to 7.33 Gb/s, and 9.19 Gb/s to 6.00 > Gb/s), perhaps causing exponential backoff from TCP retries. I can > resurrect my test setup, but my numbers were worse than Ryan's. > > In any case, I counted 18 implementations of HW accelerated gcm(aes) in > the kernel, with 3 of those implementations are in arch (x86, arm64, and > s390) and the rest are crypto device drivers. Of all those > implementations, the AES-NI implementation is the only one that > dispatches to cryptd (via code in cypto/simd.c). AFAICT, every other > implementation of gcm(aes) is synchronous, but they would require closer > inspection to be certain. I randomly picked 2 of them (chcr and inside-secure), and they both set CRYPTO_ALG_ASYNC, so I guess not. > So, I'd like to focus on what we can do to > improve crypto/simd.c to provide a synchronous implementation of > gcm(aes) for AES-NI when possible, which is the vast majority of the time. > > I would be interested in proposing a change to improve this issue, but > I'm not sure the direction that the maintainers of this code would > prefer. Since these changes to the crypto API are fairly recent, there > may be context that I am not aware of. However, I think it would be > straight-forward to add another API to crypto/simd.c that allocated sync > algorithms, and I would be willing to do the work. > > The only challenge I see in implementing such a change is deciding how > to select a fallback algorithm. The most flexible solution would be to > call crypto_alloc_aead with CRYPTO_ALG_ASYNC during the init to pick the > "best" fallback (in case there is alternative HW offloading available), > but that would almost certainly pick itself and it's not obvious to me > how to avoid that. It's probably possible to add a PURE_SOFTWARE or whatever flag and request one of those algorithms for the fallback. > On the other hand, the caller to the new API could > explicitly declare a fallback algorithm (e.g., > "gcm_base(ctr(aes-aesni),ghash-generic)"), which probably is the correct > answer anyways -- I would try to avoid that, it seems too error-prone to me. > what are the chances that there is multiple HW > offloads for gcm(aes)? In that case, a possible API would be: > int simd_register_aeads_compat_sync(struct aead_alg *algs, > char **fallback_algs, > int count, > struct simd_aead_alg **simd_algs); > > Beyond MACsec, it's worth noting that the mac80211 code for AES-GCMP and > BIP-GMAC also use gcm(aes) in sync mode because decryption occurs in a > softirq, however I imagine nobody has reported an issue because the link > speed is typically slower and those encryption modes are still uncommon. Decent wireless cards would do the encryption in hw, no? Also, you can't notice a performance regression if it's never used the fast implementation :) -- Sabrina