> > Replace the chacha20poly1305() library calls with invocations of the
> > RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.
>
> Honestly, the other patches look fine to me from what I've seen (with
> the small note I had in a separate email for 11/18), but this one I
> consider just nasty, and a prime example of why people hate those
> crypto lookup routines.
>
> Some of it is just the fundamental and pointless silly indirection,
> that just makes things harder to read, less efficient, and less
> straightforward.
>
> That's exemplified by this part of the patch:
>
> >  struct noise_symmetric_key {
> > -        u8 key[NOISE_SYMMETRIC_KEY_LEN];
> > +        struct crypto_aead *tfm;
>
> which is just one of those "we know what we want and we just want to
> use it directly" things, and then the crypto indirection comes along
> and makes that simple inline allocation of a small constant size
> (afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
> allocation entirely.
>
> And it's some random odd non-typed thing too, so then you have that
> silly and stupid dynamic allocation using a name lookup:
>
>     crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);
>
> to create what used to be (and should be) a simple allocation that
> has a static type and was just part of the code.
>
While I agree with the principle of first merging Wireguard without
hooking it up to the Crypto API and doing the latter in a later,
separate patch, I DON'T agree with your bashing of the Crypto API or
HW crypto acceleration in general.

Yes, I do agree that if you need to do the occasional single crypto
operation for a fixed algorithm on a small amount of data, then you
should just use a simple direct library call. I'm all for a Zinc-type
library for that. (And I believe Ard is actually actively making such
changes already.)

However, if you're doing bulk crypto like network packet processing
(as Wireguard does!) or disk/filesystem encryption, then that cipher
allocation only happens once in a blue moon, and its overhead is
totally *irrelevant*, as it is amortized over many hours or days of
runtime.

While I generally dislike this whole hype of storing stuff in textual
formats like XML and JSON and then wasting lots of CPU cycles on
parsing that, I've learned to appreciate the power of these textual
Crypto API templates, as they allow a hardware accelerator to
advertise complex combined operations as single atomic calls,
amortizing the communication overhead between SW and HW. It's
actually very flexible and powerful!

> It also ends up doing other bad things, ie that packet-time
>
> +        if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
> +                req = aead_request_alloc(key->tfm, GFP_ATOMIC);
> +                if (!req)
> +                        return false;
>
> thing that hopefully _is_ unlikely, but that's just more potential
> breakage from that whole async crypto interface.
>
> This is what people do *not* want to do, and why people often don't
> like the crypto interfaces.
>
Life is all about needing to do things you don't like to do ...
If you want the performance, you need to put in the effort. It's that
simple. HW acceleration surely won't work from a naive synchronous
interface. (The same applies to running crypto in a separate thread
on the CPU, BTW!)

In any case, Wireguard bulk crypto *should* eventually run on top of
the Crypto API such that it can leverage *existing* HW acceleration.
It would be incredibly silly not to do so, given that the HW exists!
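(For anyone skimming the thread, here is a minimal sketch of the two
approaches being argued about. The wg_example_* names are mine and
purely illustrative - this is not the actual WireGuard or RFC patch
code; buffer and nonce handling are simplified, and the Crypto API
path simply sleeps waiting for completion, which a real packet path
cannot do.)

#include <linux/err.h>
#include <linux/scatterlist.h>
#include <crypto/aead.h>
#include <crypto/chacha20poly1305.h>

/* Library style: the key is just 32 bytes living inside the caller's
 * struct; one synchronous call, no allocation, no name lookup.
 * 'buf' must have room for the 16-byte Poly1305 tag after 'len'. */
static void wg_example_seal_library(u8 *buf, size_t len, u64 nonce,
                                    const u8 key[CHACHA20POLY1305_KEY_SIZE])
{
        chacha20poly1305_encrypt(buf, buf, len, NULL, 0, nonce, key);
}

/* Crypto API style, step 1: the key becomes a dynamically allocated,
 * name-looked-up transform object. */
static struct crypto_aead *
wg_example_alloc_tfm(const u8 key[CHACHA20POLY1305_KEY_SIZE])
{
        struct crypto_aead *tfm;

        tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0,
                                CRYPTO_ALG_ASYNC);
        if (IS_ERR(tfm))
                return tfm;

        if (crypto_aead_setkey(tfm, key, CHACHA20POLY1305_KEY_SIZE)) {
                crypto_free_aead(tfm);
                return ERR_PTR(-EINVAL);
        }
        return tfm;
}

/* Crypto API style, step 2: every packet goes through a request
 * object, scatterlists and (potentially) an asynchronous completion.
 * For simplicity this sketch waits for the result, which a real
 * packet path cannot do. 'iv' is the 12-byte RFC 7539 nonce. */
static int wg_example_seal_aead(struct crypto_aead *tfm, u8 *buf,
                                unsigned int len, u8 *iv)
{
        DECLARE_CRYPTO_WAIT(wait);
        struct aead_request *req;
        struct scatterlist sg;
        int ret;

        req = aead_request_alloc(tfm, GFP_KERNEL);
        if (!req)
                return -ENOMEM;

        sg_init_one(&sg, buf, len + crypto_aead_authsize(tfm));
        aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
                                       CRYPTO_TFM_REQ_MAY_SLEEP,
                                  crypto_req_done, &wait);
        aead_request_set_ad(req, 0);
        aead_request_set_crypt(req, &sg, &sg, len, iv);

        ret = crypto_wait_req(crypto_aead_encrypt(req), &wait);
        aead_request_free(req);
        return ret;
}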
> And I'm still not convinced (a) ever makes sense - the overhead of any
> accelerator is just high enough that I doubt you'll have numbers -
> performance _or_ power.
>
You shouldn't make such assertions if you obviously don't know what
you're talking about.

Yes, there is significant overhead on the CPU for doing lookaside
crypto, but it's (usually) nothing compared to doing the actual crypto
itself on the CPU, barring a few exceptions. (Notably AES-GCM or
AES-CTR on ARM64 or x64 CPUs, and *maybe* ChaCha-Poly on recent Intel
CPUs - but there's a *lot* more crypto being used out there than just
AES-GCM and ChaCha-Poly, not to mention a lot of less capable
(embedded) CPUs running Linux.)

For anything but those exceptions, we blow even the fastest Intel
server CPUs out of the water with our crypto accelerators. (I can bore
you with some figures actually measured with the Crypto API on our HW,
once I'm done optimizing the driver and I have some time to collect
the results.) And in any case, for somewhat larger blocks/packets, the
CPU overhead of driving the accelerator is less than what the CPU
would spend doing the crypto itself - even where the CPU is faster at
the raw crypto - so there is room left to do *other*, presumably more
useful, work.

Then there's indeed the power consumption issue, which is complex
because crypto power != total system power, so it depends too much on
the actual use case to make generic statements about it. So I'll leave
that with the remark that Intel server CPUs have to seriously throttle
down their clock if you start using AVX512 for crypto, just to stay
within their power budget, while we can do the same performance
(~200 Gbps) in just a few (~2) watts on a similar technology node.
(Excluding the CPU management overhead, but that surely won't consume
excessive power like AVX512 does.)

> But even if you're right that it might be a power advantage on some
> platform, that wouldn't make it an advantage on other platforms. Maybe
> it could be done as a config option where you can opt in to the async
> interface when that makes sense - but not force the indirection and
> extra allocations when it doesn't. As a separate patch, something like
> that doesn't sound horrendous (and I think that's also an argument for
> doing that CPU->LE change as an independent change).
>
Making it a switch sounds good to me, though.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
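P.S. Purely as an illustration of what such a switch could look like:
the config symbol and the wg_example_* helpers below are hypothetical
(the helpers are the ones from the sketch earlier in this mail), not
anything that exists in WireGuard or the RFC series.

#include <asm/unaligned.h>
#include <crypto/aead.h>
#include <crypto/chacha20poly1305.h>

/* Hypothetical helpers, as sketched earlier in this mail. */
int wg_example_seal_aead(struct crypto_aead *tfm, u8 *buf,
                         unsigned int len, u8 *iv);
void wg_example_seal_library(u8 *buf, size_t len, u64 nonce,
                             const u8 key[CHACHA20POLY1305_KEY_SIZE]);

/* Hypothetical key container holding whichever representation the
 * selected path needs. */
struct wg_example_key {
        struct crypto_aead *tfm;                /* Crypto API path */
        u8 raw[CHACHA20POLY1305_KEY_SIZE];      /* library path    */
};

/* Compile-time switch: default to the direct library call, opt in to
 * the Crypto API path (and thereby HW offload) only when the
 * hypothetical CONFIG_WG_EXAMPLE_ASYNC_CRYPTO option is selected. */
static int wg_example_seal(struct wg_example_key *key, u8 *buf,
                           unsigned int len, u64 nonce)
{
        if (IS_ENABLED(CONFIG_WG_EXAMPLE_ASYNC_CRYPTO)) {
                /* 96-bit RFC 7539 nonce as the chacha20poly1305
                 * library builds it: 4 zero bytes followed by the
                 * 64-bit counter in little-endian byte order. */
                u8 iv[12] = {};

                put_unaligned_le64(nonce, iv + 4);
                return wg_example_seal_aead(key->tfm, buf, len, iv);
        }

        wg_example_seal_library(buf, len, nonce, key->raw);
        return 0;
}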