On Thu, Sep 26, 2019 at 2:40 AM Pascal Van Leeuwen <pvanleeuwen@xxxxxxxxxxxxxx> wrote: > > While I agree with the principle of first merging Wireguard without > hooking it up to the Crypto API and doing the latter in a later, > separate patch, I DONT'T agree with your bashing of the Crypto API > or HW crypto acceleration in general. I'm not bashing hardware crypto acceleration. But I *am* bashing bad interfaces to it. Honestly, you need to face a few facts, Pascal. Really. First off: availability. (a) hardware doesn't exist right now outside your lab This is a fact. (b) hardware simply WILL NOT exist in any huge number for years, possibly decades. If ever, This is just reality, Pascal. Even if you release your hardware tomorrow, where do you think it will exists? Laptops? PC's? Phones? No. No. And no. Phones are likely the strongest argument for power use, but phones won't really start using it until it's on the SoC, because while they care deeply about power, they care even more deeply about a lot of other things a whole lot more. Form factor, price, and customers that care. So phones aren't happening. Not for years, and not until it's a big deal and standard IP that everybody wants. Laptops and PC's? No. Look at all the crypto acceleration they have today. That was sarcasm, btw, just to make it clear. It's simply not a market. End result: even with hardware, the hardware will be very rare. Maybe routers that want to sell particular features in the near future. Again, this isn't theory. This is that pesky little thing called "reality". It sucks, I know. But even if you *IGNORE* the fact that hardware doesn't exist right now, and won't be widely available for years (or longer), there's another little fact that you are ignoring: The async crypto interfaces are badly designed. Full stop. Seriously. This isn't rocket science. This is very very basic Computer Science 101. Tell me, what's _the_ most important part about optimizing something? Yeah, it's "you optimize for the common case". But let's ignore that part, since I already said that hardware isn't the common case but I promised that I'd ignore that part. The _other_ most important part of optimization is that you optimize for the case that _matters_. And the async crypto case got that completely wrong, and the wireguard patch shows that issue very clearly. The fact is, even you admit that a few CPU cycles don't matter for the async case where you have hardware acceleration, because the real cost is going to be in the IO - whether it's DMA, cache coherent interconnects, or some day some SoC special bus. The CPU cycles just don't matter, they are entirely in the noise. What does that mean? Think about it. [ Time passes ] Ok, I'll tell you: it means that the interface shouldn't be designed for async hw-assisted crypto. The interface should be designed for the case that _does_ care about CPU cycles, and then the async hw-assisted crypto should be hidden by a conditional, and its (extra) data structures should be the ones that are behind some indirect pointers etc. Because, as we agreed, the async external hw case really doesn't care. It it has to traverse a pointer or two, and if it has to have a *SEPARATE* keystore that has longer lifetimes, then the async code can set that up on its own, but that should not affect the case that cares. Really, this is fundamental, and really, the crypto interfaces got this wrong. This is in fact _so_ fundamental that the only possible reason you can disagree is because you don't care about reality or fundamentals, and you only care about the small particular hardware niche you work in and nothing else. You really should think about this a bit. > However, if you're doing bulk crypto like network packet processing > (as Wireguard does!) or disk/filesystem encryption, then that cipher > allocation only happens once every blue moon and the overhead for > that is totally *irrelevant* as it is amortized over many hours or > days of runtime. This is not true. It's not true because the overhead happens ALL THE TIME. And in 99.9% of all cases there are no upsides from the hw crypto, because the hardware DOES NOT EXIST. You think the "common case" is that hardware encryption case, but see above. It's really not. And when you _do_ have HW encryption, you could do the indirection. But that's not an argument for always having the indirection. What indirection am I talking about? There's multiple levels of indirection in the existing bad crypto interfaces: (a) the data structures themselves. This is things like keys and state storage, both of which are (often) small enough that they would be better off on a stack, or embedded in the data structures of the callers. (b) the calling conventions. This is things like indirection - usually several levels - just to find the function pointer to call to, and then an indirect call to that function pointer. and both of those are actively bad things when you don't have those hardware accelerators. When you *do* have them, you don't care. Everybody agrees about that. But you're ignoring the big white elephant in the room, which is that the hw really is very rare in the end, even if it were to exist at all. > While I generally dislike this whole hype of storing stuff in > textual formats like XML and JSON and then wasting lots of CPU > cycles on parsing that, I've learned to appreciate the power of > these textual Crypto API templates, as they allow a hardware > accelerator to advertise complex combined operations as single > atomic calls, amortizing the communication overhead between SW > and HW. It's actually very flexible and powerful! BUT IT IS FUNDAMENTALLY BADLY DESIGNED! Really. You can get the advantages of hw-accelerated crypto with good designs. The current crypto layer is not that. The current crypto layer simply has indirection at the wrong level. Here's how it should have been done: - make the interfaces be direct calls to the crypto you know you want. - make the basic key and state buffer be explicit and let people allocate them on their own stacks or whatever "But what about hw acceleration?" - add a single indirect private pointer that the direct calls can use for their own state IF THEY HAVE REASON TO - make the direct crypto calls just have a couple of conditionals inside of them Why? Direct calls with a couple of conditionals are really cheap for the non-accelerated case. MUCH cheaper than the indirection overhead (both on a state level and on a "crypto description" level) that the current code has. Seriously. The hw accelerated crypto won't care about the "call a static routine" part. The hw accelerated crypto won't care about the "I need to allocate a copy of the key because I can't have it on stack, and need to have it in a DMA'able region". The hw accelerated crypto won't care about the two extra instructions that do "do I have any extra state at all, or should I just do the synchronous CPU version" before it gets called through some indirect pointer. So there is absolutely NO DOWNSIDE for hw accelerated crypto to just do it right, and use an interface like this: if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0, PACKET_CB(skb)->nonce, key->key, simd_context)) return false; because for the hw accelerated case the costs are all elsewhere. But for synchronous encryption code on the CPU? Avoiding the indirection can be a huge win. Avoiding allocations, extra cachelines, all that overhead. Avoiding indirect calls entirely, because doing a few conditional branches (that will predict perfectly) on the state will be a lot more efficient, both in direct CPU cycles and in things like I$ etc. In contrast, forcing people to use this model: if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) { req = aead_request_alloc(key->tfm, GFP_ATOMIC); if (!req) return false; } else { req = &stackreq; aead_request_set_tfm(req, key->tfm); } aead_request_set_ad(req, 0); aead_request_set_callback(req, 0, NULL, NULL); aead_request_set_crypt(req, sg, sg, skb->len, (u8 *)&PACKET_CB(skb)->ivpad); err = crypto_aead_decrypt(req); if (unlikely(req != &stackreq)) aead_request_free(req); if (err) return false; isn't going to help anybody. It sure as hell doesn't help the CPU-synchronous case, and see above: it doesn't even help the hw accelerated case. It could have had _all_ that "tfm" work behind a private pointer that the CPU case never touches except to see "ok, it's NULL, I don't have any". See? The interface *should* be that chacha20poly1305_decrypt_sg() library interface, just give it a private pointer it can use and update. Then, *internally* if can do something like bool chacha20poly1305_decrypt_sg(...) { struct cc20p1305_ptr *state = *state_p; if (state) { .. do basically the above complex thing .. return ret; .. or fall back to sw if the hw queues are full.. } .. do the CPU only thing.. } and now you'd have no extra obverhead for the no-hw-accel case, you'd have a more pleasant interface to use, and you could still use hw accel if you wanted to. THIS is why I say that the crypto interface is bad. It was designed for the wrong objectives. It was designed purely for a SSL-like model where you do a complex key and algorithm exchange dance, and you simply don't know ahead of time what crypto you are even using. And yes, that "I'm doing the SSL thing" used to be a major use of encryption. I understand why it happened. It was what people did in the 90's. People thought it was a good idea back then, and it was also most of the hw acceleration world. And yes, in that model of "I don't have a clue of what crypto I'm even using" the model works fine. But honestly, if you can't admit to yourself that it's wrong for the case where you _do_ know the algorithm, you have some serious blinders on. Just from a user standpoint, do you seriously think users _like_ having to do the above 15+ lines of code, vs the single function call? The crypto interface really isn't pleasant, and you're wrong to believe that it really helps. The hw acceleration capability could have been abstracted away, instead of making that indirection be front and center. And again - I do realize the historical reasons for it. But understanding that doesn't magically make it wonderful. Linus