On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen <pvanleeuwen@xxxxxxxxxxxxxx> wrote:
>
> That remark is just very stupid. The hardware ALREADY exists, and
> more hardware is in the pipeline. Once this stuff is designed in, it
> usually stays in for many years to come. And these are chips sold in
> _serious_ quantities, to be used in things like wireless routers and
> DSL, cable and FTTH modems, 5G base stations, etc. etc.

Yes, I very much mentioned routers. I believe those can happen much
more quickly.

But I would very much hope that that is not the only situation where
you'd see wireguard used. I'd want to see wireguard in an end-to-end
situation from the very client hardware. So laptops, phones,
desktops. Not the untrusted (to me) hw in between.

> No, these are just the routers going into *everyone's* home. And 5G
> basestations arriving at every other street corner. I wouldn't call
> that rare, exactly.

That's fine for a corporate tunnel between devices. Which is
certainly one use case for wireguard.

But if you want VPN for your own needs for security, you want it at
the _client_. Not at the router box. So that case really does matter.
And I really don't see the hardware happening in that space. So the
bad crypto interfaces only make the client _worse_.

See?

But on to the arguments that we actually agree on:

> Hey, no argument there. I don't see any good reason why the key can't
> be on the stack. I doubt any hardware would be able to DMA that as-is
> directly, and in any case, key changes should be infrequent, so copying
> it to some DMA buffer should not be a performance problem.
> So maybe that's an area for improvement: allow that to be on the stack.

It's not even just the stack. It's really that the crypto interfaces
are *designed* so that you have to allocate things separately, and
can't embed these things in your own data structures.

And they are that way, because the crypto interfaces aren't actually
about (just) hiding the hardware interface: they are about hiding
_all_ the encryption details.

There's no way to say "hey, I know the crypto I use, I know the key
size I have, I know the state size it needs, I can preallocate those
AS PART of my own data structures".

Because the interface is designed to be so "generic" that you simply
can't do those things, they are all external allocations, which is
inevitably slower when you don't have hardware.

And you've shown that you don't care about that "don't have hardware"
situation, and seem to think it's the only case that matters. That's
your job, after all.

But however much you try to claim otherwise, there's all these
situations where the hardware just isn't there, and the crypto
interface just forces nasty overhead for absolutely no good reason.

> I already explained the reasons for _not_ doing direct calls above.

And I've tried to explain how direct calls that do the synchronous
thing efficiently would be possible, but then _if_ there is hardware,
they can fall back to an async interface.

> > So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> > do it right, and use an interface like this:
> >
> >         if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
> >                                          PACKET_CB(skb)->nonce, key->key,
> >                                          simd_context))
> >                 return false;
> >
> Well, for one thing, a HW API should not expect the result to be
> available when the function call returns. (if that's what you
> mean here). That would just be WRONG.

Right.
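To be concrete about what that quoted interface implies - and this is
purely a sketch with invented names, not anything that exists in the
tree today - the caller knows the algorithm at compile time, so the
key can live right inside its own per-peer structure, and decryption
is one plain function call:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define XCHACHA_KEY_SIZE 32     /* sketch: one fixed 256-bit key size */

    /*
     * The caller knows the algorithm, so the key lives right inside
     * its own per-peer structure: no separately allocated "tfm", no
     * pointer chasing to find an implementation.
     */
    struct my_peer {
            uint8_t rx_key[XCHACHA_KEY_SIZE];
            /* ... whatever else the caller keeps per peer ... */
    };

    /*
     * Plain synchronous call: decrypt 'len' bytes in place, return
     * true if the authentication tag checked out.  Nothing to set up
     * before the call, nothing to tear down afterwards.
     */
    bool xchacha_decrypt_inplace(uint8_t *buf, size_t len,
                                 const uint8_t *ad, size_t ad_len,
                                 uint64_t nonce,
                                 const uint8_t key[XCHACHA_KEY_SIZE]);

    /* And the caller's rx path is just: */
    static bool rx_one_packet(struct my_peer *peer, uint64_t nonce,
                              uint8_t *pkt, size_t pkt_len)
    {
            return xchacha_decrypt_inplace(pkt, pkt_len, NULL, 0,
                                           nonce, peer->rx_key);
    }

No allocation at call time, no callback setup, no indirection to find
the implementation. And yes, hardware can't complete a call like that
inline, so an accelerator needs something more. That's fine.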
But that also shouldn't mean that when you have synchronous hardware
(ie CPU) you have to set everything up even though it will never be
used.

Put another way: even with hardware acceleration, the queuing
interface should be a simple "do this" interface. The current crypto
interface is basically something that requires all the setup
up-front, whether it's needed or not. And it forces those very
inconvenient and slow external allocations.

And I'm saying that causes problems, because it fundamentally means
that you can't do a good job for the common CPU case, because you're
paying all those costs even when you need absolutely none of them.
Both at setup time, but also at run-time due to the extra indirection
and cache misses etc.

> Again, HW acceleration does not depend on the indirection _at all_,
> that's there for entirely different purposes I explained above.
> HW acceleration _does_ depend greatly on a truly async ifc though.

Can you realize that the world isn't just all hw acceleration?

Can you admit that the current crypto interface is just horrid for
the non-accelerated case?

Can you perhaps then also think that "maybe there are better models"?

> So queue requests on one side, handle results from the other side
> in some callback func off of an interrupt handler.

Actually, what you can do - and what people *have* done - is to admit
that the synchronous case is real and important, and then design
interfaces that work for that one too.

You don't need to allocate resources ahead of time, and you don't
have to disallow just having the state buffer allocated by the
caller.

So here's the *wrong* way to do it (and the way that crypto does it):

 - dynamically allocate buffers at "init time"

 - fill in "callback fields" etc before starting the crypto, whether
   they are needed or not

 - call a "decrypt" function that then uses the indirect functions
   you set up at init time, and possibly waits for it (or calls the
   callbacks you set up)

Note how it's all this "state machine" model where you add data to
the state machine, and at some point you say "execute" and then
either you wait for things or you get callbacks.

That makes sense for a hw crypto engine. It's how a lot of them work,
after all. But it makes _zero_ sense for the synchronous case. You
did a lot of extra work for that case, and because it was all a state
machine, you did it particularly inefficiently: not only do you have
those separate allocations with pointer following, the "decrypt()"
call ends up doing an indirect call to the CPU implementation, which
is just quite slow to begin with, particularly in this day and age
with retpoline etc.

So what's the alternative?

I claim that a good interface would accept that "Oh, a lot of cases
will be synchronous, and a lot of cases use one fixed
encryption/decryption model".

And it's quite doable. Instead of having those callback fields and
indirection etc, you could have something more akin to this:

 - let the caller know what the state size is and allocate the
   synchronous state in its own data structures

 - let the caller just call a static "decrypt_xyz()" function for xyz
   decryption.

 - if you end up doing it synchronously, that function just returns
   "done". No overhead. No extra allocations. No unnecessary stuff.
   Just do it, using the buffers provided. End of story. Efficient
   and simple.

 - BUT.
 - any hardware could have registered itself for "I can do xyz", and
   the decrypt_xyz() function would know about those, and *if* it has
   a list of accelerators (hopefully sorted by preference etc), it
   would try to use them. And if they take the job (they might not -
   maybe their queues are full, maybe they don't have room for new
   keys at the moment, which might be a separate setup from the
   queues), the "decrypt_xyz()" function returns a _cookie_ for that
   job. It's probably a pre-allocated one (the hw accelerator might
   preallocate a fixed number of in-progress data structures).

And once you have that cookie, and you see "ok, I didn't get the
answer immediately", only THEN do you start filling in things like
callback stuff, or maybe you set up a wait-queue and start waiting
for it, or whatever.

See the difference in models?

One forces that asynchronous model, and actively penalizes the
synchronous one. The other _allows_ an asynchronous model, but is
fine with a synchronous one.

> >                 aead_request_set_callback(req, 0, NULL, NULL);
> >
> This is just inevitable for HW acceleration ...

See above. It really isn't.

You could do it *after* the fact, when you've gotten that ticket from
the hardware. Then you say "ok, if the ticket is done, use these
callbacks". Or "I'll now wait for this ticket to be done" (which is
what the above does by setting the callbacks to zero).

Wouldn't that be lovely for a user?

I suspect it would be a nice model for a hw accelerator too. If you
have full queues or have problems allocating new memory or whatever,
you just let the code fall back to the synchronous interface.

> Trust me, I have a whole list of things I don't like about the
> API myself, it's not exactly ideal for HW acceleration either.

That's the thing. It's actively detrimental for "I have no HW
acceleration". And apparently it's not optimal for you guys either.

> But the point is - there are those cases where you _don't_ know and
> _that_ is what the Crypto API is for. And just generally, crypto
> really _should_ be switchable.

It's very much not what wireguard does. And honestly, most of the
switchable ones have caused way more security problems than they have
"fixed" by being switchable.

               Linus
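PS. Since I keep talking about that "cookie" model in the abstract,
here is roughly what I mean. Again, this is just a sketch - every
name in it is made up:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Caller-owned state; the size is known, so it can be embedded. */
    struct xyz_state {
            uint8_t opaque[64];
    };

    struct xyz_cookie;  /* preallocated by the hw driver for in-flight jobs */

    enum xyz_status {
            XYZ_DONE,          /* done synchronously, result is in the buffer */
            XYZ_BAD_TAG,       /* done synchronously, authentication failed */
            XYZ_IN_PROGRESS,   /* some accelerator took the job */
    };

    struct xyz_result {
            enum xyz_status status;
            struct xyz_cookie *cookie;   /* only valid for XYZ_IN_PROGRESS */
    };

    /*
     * One direct call.  The CPU path just does the work and returns
     * XYZ_DONE or XYZ_BAD_TAG.  If a registered accelerator accepts
     * the job (queues not full, key slot available, whatever), the
     * caller gets a cookie back instead.
     */
    struct xyz_result decrypt_xyz(uint8_t *buf, size_t len, uint64_t nonce,
                                  const uint8_t *key,
                                  struct xyz_state *state);

    /* Completion plumbing exists, but only gets touched after the fact. */
    void xyz_cookie_set_callback(struct xyz_cookie *cookie,
                                 void (*done)(void *ctx, bool ok), void *ctx);

and the caller side would look something like

    struct my_peer {
            struct xyz_state state;   /* embedded, not separately allocated */
            /* ... */
    };

    static void rx_done(void *ctx, bool ok)
    {
            /* async completion for the rare hw-accelerated case */
    }

    static void rx_packet(struct my_peer *peer, uint8_t *buf, size_t len,
                          uint64_t nonce, const uint8_t *key)
    {
            struct xyz_result r = decrypt_xyz(buf, len, nonce, key,
                                              &peer->state);

            switch (r.status) {
            case XYZ_DONE:
                    /* the common case: nothing allocated, nothing to wait for */
                    break;
            case XYZ_BAD_TAG:
                    /* drop the packet */
                    break;
            case XYZ_IN_PROGRESS:
                    /* only NOW do we set up callbacks / waiting */
                    xyz_cookie_set_callback(r.cookie, rx_done, peer);
                    break;
            }
    }

The synchronous case pays for nothing it doesn't use, and the
hardware still gets its async completion - it's just set up after the
fact instead of unconditionally up front.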