> > That remark is just very stupid. The hardware ALREADY exists, and
> > more hardware is in the pipeline. Once this stuff is designed in, it
> > usually stays in for many years to come. And these are chips sold in
> > _serious_ quantities, to be used in things like wireless routers and
> > DSL, cable and FTTH modems, 5G base stations, etc. etc.
>
> Yes, I very much mentioned routers. I believe those can happen much
> more quickly.
>
> But I would very much hope that that is not the only situation where
> you'd see wireguard used.
>
Same here.

> I'd want to see wireguard in an end-to-end situation from the very
> client hardware. So laptops, phones, desktops. Not the untrusted (to
> me) hw in between.
>
I don't see why the crypto HW would deserve any less trust than, say,
the CPU itself. I would say CPUs don't deserve that trust at the moment.

> > No, these are just the routers going into *everyone's* home. And 5G
> > basestations arriving at every other street corner. I wouldn't call
> > that rare, exactly.
>
> That's fine for a corporate tunnel between devices. Which is certainly
> one use case for wireguard.
>
> But if you want VPN for your own needs for security, you want it at
> the _client_. Not at the router box. So that case really does matter.
>
Personally, I would really like it in my router box so my CPU is free
to do useful work instead of boring crypto. And I know there's nothing
untrusted (to me) between my client and the router box, so I don't
need to worry about security there. But hey, that's just me.

> And I really don't see the hardware happening in that space. So the
> bad crypto interfaces only make the client _worse_.
>
Fully agree. We don't focus on the client side with our HW anyway.
(but then there may be that router box in between that can help out)

> See?
>
> But on to the arguments that we actually agree on:
>
> > Hey, no argument there. I don't see any good reason why the key can't
> > be on the stack.
> > I doubt any hardware would be able to DMA that as-is directly, and
> > in any case, key changes should be infrequent, so copying the key to
> > some DMA buffer should not be a performance problem. So maybe that's
> > an area for improvement: allow that to be on the stack.
>
> It's not even just the stack. It's really that the crypto interfaces
> are *designed* so that you have to allocate things separately, and
> can't embed these things in your own data structures.
>
> And they are that way, because the crypto interfaces aren't actually
> about (just) hiding the hardware interface: they are about hiding
> _all_ the encryption details.
>
Well, that's the general idea of abstraction. It also allows for
swapping in any other cipher with minimal effort just _because_ the
details were hidden from the application. So it may cost you some
effort initially, but it may save you effort later.

> There's no way to say "hey, I know the crypto I use, I know the key
> size I have, I know the state size it needs, I can preallocate those
> AS PART of my own data structures".
>
> Because the interface is designed to be so "generic" that you simply
> can't do those things, they are all external allocations, which is
> inevitably slower when you don't have hardware.
>
Hmm, OK, I see your point here. But most of those data structures
(like the key) should be allocated infrequently anyway, so you can
amortize that cost over _many_ crypto operations.

You _do_ realize that performing the key schedule for e.g. AES with
AES-NI also takes quite a lot of time? So you should keep your keys
alive and not reload them all the time anyway.

But I already agreed with you that there may be cases where you just
want to call the library function directly. Wireguard just isn't one
of those cases, IMHO.

> And you've shown that you don't care about that "don't have hardware"
> situation, and seem to think it's the only case that matters. That's
> your job, after all.
>
I don't recall putting it that strongly ... and I certainly never said
the HW acceleration thing is the _only_ case that matters. But it does
matter _significantly_ to me, for blatantly obvious reasons.

> But however much you try to claim otherwise, there's all these
> situations where the hardware just isn't there, and the crypto
> interface just forces nasty overhead for absolutely no good reason.
>
> > I already explained the reasons for _not_ doing direct calls above.
>
> And I've tried to explain how direct calls that do the synchronous
> thing efficiently would be possible, but then _if_ there is hardware,
> they can then fall back to an async interface.
>
OK, I did not fully get that latter part. I would be fine with such an
approach for use cases (i.e. fixed, known crypto) where that makes
sense. It would actually be better than calling the SW-only library
directly (which was my suggestion), as it would still allow HW
acceleration as an option ...

> > > So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> > > do it right, and use an interface like this:
> > >
> > >         if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
> > >                                          PACKET_CB(skb)->nonce,
> > >                                          key->key, simd_context))
> > >                 return false;
> > >
> > Well, for one thing, a HW API should not expect the result to be
> > available when the function call returns. (if that's what you
> > mean here). That would just be WRONG.
>
> Right. But that also shouldn't mean that when you have synchronous
> hardware (ie CPU) you have to set everything up even though it will
> never be used.
>
> Put another way: even with hardware acceleration, the queuing
> interface should be a simple "do this" interface.
>
OK, I don't think we disagree there. I _like_ simple. As long as it
doesn't sacrifice functionality I care about.

> The current crypto interface is basically something that requires all
> the setup up-front, whether it's needed or not.
> And it forces those very inconvenient and slow external allocations.
>
But you should do the setup work (if by "setup" you mean things like
cipher allocation, key setup and request allocation) only _once_ in a
_long_ while. You can just keep using it for the lifetime of the
application (or key, for the key setup part).

If I look at my cipher fallback paths in the driver (the only places
where I actually get to _use_ the API from the "other" side), then per
actual individual request they _only_ do the following - the rest is
all preallocated earlier:

  _set_callback()
  _set_crypt()
  _set_ad()
  _encrypt() or _decrypt()

And now that I look at that, I think the _set_callback() could move to
the setup phase, as it's always the same callback function. Probably,
in the case of Wireguard, you could even move the _set_ad() there, as
it's always zero and the crypto driver is not allowed to overwrite it
in the request struct anyway.

Also, I already agreed with you that _set_crypt(), _set_ad() and
_encrypt()/_decrypt() _could_ be conveniently wrapped into one API
call instead of three separate ones if we think that's worth it.

BUT ... actually ... I just looked at the actual _implementation_, and
it turns out these are _inlineable_ functions defined in the header
file that _just_ write to some struct fields. So they should not end
up being function calls at all(!!). _Only_ the _encrypt()/_decrypt()
invocation will end up as a true (indirect) function call.

So where are all those allocations you mention? Have you ever actually
_used_ the Crypto API for anything?

Yes, if you actually want to _queue_ requests, you need to use one
request struct for every queued operation, but you could just
preallocate an array of them that you cycle through. No need to do
those allocations in the hot path.

So is your problem really with the API _itself_, or with incorrect/
inefficient _use_ of the API in some places?
> And I'm saying that causes problems, because it fundamentally means
> that you can't do a good job for the common CPU case, because you're
> paying all those costs even when you need absolutely none of them.
> Both at setup time, but also at run-time due to the extra indirection
> and cache misses etc.
>
There is some cost, sure, but is it _significant_ for any use case
that _matters_? You started bringing up optimization rules, so how
about Amdahl's law?

> > Again, HW acceleration does not depend on the indirection _at all_,
> > that's there for entirely different purposes I explained above.
> > HW acceleration _does_ depend greatly on a truly async ifc though.
>
> Can you realize that the world isn't just all hw acceleration?
>
Sure. But there's also a lot of HW acceleration _already out there_
that _could_ have been used if only the proper SW APIs had existed.
Welcome to _my_ world.

> Can you admit that the current crypto interface is just horrid for the
> non-accelerated case?
>
Is agreeing that it is not perfect sufficient for you? :-)

> Can you perhaps then also think that "maybe there are better models".
>
Sure. There's always better. There's also good enough, though ...

> > So queue requests on one side, handle results from the other side
> > in some callback func off of an interrupt handler.
>
> Actually, what you can do - and what people *have* done - is to admit
> that the synchronous case is real and important, and then design
> interfaces that work for that one too.
>
But they _do_ work for that case as well. I still haven't seen any
solid evidence that they are as horribly inefficient as you are
implying for _real life_ use cases. And even if they are, then there's
the question whether that is the fault of the API or of incorrect use
thereof.

> You don't need to allocate resources ahead of time, and you don't have
> to disallow just having the state buffer allocated by the caller.
> > So here's the *wrong* way to do it (and the way that crypto does it):
>
> - dynamically allocate buffers at "init time"
>
Why is that so "wrong"? It sure beats doing allocations on the hot
path. But yes, some stuff should be allowed to live on the stack.

Some other stuff can't be on the stack though, as the stack is gone
when the calling function exits while the background crypto processing
still needs it. And you don't want to have it on the stack initially
and then have to _copy_ it to some DMA-able location that you allocate
on the fly on the hot path if you _do_ want HW acceleration.

> - fill in "callback fields" etc before starting the crypto, whether
> they are needed or not
>
I think this can be done _once_ at request allocation time. But it's
just one function pointer write anyway. Is that significant? Or: _if_
that is significant, you shouldn't be using the Crypto API for that
use case in the first place.

> - call a "decrypt" function that then uses the indirect functions you
> set up at init time, and possibly waits for it (or calls the callbacks
> you set up)
>
> note how it's all this "state machine" model where you add data to the
> state machine, and at some point you say "execute" and then either you
> wait for things or you get callbacks.
>
Not sure how splitting the data setup over a few separate "function"
calls suddenly makes it a "state machine model" ... But yes, I can
understand why the completion handling through this callback function
seems like unnecessary complication for the SW case.

> That makes sense for a hw crypto engine. It's how a lot of them work,
> after all.
>
Oh really? I can't speak for other people's stuff, but for our
hardware you post a request to it and then go off to do other stuff
while the HW does its thing, after which it will inform you it's done
by means of an interrupt. I don't see how this relates to the
"state machine model" above; there is no persistent state involved,
it's all included in the request.
The _only_ thing that matters is that you realize it's a pipeline that
needs to be kept filled, and that has latency >> throughput, just like
your CPU pipeline.

> But it makes _zero_ sense for the synchronous case. You did a lot of
> extra work for that case, and because it was all a state machine, you
> did it particularly inefficiently: not only do you have those separate
> allocations with pointer following, the "decrypt()" call ends up doing
> an indirect call to the CPU implementation, which is just quite slow
> to begin with, particularly in this day and age with retpoline etc.
>
> So what's the alternative?
>
> I claim that a good interface would accept that "Oh, a lot of cases
> will be synchronous, and a lot of cases use one fixed
> encryption/decryption model".
>
> And it's quite doable. Instead of having those callback fields and
> indirection etc, you could have something more akin to this:
>
> - let the caller know what the state size is and allocate the
> synchronous state in its own data structures
>
> - let the caller just call a static "decrypt_xyz()" function for xyz
> decryption.
>
Fine for those few cases where the algorithm is known and fixed. (You
do realize that the primary use cases are IPsec, dm-crypt and fscrypt,
where that is most definitely _not_ the case?)

Also, you're still ignoring the fact that there is not one, single,
optimal CPU implementation either. You have to select that as well,
based on CPU features. So it's either an indirect function call that
would be well predictable - as it's always the same at that point in
the program - or it's a deep if-else tree (which might actually be
implemented by the compiler as an indirect (table) jump ...) selecting
the fastest implementation, either SW _or_ HW.

> - if you end up doing it synchronously, that function just returns
> "done". No overhead. No extra allocations. No unnecessary stuff. Just
> do it, using the buffers provided. End of story. Efficient and simple.
>
I don't see which "extra allocations" you would be saving here. Those
shouldn't happen in the hot path either way.

> - BUT.
>
> - any hardware could have registered itself for "I can do xyz", and
> the decrypt_xyz() function would know about those, and *if* it has a
> list of accelerators (hopefully sorted by preference etc), it would
> try to use them. And if they take the job (they might not - maybe
> their queues are full, maybe they don't have room for new keys at the
> moment, which might be a separate setup from the queues), the
> "decrypt_xyz()" function returns a _cookie_ for that job. It's
> probably a pre-allocated one (the hw accelerator might preallocate a
> fixed number of in-progress data structures).
>
> And once you have that cookie, and you see "ok, I didn't get the
> answer immediately", only THEN do you start filling in things like
> callback stuff, or maybe you set up a wait-queue and start waiting
> for it, or whatever.
>
I don't see the point of saving that single callback pointer write. I
mean, it's just _one_ CPU word memory write, likely to the L1 cache.

But I can see the appeal of getting a "done" response on the
_encrypt()/_decrypt() call and then being able to immediately continue
processing the result data, with the async response handling separated
off. I think it should actually be possible to change the API to work
like that without breaking backward compatibility, i.e. define some
flag specifying that you actually _want_ this behavior, and then
define some return code that says "I'm done processing, carry on
please".

> See the difference in models? One forces that asynchronous model, and
> actively penalizes the synchronous one.
>
> The other _allows_ an asynchronous model, but is fine with a
> synchronous one.
>
> > >                 aead_request_set_callback(req, 0, NULL, NULL);
> >
> > This is just inevitable for HW acceleration ...
>
> See above. It really isn't. You could do it *after* the fact,
>
Before ... after ...
the point was: you need it. And it's a totally insignificant saving
anyway.

> when you've gotten that ticket from the hardware. Then you say "ok,
> if the ticket is done, use these callbacks". Or "I'll now wait for
> this ticket to be done" (which is what the above does by setting the
> callbacks to zero).
>
> Wouldn't that be lovely for a user?
>
Yes and no. Because the user would _still_ need to handle the case of
callbacks, in case the request _does_ go to the HW accelerator. So you
keep the main processing path clean, I suppose, saving some cycles
there, but you still have this case of callbacks, and of having
multiple requests queued, that you need to handle as well - which now
becomes a separate _exception_ case. You now have two distinct
processing paths you have to manage from your application. How is that
an _improvement_ for the user? (notwithstanding that it may be an
improvement to SW-only performance)

> I suspect it would be a nice model for a hw accelerator too. If you
> have full queues or have problems allocating new memory or whatever,
> you just let the code fall back to the synchronous interface.
>
HW drivers typically _do_ use a SW fallback for cases they cannot
handle. Actually, that works very nicely with the current API, with
the fallback cipher just being attached to the original request's
callback function ... i.e. just do a tail call to the fallback cipher
request.

> > Trust me, I have a whole list of things I don't like about the
> > API myself, it's not exactly ideal for HW acceleration either.
>
> That's the thing. It's actively detrimental for "I have no HW
> acceleration".
>
You keep asserting that with no evidence whatsoever.

> And apparently it's not optimal for you guys either.
>
True, but I accept the fact that it needs to be that way because some
_other_ HW may drive that requirement. I accept the fact that I'm not
alone in the world.
> > But the point is - there are those cases where you _don't_ know, and
> > _that_ is what the Crypto API is for. And just generally, crypto
> > really _should_ be switchable.
>
> It's very much not what wireguard does.
>
And that's very much a part of Wireguard that is _broken_. I like
Wireguard for a lot of things, but its single-cipher focus is not one
of them. Especially since all the crypto it uses comes from a single
source (DJB), which is frowned upon in the industry.

Crypto agility is a very important _security_ feature, and the whole
argument Jason makes that it is actually a weakness is _bullshit_.
(Just because SSL _implemented_ this horribly wrong doesn't mean it's
a bad thing to do - it's not, it's actually _necessary_. The
alternative would be to either continue using broken crypto, or to
wait _months_ for a new implementation to reach your devices when the
crypto gets broken somehow. Not good.)

> And honestly, most of the switchable ones have caused way more
> security problems than they have "fixed" by being switchable.
>
"Most of the switchable ones"? You mean _just_ SSL/TLS. SSL/TLS before
1.3 just sucked security-wise, on so many levels. That has _nothing_
to do with the very desirable feature of crypto agility. It _can_ be
done properly and securely. (for one thing, it does not _need_ to be
negotiable)

>               Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com