RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

Pascal Van Leeuwen <pvanleeuwen@xxxxxxxxxxxxxx> · Thu, 26 Sep 2019 09:40:42 +0000

> >
> > Replace the chacha20poly1305() library calls with invocations of the
> > RFC7539 AEAD, as implemented by the generic chacha20poly1305 template.
> 
> Honestly, the other patches look fine to me from what I've seen (with
> the small note I had in a separate email for 11/18), but this one I
> consider just nasty, and a prime example of why people hate those
> crypto lookup routines.
> 
> Some of it is just the fundamental and pointless silly indirection,
> that just makes things harder to read, less efficient, and less
> straightforward.
> 
> That's exemplified by this part of the patch:
> 
> >  struct noise_symmetric_key {
> > -       u8 key[NOISE_SYMMETRIC_KEY_LEN];
> > +       struct crypto_aead *tfm;
> 
> which is just one of those "we know what we want and we just want to
> use it directly" things, and then the crypto indirection comes along
> and makes that simple inline allocation of a small constant size
> (afaik it is CHACHA20POLY1305_KEY_SIZE, which is 32) be another
> allocation entirely.
> 
> And it's some random odd non-typed thing too, so then you have that
> silly and stupid dynamic allocation using a name lookup:
> 
>    crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, CRYPTO_ALG_ASYNC);
> 
> to create what used to be (and should be) a simple allocation that was
> has a static type and was just part of the code.
> 
While I agree with the principle of first merging Wireguard without 
hooking it up to the Crypto API and doing the latter in a later,
separate patch, I DONT'T agree with your bashing of the Crypto API
or HW crypto acceleration in general.

Yes, I do agree  that if you need to do the occasional single crypto 
op for a fixed algorithm on a small amount of data then you should
just use a simple direct  library call. I'm all for a Zinc type 
library for that.
(and I believe Ard is actually actively making such changes already)

However, if you're doing bulk crypto like network packet processing
(as Wireguard does!) or disk/filesystem encryption, then that cipher
allocation only happens once every blue moon and the overhead for
that is totally *irrelevant* as it is amortized over many hours or 
days of runtime.

While I generally dislike this whole hype of storing stuff in
textual formats like XML and JSON and then wasting lots of CPU
cycles on parsing that, I've learned to appreciate the power of
these textual Crypto API templates, as they allow a hardware 
accelerator to advertise complex combined operations as single
atomic calls, amortizing the communication overhead between SW
and HW. It's actually very flexible and powerful!

> It also ends up doing other bad things, ie that packet-time
> 
> +       if (unlikely(crypto_aead_reqsize(key->tfm) > 0)) {
> +               req = aead_request_alloc(key->tfm, GFP_ATOMIC);
> +               if (!req)
> +                       return false;
> 
> thing that hopefully _is_ unlikely, but that's just more potential
> breakage from that whole async crypto interface.
> 
> This is what people do *not* want to do, and why people often don't
> like the crypto interfaces.
> 
Life is all about needing to do things you don't like to do ...
If you want the performance, you need to do the effort. That simple.
HW acceleration surely won't work from a naive synchronous interface.
(Same applies to running crypto in a separate thread on the CPU BTW!)

In any case, Wireguard bulk crypto *should* eventually run on top
of Crypto API such that it can leverage *existing* HW acceleration.
It would be incredibly silly not to do so, given the HW exists!

> And I'm still not convinced (a) ever makes sense - the overhead of any
> accelerator is just high enought that I doubt you'll have numbers -
> performance _or_ power.
> 
You shouldn't make such assertions if you obviously don't know what
you're talking about. Yes, there is significant overhead on the CPU
for doing lookaside crypto, but it's (usually) nothing compared to
doing the actual crypto itself on the CPU barring a few exceptions. 
(Notably AES-GCM or AES-CTR on ARM64 or x64 CPU's and *maybe* 
Chacha-Poly on recent Intel CPU's - but there's a *lot* more crypto 
being used out there than just AES-GCM and Chacha-Poly, not to 
mention a lot more less capable (embedded) CPU's running Linux)

For anything but those exceptions, we blow even the fastest Intel
server CPU's out of the water with our crypto accelerators.
(I can bore you with some figures actually measured with the
Crypto API on our HW, once I'm done optimizing the driver and I 
have some time to collect the results)

And in any case, for somewhat larger blocks/packets, the overhead
on the CPU would at least be such that it's less than what the CPU
would need to do the crypto itself - even if it's faster - such that
there is room there to do *other*, presumably more useful, work.

Then there's indeed the power consumption issue, which is complex
because crypto power != total system power so it depends too much on
the actual use case to make generic statements on it. So I'll leave
that with the remark that Intel server CPU's have to seriously
throttle down their clock if you start using AVX512 for crypto, just to
stay within their power budget, while we can do the same performance
(~200 Gbps) in just a few (~2) Watts on a similar technology node.
(excluding the CPU management overhead, but that surely won't consume
excessive power like AVX512)

> But even if you're right that it might be a power advantage on some
> platform, that wouldn't make it an advantage on other platforms. Maybe
> it could be done as a config option where you can opt in to the async
> interface when that makes sense - but not force the indirection and
> extra allocations when it doesn't. As a separate patch, something like
> that doesn't sound horrendous (and I think that's also an argument for
> doing that CPU->LE change as an independent change).
> 
Making it a switch sounds good to me though.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com