Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



 Hi Ard,

Thanks for taking the initiative on this. When I first discussed with
DaveM porting WireGuard to the crypto API and doing Zinc later
yesterday, I thought to myself, “I wonder if Ard might want to work on
this with me…” and sent you a message on IRC. It didn’t occur to me
that you were the one who had pushed this endeavor!

I must admit, though, I’m a bit surprised to see how it’s appearing.
When I wrote [1], I had really imagined postponing the goals of Zinc
entirely, and instead writing a small shim that calls into the
existing crypto API machinery. I imagined the series to look like
this:

1. Add blake2s generic as a crypto API shash object.
2. Add blake2s x86_64 as a crypto API shash object.
3. Add curve25519 generic as a crypto API dh object.
4. Add curve25519 x86_64 as a crypto API dh object.
5. Add curve25519 arm as a crypto API dh object.
6. The unmodified WireGuard commit.
7. A “cryptoapi.c” file for WireGuard that provides definitions of the
“just functions” approach of Zinc, but does so in terms of the crypto
API’s infrastructure, with global per-cpu lists and a few locks to
handle quick buffer and tfm reuse.

I wouldn’t expect (7) to be pretty, for the various reasons that most
people dislike the crypto API, but at least it would somewhat “work”,
not affect the general integrity of WireGuard, and provide a clear
path forward in an evolutionary manner for gradually, piecemeal,
swapping out pieces of that for a Zinc-like thing, however that winds
up appearing.

Instead what we’ve wound up with in this series is a Frankenstein’s
monster of Zinc, which appears to have basically the same goal as
Zinc, and even much of the same implementation just moved to a
different directory, but then skimps on making it actually work well
and introduces problems. (I’ll elucidate on some specific issues later
in this email so that we can get on the same page with regards to
security requirements for WireGuard.) I surmise from this Zinc-but-not
series is that what actually is going on here is mostly some kind of
power or leadership situation, which is what you’ve described to me
also at various other points and in person. I also recognize that I am
at least part way to blame for whatever dynamic there has stagnated
this process; let me try to rectify that:

A principle objection you’ve had is that Zinc moves to its own
directory, with its own name, and tries to segment itself off from the
rest of the crypto API’s infrastructure. You’ve always felt this
should be mixed in with the rest of the crypto API’s infrastructure
and directory structures in one way or another. Let’s do both of those
things – put this in a directory structure you find appropriate and
hook this into the rest of the crypto API’s infrastructure in a way
you find appropriate. I might disagree, which is why Zinc does things
the way it does, but I’m open to compromise and doing things more your
way.

Another objection you’ve had is that Zinc replaces many existing
implementations with its own. Martin wasn’t happy about that either.
So let’s not do that, and we’ll have some wholesale replacement of
implementations in future patchsets at future dates discussed and
benched and bikeshedded independently from this.

Finally, perhaps most importantly, Zinc’s been my design rather than
our design. Let’s do this together instead of me git-send-email(1)-ing
a v37.

If the process of doing that together will be fraught with difficulty,
I’m still open to the “7 patch series” with the ugly cryptoapi.c
approach, as described at the top. But I think if we start with Zinc
and whittle it down in accordance with the above, we’ll get something
mutually acceptable, and somewhat similar to this series, with a few
important exceptions, which illustrate some of the issues I see in
this RFC:

Issue 1) No fast implementations for the “it’s just functions” interface.

This is a deal breaker. I know you disagree here and perhaps think all
dynamic dispatch should be by loadable modules configured with
userspace policy and lots of function pointers and dynamically
composable DSL strings, as the current crypto API does it. But I think
a lot of other people agree with me here (and they’ve chimed in
before) that the branch predictor does things better, doesn’t have
Spectre issues, and is very simple to read and understand. For
reference, here’s what that kind of thing looks like: [2].

In this case, the relevance is that the handshake in WireGuard is
extremely performance sensitive, in order to fend off DoS. One of the
big design gambits in WireGuard is – can we make it 1-RTT to reduce
the complexity of the state machine, but keep the crypto efficient
enough that this is still safe to do from a DoS perspective. The
protocol succeeds at this goal, but in many ways, just by a hair when
at scale, and so I’m really quite loathe to decrease handshake
performance. Here’s where that matters specifically:

- Curve25519 does indeed always appear to be taking tiny 32 byte stack
inputs in WireGuard. However, your statement, “the fact that they
operate on small, fixed size buffers means that there is really no
point in providing alternative, SIMD based implementations of these,
and we can limit ourselves to generic C library version,” is just
plain wrong in this case. Curve25519 only ever operates on 32 byte
inputs, because these represent curve scalars and points. It’s not
like a block cipher where parallelism helps with larger inputs or
something. In this case, there are some pretty massive speed
improvements between the generic C implementations and the optimized
ones. Like huge. On both ARM and on Intel. And Curve25519 is the most
expensive operation in WireGuard, and each handshake message invokes a
few of them. (Aside - Something to look forward to: I’m in the process
of getting a formally verified x86_64 ADX implementation ready for
kernel usage, to replace our existing heavily-fuzzed one, which will
be cool.)

- Blake2s actually does benefit from the optimized code even for
relatively short inputs. While you might have been focused on the
super-super small inputs in noise.c, there are slightly larger ones in
cookie.c, and these are the most sensitive computations to make in
terms of DoS resistance; they’re on the “front lines” of the battle,
if you will. (Aside - Arguably WireGuard may have benefited from using
siphash with 128-bit outputs here, or calculated some security metrics
for DoS resistance in the face of forged 64-bit outputs or something,
or a different custom MAC, but hindsight is 20/20.)

- While 25519 and Blake2s are already in use, the optimized versions
of ChaPoly wind up being faster as well, even if it’s just hitting the
boring SSE code.

- On MIPS, the optimized versions of ChaPoly are a necessity. They’re
boring integer/scalar code, but they do things that the compiler
simply cannot do on the platform and we benefit immensely from it.

Taken together, we simply can’t skimp on the implementations available
on the handshake layer, so we’ll need to add some form of
implementation selection, whether it’s the method Zinc uses ([2]), or
something else we cook up together.

Issue 2) Linus’ objection to the async API invasion is more correct
than he realizes.

I could re-enumerate my objections to the API there, but I think we
all get it. It’s horrendous looking. Even the introduction of the
ivpad member (what on earth?) in the skb cb made me shutter. But
there’s actually another issue at play:

wg_noise_handshake_begin_session→derive_keys→symmetric_key_init is all
part of the handshake. We cannot afford to allocate a brand new crypto
object, parse the DSL string, connect all those function pointers,
etc. The allocations involved here aren’t really okay at all in that
path. That’s why the cryptoapi.c idea above involves just using a pool
of pre-allocated objects if we’re going to be using that API at all.
Also keep in mind that WireGuard instances sometimes have hundreds of
thousands of peers.

I’d recommend leaving this synchronous as it exists now, as Linus
suggested, and we can revisit that later down the road. There are a
number of improvements to the async API we could make down the line
that could make this viable in WireGuard. For example, I could imagine
decoupling the creation of the cipher object from its keys and
intermediate buffers, so that we could in fact allocate the cipher
objects with their DSLs globally in a safe way, while allowing the
keys and working buffers to come from elsewhere. This is deep plumbing
into the async API, but I think we could get there in time.

There’s also a degree of practicality: right now there is zero ChaPoly
async acceleration hardware anywhere that would fit into the crypto
API. At some point, it might come to exist and have incredible
performance, and then we’ll both feel very motivated to make this work
for WireGuard. But it might also not come to be (AES seems to have won
over most of the industry), in which case, why hassle?

Issue 3) WireGuard patch is out of date.

This is my fault, because I haven’t posted in a long time. There are
some important changes in the main WireGuard repo. I’ll roll another
patch soon for this so we have something recent to work off of. Sorry
about that.

Issue 4) FPU register batching?

When I introduced the simd_get/simd_put/simd_relax thing, people
seemed to think it was a good idea. My benchmarks of it showed
significant throughput improvements. Your patchset doesn’t have
anything similar to this. But on the other hand, last I spoke with the
x86 FPU guys, I thought they might actually be in the process of
making simd_get/put obsolete with some internal plumbing to make
restoration lazier. I’ll see tglx later today and will poke him about
this, as this might already be a non-issue.


So given the above, how would you like to proceed? My personal
preference would be to see you start with the Zinc patchset and rename
things and change the infrastructure to something that fits your
preferences, and we can see what that looks like. Less appealing would
be to do several iterations of you reworking Zinc from scratch and
going through the exercises all over again, but if you prefer that I
guess I could cope. Alternatively, maybe this is a lot to chew on, and
we should just throw caution into the wind, implement cryptoapi.c for
WireGuard (as described at the top), and add C functions to the crypto
API sometime later? This is what I had envisioned in [1].

And for the avoidance of doubt, or in case any of the above message
belied something different, I really am happy and relieved to have an
opportunity to work on this _with you_, and I am much more open than
before to compromise and finding practical solutions to the past
political issues. Also, if you’re into chat, we can always spec some
of the nitty-gritty aspects out over IRC or even the old-fashioned
telephone. Thanks again for pushing this forward.

Regards,
Jason

[1] https://lore.kernel.org/wireguard/CAHmME9pmfZAp5zd9BDLFc2fWUhtzZcjYZc2atTPTyNFFmEdHLg@xxxxxxxxxxxxxx/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54




[Index of Archives]     [Kernel]     [Gnu Classpath]     [Gnu Crypto]     [DM Crypt]     [Netfilter]     [Bugtraq]

  Powered by Linux