Hi Ard, On Thu, Sep 26, 2019 at 2:07 PM Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote: > attitude goes counter to that, and this is why we have made so little > progress over the past year. I also just haven't submitted much in the past year, taking a bit of a break to see how things would settle. Seemed like rushing things wasn't prudent, so I slowed down. > But I am happy with your willingness to collaborate and find common > ground, which was also my motivation for spending a considerable > amount of time to prepare this patch set. Super. > > If the process of doing that together will be fraught with difficulty, > > I’m still open to the “7 patch series” with the ugly cryptoapi.c > > approach, as described at the top. > > If your aim is to write ugly code and use that as a munition No, this is not a matter of munition at all. Please take my words seriously; I am entirely genuine here. Three people I greatly respect made a very compelling argument to me, prompting the decision in [1]. The argument was that trying to fix the crypto layer AND trying to get WireGuard merged at the same time was ambitious and crazy. Maybe instead, they argued, I should just use the old crypto API, get WireGuard in, and then begin the Zinc process after. I think procedurally, that's probably good advice, and the people I was talking to seemed to have a really firm grasp on what works and what doesn't in the mainlining process. Now it's possible their judgement is wrong, but I really am open, in earnest, to following it. And the way that would look would be not trying to fix the crypto API now, getting WireGuard in based on whatever we can cobble together based on the current foundations with some intermediate file (cryptoapi.c in my previous email) to prevent it from infecting WireGuard. This isn't "munition"; it's a serious proposal. The funny thing, though, is that all the while I was under the impression somebody had figured out a great way to do this, it turns out you were busy with basically Zinc-but-not. So we're back to square one: you and I both want the crypto API to change, and now we have to figure out a way forward together on how to do this, prompting my last email to you, indicating that I was open to all sorts of compromises. However, I still remain fully open to following the prior suggestion, of not doing that at all right now, and instead basing this on the existing crypto API as-is. [1] https://lore.kernel.org/wireguard/CAHmME9pmfZAp5zd9BDLFc2fWUhtzZcjYZc2atTPTyNFFmEdHLg@xxxxxxxxxxxxxx/ > > reference, here’s what that kind of thing looks like: [2]. > > This is one of the issues in the 'fix it for everyone else as well' > category. If we can improve the crypto API to be less susceptible to > these issues (e.g., using static calls), everybody benefits. I'd be > happy to collaborate on that. Good. I'm happy to learn that you're all for having fast implementations that underlie the simple function calls. > > Taken together, we simply can’t skimp on the implementations available > > on the handshake layer, so we’ll need to add some form of > > implementation selection, whether it’s the method Zinc uses ([2]), or > > something else we cook up together. > > So are you saying that the handshake timing constraints in the > WireGuard protocol are so stringent that we can't run it securely on, > e.g., an ARM CPU that lacks a NEON unit? Or given that you are not > providing accelerated implementations of blake2s or Curve25519 for > arm64, we can't run it securely on arm64 at all? Deployed at scale, the handshake must have a certain performance to not be DoS'd. I've spent a long time benching these and attacking my own code. I won't be comfortable with this going in without the fast implementations for the handshake. And down the line, too, we can look into how to even improve the DoS resistance. I think there's room for improvement, and I hope at some point you'll join us in discussions on WireGuard internals. But the bottom line is that we need those fast primitives. > Typically, I would prefer to only introduce different versions of the > same algorithm if there is a clear performance benefit for an actual > use case. As I was saying, this is indeed the case. > Framing this as a security issue rather than a performance issue is > slightly disingenuous, since people are less likely to challenge it. I'm not being disingenuous. DoS resistance is a real issue with WireGuard. You might argue that FourQ and Siphash would have made better choices, and that's an interesting discussion, but it is what it is. The thing needs fast implementations. And we're going to have to implement that code anyway for other things, so might as well get it working well now. > But the security of any VPN protocol worth its salt You're not required to use WireGuard. > Parsing the string and connecting the function pointers happens only > once, and only when the transform needs to be instantiated from its > constituent parts. Subsequent invocations will just grab the existing > object. That's good to know. It doesn't fully address the issue, though. > My preference would be to address this by permitting per-request keys > in the AEAD layer. That way, we can instantiate the transform only > once, and just invoke it with the appropriate key on the hot path (and > avoid any per-keypair allocations) That'd be a major improvement to the async interface, yes. > > So given the above, how would you like to proceed? My personal > > preference would be to see you start with the Zinc patchset and rename > > things and change the infrastructure to something that fits your > > preferences, and we can see what that looks like. Less appealing would > > be to do several iterations of you reworking Zinc from scratch and > > going through the exercises all over again, but if you prefer that I > > guess I could cope. Alternatively, maybe this is a lot to chew on, and > > we should just throw caution into the wind, implement cryptoapi.c for > > WireGuard (as described at the top), and add C functions to the crypto > > API sometime later? This is what I had envisioned in [1]. > It all depends on whether we are interested in supporting async > accelerators or not, and it is clear what my position is on this > point. For a first version of WireGuard, no, I'm really not interested in that. Adding it in there is more ambitious than it looks to get it right. Async means more buffers, which means the queuing system for WireGuard needs to be changed. There's already ongoing research into this, and I'm happy to consider that research with a light toward maybe having async stuff in the future. But sticking into the code now as-is simply does not work from a buffering/queueing perspective. So again, let's take an iterative approach here: first we do stuff with the simple synchronous API. After the dust has settled, hardware is available for testing, Van Jacobson has been taken off the bookshelf for a fresh reading, and we've all sat down for a few interesting conversations at netdev on queueing and bufferbloat, then let's start working this in. In otherwords, just because technically you can glue those APIs together, sort of, doesn't mean that approach makes sense for the system as a whole. > I am not convinced that we need accelerated implementations of blake2s > and curve25519, but if we do, I'd like those to be implemented as > individual modules under arch/*/crypto, with some moduleloader fu for > weak symbols or static calls thrown in if we have to. Exposing them as > shashes seems unnecessary to me at this point. We need the accelerated implementations. And we'll need it for chapoly too, obviously. So let's work out a good way to hook that all into the Zinc-style interface. [2] does it in a very effective way that's overall quite good for performance and easy to follow. The chacha20-x86_64-glue.c code itself gets called via the static symbol chacha20_arch. This is implemented for each platform with a fall back to one that returns false, so that the generic code is called. The Zinc stuff here is obvious, simple, and I'm pretty sure you know what's up with it. I prefer each of these glue implementations to live in lib/zinc/chacha20/chacha20-${ARCH}-glue.c. You don't like that and want things in arch/${ARCH}/crypto/chacha20-glue.c. Okay, sure, fine, let's do all the naming and organization and political stuff how you like, and I'll leave aside my arguments about why I disagree. Let's take stock of where that leaves us, in terms of files: - lib/crypto/chacha20.c: this has a generic implementation, but at the top of the generic implementation, it has some code like "if (chacha20_arch(..., ..., ...)) return;" - arch/crypto/x86_64/chacha20-glue.c: this has the chacha20_arch() implementation, which branches out to the various SIMD implementations depending on some booleans calculated at module load time. - arch/crypto/arm/chacha20-glue.c: this has the chacha20_arch() implementation, which branches out to the various SIMD implementations depending on some booleans calculated at module load time. - arch/crypto/mips/chacha20-glue.c: this has the chacha20_arch() implementation, which contains an assembly version that's always run unconditionally. Our goals are that chacha20_arch() from each of these arch glues gets included in the lib/crypto/chacha20.c compilation unit. The reason why we want it in its own unit is so that the inliner can get rid of unreached code and more tightly integrate the branches. For the MIPS case, the advantage is clear. Here's how Zinc handles it: [3]. Some simple ifdefs and includes. Easy and straightforward. Some people might object, though, and it sounds like you might. So let's talk about some alternative mechanisms with their pros and cons: - The zinc method: straightforward, but not everybody likes ifdefs. - Stick the filename to be included into a Kconfig variable and let the configuration system do the logic: also straightforward. Not sure it's kosher, but it would work. - Weak symbols: we don't get inlining or the dead code elimination. - Function pointers: ditto, plus spectre. - Other ideas you might have? I'm open to suggestions here. [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54 [3] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n19 > What I *don't* want is to merge WireGuard with its own library based > crypto now, and extend that later for async accelerators once people > realize that we really do need that as well. I wouldn't worry so much about that. Zinc/library-based-crypto is just trying to fulfill the boring synchronous pure-code part of crypto implementations. For the async stuff, we can work together on improving the existing crypto API to be more appealing, in tandem with some interesting research into packet queuing systems. From the other thread, you might have seen that already Toke has cool ideas that I hope we can all sit down and talk about. I'm certainly not interested in "bolting" anything on to Zinc/library-based-crypto, and I'm happy to work to improve the asynchronous API for doing asynchronous crypto. Jason