RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

Pascal Van Leeuwen <pvanleeuwen@xxxxxxxxxxxxxx> · Fri, 27 Sep 2019 10:11:55 +0000

> -----Original Message-----
> From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Sent: Friday, September 27, 2019 4:06 AM
> To: Pascal Van Leeuwen <pvanleeuwen@xxxxxxxxxxxxxx>
> Cc: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>; Linux Crypto Mailing List <linux-
> crypto@xxxxxxxxxxxxxxx>; Linux ARM <linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>; Herbert Xu
> <herbert@xxxxxxxxxxxxxxxxxxx>; David Miller <davem@xxxxxxxxxxxxx>; Greg KH
> <gregkh@xxxxxxxxxxxxxxxxxxx>; Jason A . Donenfeld <Jason@xxxxxxxxx>; Samuel Neves
> <sneves@xxxxxxxxx>; Dan Carpenter <dan.carpenter@xxxxxxxxxx>; Arnd Bergmann
> <arnd@xxxxxxxx>; Eric Biggers <ebiggers@xxxxxxxxxx>; Andy Lutomirski <luto@xxxxxxxxxx>;
> Will Deacon <will@xxxxxxxxxx>; Marc Zyngier <maz@xxxxxxxxxx>; Catalin Marinas
> <catalin.marinas@xxxxxxx>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet
> encryption
> 
> On Thu, Sep 26, 2019 at 5:15 PM Pascal Van Leeuwen
> <pvanleeuwen@xxxxxxxxxxxxxx> wrote:
> >
> > But even the CPU only thing may have several implementations, of which
> > you want to select the fastest one supported by the _detected_ CPU
> > features (i.e. SSE, AES-NI, AVX, AVX512, NEON, etc. etc.)
> > Do you think this would still be efficient if that would be some
> > large if-else tree? Also, such a fixed implementation wouldn't scale.
> 
> Just a note on this part.
> 
> Yes, with retpoline a large if-else tree is actually *way* better for
> performance these days than even just one single indirect call. I
> think the cross-over point is somewhere around 20 if-statements.
> 
Yikes, that is just _horrible_ :-(

_However_ there are many CPU architectures out there that _don't_ need
the retpoline mitigation and would be unfairly penalized by the deep
if-else tree (as opposed to the indirect branch) for a problem they
did not cause in the first place.

Wouldn't it be fairer to impose the penalty on the CPUs actually
_causing_ this problem? Those are generally the more powerful CPUs
anyway, so they would suffer the least from the overhead.

> But those kinds of things also are things that we already handle well
> with instruction rewriting, so they can actually have even less of an
> overhead than a conditional branch. Using code like
> 
>   if (static_cpu_has(X86_FEATURE_AVX2))
> 
> actually ends up patching the code at run-time, so you end up having
> just an unconditional branch. Exactly because CPU feature choices
> often end up being in critical code-paths where you have
> one-or-the-other kind of setup.
> 
> And yes, one of the big users of this is very much the crypto library code.
> 
Ok, I didn't know about that. So I suppose we could have something
like if (static_soc_has(HW_CRYPTO_ACCELERATOR_XYZ)) ... Hmmm ...
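
To make the two dispatch styles under discussion concrete, here is a
minimal userspace sketch, not kernel code. The feature bits, the sum_*
helpers and the "accelerator" path are all hypothetical stand-ins made up
for illustration; none of them correspond to a real kernel API, and the
sketch does no run-time code patching of the kind static_cpu_has() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; not real kernel or hardware identifiers. */
enum {
	FEAT_SIMD  = 1u << 0,
	FEAT_ACCEL = 1u << 1,
};

typedef uint32_t (*sum_fn)(const uint8_t *buf, size_t len);

/* Baseline implementation: plain byte-wise sum. */
static uint32_t sum_generic(const uint8_t *buf, size_t len)
{
	uint32_t s = 0;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++)
		s += buf[i];
	return s;
}

/* Stand-in for a CPU-specific variant: same result, unrolled by four. */
static uint32_t sum_simd(const uint8_t *buf, size_t len)
{
	uint32_t s = 0;
	size_t i = 0;

	for (; i + 4 <= len; i += 4)
		s += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
	for (; i < len; i++)
		s += buf[i];
	return s;
}

/* Stand-in for an accelerator-backed variant; here it just reuses the baseline. */
static uint32_t sum_accel(const uint8_t *buf, size_t len)
{
	return sum_generic(buf, len);
}

/* Style A: if-else tree, only conditional branches and direct calls. */
static uint32_t sum_branchy(unsigned int features, const uint8_t *buf, size_t len)
{
	if (features & FEAT_ACCEL)
		return sum_accel(buf, len);
	else if (features & FEAT_SIMD)
		return sum_simd(buf, len);
	return sum_generic(buf, len);
}

/* Style B: pick an implementation once; every later use is an indirect call. */
static sum_fn sum_resolve(unsigned int features)
{
	if (features & FEAT_ACCEL)
		return sum_accel;
	if (features & FEAT_SIMD)
		return sum_simd;
	return sum_generic;
}
```

With style B the caller stores the result of sum_resolve() and calls
through the pointer, which is the indirect branch that retpoline makes
expensive. Style A re-tests the feature bits on every call, which is the
part that instruction rewriting would turn into an unconditional branch.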

> The code to do the above is disgusting, and when you look at the
> generated code you see odd unreachable jumps and what looks like a
> slow "bts" instruction that does the testing dynamically.
> 
> And then the kernel instruction stream gets rewritten fairly early
> during the boot depending on the actual CPU capabilities, and the
> dynamic tests get overwritten by a direct jump.
> 
> Admittedly I don't think the arm64 people go to quite those lengths,
> but it certainly wouldn't be impossible there either.  It just takes a
> bit of architecture knowledge and a strong stomach ;)
> 
>                  Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com