Re: [PATCH v2 06/11] tpm: Add full HMAC and encrypt/decrypt session handling code

Ard Biesheuvel <ardb@xxxxxxxxxx> · Fri, 17 Feb 2023 09:49:51 +0100

On Thu, 16 Feb 2023 at 15:52, James Bottomley
<James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, 2023-02-14 at 15:36 +0100, Ard Biesheuvel wrote:
> > On Tue, 14 Feb 2023 at 15:28, James Bottomley
> > <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > On Tue, 2023-02-14 at 14:54 +0100, Ard Biesheuvel wrote:
> [...]
> > > >
> > > > Can we avoid shashes and sync skciphers at all? We have sha256
> > > > and AES library routines these days, and AES in CFB mode seems
> > > > like a good candidate for a library implementation as well - it
> > > > uses AES encryption only, and is quite straightforward to
> > > > implement. [0]
> > >
> > > Yes, sure.  I originally suggested something like this way back
> > > four years ago, but it got overruled on the grounds that if I
> > > didn't use shashes and skciphers some architectures would be unable
> > > to use crypto acceleration.  If that's no longer a consideration,
> > > I'm all for simplification of static cipher types.
> > >
>
> I now have this all implemented, and I looked over your code, so you
> can add my tested/reviewed-by to the aescfb implementation.  On the
> acceleration issue, I'm happy to ignore external accelerators because
> they're a huge pain for small fragments of encryption like the TPM, but
> it would be nice if we could integrate CPU instruction acceleration
> (like AES-NI on x86) into the library functions.
>

Agreed that async crypto makes no sense here, and it is rather
unfortunate that even use cases such as this one require scatterlist
handling, which in turn requires direct-mapped memory, etc.

As for the accelerated algos: it wouldn't be too complicated to build
the CFB library interface around the 'AES' crypto cipher, which is
synchronous and operates on virtual addresses directly. But it should
only use implementations that are constant time (like AES-NI), not
generic AES or the asm accelerated ones, so this would require an
additional annotation (or an allowlist), which makes things a bit
clunky.
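
To make the shape of such a wrapper concrete, here is a rough sketch of
CFB built on top of a single-block encrypt hook. Everything named here
(blk_enc_fn, toy_encrypt, cfb_crypt) is made up for illustration and is
not an existing kernel interface; toy_encrypt is a throwaway stand-in so
the sketch runs, and is neither AES nor secure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16	/* AES block size in bytes */

/* Hypothetical hook: encrypt one block. A real user would plug in a
 * constant-time AES implementation here. */
typedef void (*blk_enc_fn)(const void *key, uint8_t out[BLK],
			   const uint8_t in[BLK]);

/* Toy stand-in, NOT AES and NOT secure: only here so the sketch runs. */
static void toy_encrypt(const void *key, uint8_t out[BLK],
			const uint8_t in[BLK])
{
	const uint8_t *k = key;
	uint8_t acc = 0x5a;

	for (int i = 0; i < BLK; i++) {
		acc = (uint8_t)(acc * 31 + (in[i] ^ k[i]));
		out[i] = acc;
	}
}

/* CFB in both directions uses only the block *encrypt* primitive:
 * keystream block N is E(ciphertext block N-1), block 0 uses the IV.
 * dst and src may be the same buffer. */
static void cfb_crypt(blk_enc_fn enc, const void *key, uint8_t *dst,
		      const uint8_t *src, size_t len,
		      const uint8_t iv[BLK], int decrypt)
{
	uint8_t fb[BLK], ks[BLK];

	assert(enc && key && iv);
	memcpy(fb, iv, BLK);

	while (len > 0) {
		size_t n = len < BLK ? len : BLK;

		enc(key, ks, fb);
		for (size_t i = 0; i < n; i++) {
			uint8_t s = src[i];
			uint8_t x = (uint8_t)(s ^ ks[i]);

			/* The feedback is always the ciphertext byte. */
			fb[i] = decrypt ? s : x;
			dst[i] = x;
		}
		dst += n;
		src += n;
		len -= n;
	}
}
```

The point of the hook is that the mode itself does not care which
implementation sits behind it, so restricting it to constant-time ones
is purely a question of what is allowed to be plugged in.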

However, doing the math on the back of an envelope: taking arm64 as an
example, which manages ~1 cycle per byte for AES instructions and 25
cycles per byte for AES encryption using this library, processing 1k
of data takes an additional 24k cycles, which comes down to 10
microseconds on a 2.4 GHz CPU.
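
Spelled out, assuming "1k" means 1024 bytes and using the rough
cycles-per-byte figures above (the helper names are only for
illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Extra cycles spent when a cipher costs slow_cpb rather than fast_cpb
 * cycles per byte, over len bytes of data. */
static uint64_t extra_cycles(uint64_t len, uint64_t slow_cpb,
			     uint64_t fast_cpb)
{
	assert(slow_cpb >= fast_cpb);	/* avoid unsigned wrap-around */
	return len * (slow_cpb - fast_cpb);
}

/* Convert a cycle count to nanoseconds at a clock rate given in MHz. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t mhz)
{
	return cycles * 1000 / mhz;
}
```

With len = 1024, 25 vs 1 cycles per byte and a 2400 MHz clock, this
gives 24576 extra cycles, or 10240 ns.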

Given that this particular use case is about communicating with off
chip discrete components, I wonder whether spending 10 microseconds
more is going to have a noticeable impact.

> I also got a test rig to investigate arc.  It seems there is a huge
> problem with the SKCIPHER stack structure on that platform.  For
> reasons I still can't fathom, the compiler thinks it needs at least
> 0.5k of stack for this one structure.  I'm sure it's something to do
> with an incorrect crypto alignment on arc, but I can't yet find the
> root cause.
>

Maybe SKCIPHER_ON_STACK() needs the same treatment as commit
660d2062190db131d2feaf19914e90f868fe285c?

The catch here is that if we reduce the alignment of the buffer, the
req pointer will not have the alignment of the typedef, and so we will
be lying to the compiler.

This is all a result of the way we abuse alignment to pad out data
fields that may be used for inbound non-coherent DMA, and this is
something that is being fixed at the moment.
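
A minimal illustration of the effect, assuming GCC/Clang attribute
syntax. The struct names and the 128 byte DMA_ALIGN value are
hypothetical and only stand in for the real request types and the
architecture's DMA alignment:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DMA alignment; the real value is architecture specific. */
#define DMA_ALIGN 128

/* Over-aligning the trailing context pads it away from the fields in
 * front of it, but the alignment propagates to the whole struct. */
struct padded_req {
	void *tfm;
	unsigned char ctx[] __attribute__((aligned(DMA_ALIGN)));
};

/* The same layout without the alignment annotation. */
struct plain_req {
	void *tfm;
	unsigned char ctx[];
};

/* One pointer of payload has grown into a full DMA_ALIGN sized header. */
static_assert(sizeof(struct padded_req) == DMA_ALIGN,
	      "header is padded out to the DMA alignment");

/* Returns nonzero if p really has the alignment that the type of
 * struct padded_req promises to the compiler. */
static int req_ptr_is_honest(const void *p)
{
	return (uintptr_t)p % _Alignof(struct padded_req) == 0;
}
```

An on-stack buffer with a smaller alignment can still be cast to
struct padded_req *, but req_ptr_is_honest() may then return 0 for it,
which is the "lying to the compiler" case.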

> > I don't know if that is a consideration or not. The AES library code
> > is generic C code that was written to be constant-time, rather than
> > fast. The fact that CFB only uses the encryption side of it is
> > fortunate, because decryption is even slower.
>
> I think for the TPM, since the encryption isn't exactly bulk (it's
> really under 1k for command and response encryption) it doesn't matter
> ... in fact setting up the accelerator is likely a bigger overhead.
>
> > So the question is whether this will actually be a bottleneck in this
> > particular scenario. The synchronous accelerated AES implementations
> > are all SIMD based, which means there is some overhead, and some
> > degree of parallelism is also needed to take full advantage, and CFB
> > only allows this for decryption to begin with, as encryption uses
> > ciphertext block N-1 as AES input for encrypting block N.
> >
> > So maybe this is terrible advice, but the code will look so much
> > better for it, and we can always add back the performance later if it
> > is really an impediment.
>
> It's definitely smaller and neater, yes.  I'll post a v3 based on this,
> but when might it go upstream?  In my post I'll put your aescfb as
> patch 1 so the static checkers don't go haywire about missing function
> exports, and we can drop that patch when it is upstream.
>

I'll add some test cases and send it to the list.