Re: [PATCH v3 0/4] crypto: AF_ALG AIO improvements

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Fri, 2 Mar 2018 11:33:39 +0000

On Tue, 27 Feb 2018 15:08:58 +0100
Stephan Müller <smueller@xxxxxxxxxx> wrote:

> Am Freitag, 23. Februar 2018, 13:00:26 CET schrieb Herbert Xu:
> 
> Hi Herbert,

Hi Stephan / Herbert,

> 
> > On Fri, Feb 23, 2018 at 09:33:33AM +0100, Stephan Müller wrote:  
> > > A simple copy operation, however, will imply that in one AIO recvmsg
> > > request, only *one* IOCB can be set and processed.  
> > 
> > Sure, but the recvmsg will return as soon as the crypto API encrypt
> > or decrypt function returns.  It's still fully async.  It's just
> > that the setup part needs to be done with sendmsg/recvmsg.  
> 
> Wouldn't a copy of the ctx->iv into a per-request buffer change the behavoir 
> of the AF_ALG interface significantly?
> 
> Today, if multiple IOCBs are submitted, most cipher implementations would 
> serialize the requests (e.g. all implementations that behave synchronous in 
> nature like all software implementations).
> 
> Thus, when copying the ctx->iv into a separate per-request buffer, suddenly 
> all block-chained cipher operations are not block chained any more.

Agreed - specific handling would be needed to ensure the IV is written to
each copy to maintain the chain.  Not nice at all.

> > 
> > Even if we wanted to do what you stated, just inlining the IV isn't
> > enough.  You'd also need to inline the assoclen, and probably the
> > optype in case you want to mix encrypt/decrypt too.  
> 
> Maybe that is what we have to do.

The one element I could do with more clarity on here is use cases
as it feels like the discussion is a little unfocused (helps with
performance runs, but is it really useful?)

When do we want to have separate IVs per request but a shared key?

I think this is relevant for ctr modes in particular where userspace
can provide the relevant ctrs but the key is shared.
Storage encryption modes such as XTS can also benefit.

My own knowledge is too abstract to give good answers to these.

> > 
> > However, I must say that I don't see the point of going all the way
> > to support such a bulk submission interface (e.g., lio_listio).  
> 
> IMHO, the point is that AF_ALG is the only interface to allow userspace to 
> utilize hardware crypto implementations. For example, on a small chip with 
> hardware crypto support, your user space code can offload crypto to that 
> hardware to free CPU time.
> 
> How else would somebody access its crypto accelerators?

This is also useful at the high end where we may well be throwing this
bulk submission at a set of crypto units (hidden behind a queue) to
parallelize when possible.

Just because we have lots of cpu power doesn't mean it makes sense
to use it for crypto :)

We 'could' just do it all in userspace via vfio, but there are the usual
disadvantages in that approach in terms of generality etc.

> > 
> > Remember, the algif interface due to its inherent overhead is meant
> > for bulk data.  That is, the processing time for each request is
> > dominated by the actual processing, not the submission process.  
> 
> I see that. And for smaller chips with crypto support, this would be the case 
> IMHO. Especially if we streamline the AF_ALG overhead such that we reduce the 
> number of syscalls and user/kernel space roundtrips.

For larger devices the ability to run large numbers of requests and 'know'
that they don't need to be chained is useful (because they have separate IVs).
This allows you to let the hardware handle them in parallel (either because
the hardware handles dependency tracking, or because we have done it in the
driver.)  Applies just as well for large blocks with lower overhead.

You could do this by opening lots of separate sockets and simply providing
them all with the same key.  However, this assumes the hardware / driver can
handle very large numbers of contexts (ours can though we only implement a
subset of this functionality in the current driver to keep things simple).

If we 'fake' such support in the driver then there is inherent nastiness
around having to let the hardware queues drain before you can change the IV)
and that the overhead of operating such a pool of sockets in your program
isn't significant.  

Managing such a pool of sockets would also be a significant complexity
overhead in complexity of the user space code.

> > 
> > If you're instead processing lots of tiny requests, do NOT use
> > algif because it's not designed for that.  
> 
> The only issue in this case is that it makes the operation slower. 
> > 
> > Therefore spending too much time to optimise the submission overhead
> > seems pointless to me.
> > 
> > Cheers,  
> 
> 
> Ciao
> Stephan
> 
> 
Thanks,

Jonathan