Re: [RFC] AF_ALG AIO and IV

Stephan Mueller <smueller@xxxxxxxxxx> · Mon, 15 Jan 2018 13:07:16 +0100

On Monday, 15 January 2018, 12:05:03 CET, Jonathan Cameron wrote:

Hi Jonathan,

> On Fri, 12 Jan 2018 14:21:15 +0100
> 
> Stephan Mueller <smueller@xxxxxxxxxx> wrote:
> > Hi,
> > 
> > The kernel crypto API requires the caller to set an IV in the request data
> > structure. That request data structure shall define one particular cipher
> > operation. During the cipher operation, the IV is read by the cipher
> > implementation and eventually the potentially updated IV (e.g. in case of
> > CBC) is written back to the memory location the request data structure
> > points to.
> Silly question, are we obliged to always write it back?

Well, in general, yes. The AF_ALG interface should allow a "stream" mode of 
operation:

socket
accept
setsockopt(setkey)
sendmsg(IV, data)
recvmsg(data)
sendmsg(data)
recvmsg(data)
..

For such synchronous operation, I guess it is clear that the IV needs to be 
written back.

If you want to play with it, use the "stream" API of libkcapi and the 
associated test cases.
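
For readers without libkcapi at hand, the same sequence can also be tried with
Python's standard socket module on Linux. The following is only a rough
sketch, assuming a kernel with AF_ALG and cbc(aes) available. The function
name stream_encrypt and the one-block-per-call layout are made up for this
illustration and are not part of libkcapi. The key, IV and plaintext are the
ones from the kcapi command quoted further down.

```python
import socket

KEY = bytes.fromhex(
    "8d7dd9b0170ce0b5f2f8e1aa768e01e91da8bfc67fd486d081b28254c99eb423")
IV = bytes.fromhex("7fbc02ebf5b93322329df9bfccb635af")
BLOCK = bytes.fromhex("48981da18e4bb9ef7e2e3162d16b1910")


def stream_encrypt(key, iv, blocks):
    """Encrypt blocks one call at a time; only the first sendmsg has an IV."""
    out = []
    # socket + bind select the cipher (the user space counterpart of a TFM)
    with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as tfm:
        tfm.bind(("skcipher", "cbc(aes)"))
        tfm.setsockopt(socket.SOL_ALG, socket.ALG_SET_KEY, key)
        op, _ = tfm.accept()
        with op:
            for i, block in enumerate(blocks):
                if i == 0:
                    # sendmsg(IV, data)
                    op.sendmsg_afalg([block], op=socket.ALG_OP_ENCRYPT, iv=iv)
                else:
                    # sendmsg(data): no IV, the socket context keeps it
                    op.sendmsg_afalg([block], op=socket.ALG_OP_ENCRYPT)
                # recvmsg(data)
                out.append(op.recv(len(block)))
    return out
```

Calling stream_encrypt(KEY, IV, [BLOCK, BLOCK]) sends the same plaintext block
twice. If the IV is written back between the calls as described above, the two
returned blocks should differ, because the second call starts from the updated
IV rather than from the original one.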

> In CBC it is
> obviously the same as the last n bytes of the encrypted message.  I guess
> for ease of handling it makes sense to do so though.
> 
> > AF_ALG allows setting the IV with a sendmsg request, where the IV is
> > stored in the AF_ALG context that is unique to one particular AF_ALG
> > socket. Note the analogy: an AF_ALG socket is like a TFM where one
> > recvmsg operation uses one request with the TFM from the socket.
> > 
> > AF_ALG these days supports AIO operations with multiple IOCBs. I.e. with
> > one recvmsg call, multiple IOVECs can be specified. Each individual IOCB
> > (derived from one IOVEC) implies that one request data structure is
> > created with the data to be processed by the cipher implementation. The
> > IV that was set with the sendmsg call is registered with the request data
> > structure before the cipher operation.
> > 
> > In case of an AIO operation, the cipher operation invocation returns
> > immediately, queuing the request to the hardware. While the AIO request is
> > processed by the hardware, recvmsg processes the next IOVEC for which
> > another request is created. Again, the IV buffer from the AF_ALG socket
> > context is registered with the new request and the cipher operation is
> > invoked.
> > 
> > You may now see that there is a potential race condition regarding the IV
> > handling, because there is *no* separate IV buffer for the different
> > requests. This is nicely demonstrated with libkcapi using the following
> > command which creates an AIO request with two IOCBs each encrypting one
> > AES block in CBC mode:
> > 
> > kcapi  -d 2 -x 9  -e -c "cbc(aes)" -k
> > 8d7dd9b0170ce0b5f2f8e1aa768e01e91da8bfc67fd486d081b28254c99eb423 -i
> > 7fbc02ebf5b93322329df9bfccb635af -p 48981da18e4bb9ef7e2e3162d16b1910
> > 
> > When the first AIO request finishes before the 2nd AIO request is
> > processed, the returned value is:
> > 
> > 8b19050f66582cb7f7e4b6c873819b7108afa0eaa7de29bac7d903576b674c32
> > 
> > I.e. two blocks where the IV output from the first request is the IV input
> > to the 2nd block.
> > 
> > In case the first AIO request is not completed before the 2nd request
> > commences, the result is two identical AES blocks (i.e. both use the same
> > IV):
> > 
> > 8b19050f66582cb7f7e4b6c873819b718b19050f66582cb7f7e4b6c873819b71
> > 
> > This inconsistent result may even lead to the conclusion that there can be
> > a memory corruption in the IV buffer if both AIO requests write to the IV
> > buffer at the same time.
> > 
> > This needs to be solved somehow. I see the following options which I would
> > like to have vetted by the community.
> 
> Taking some 'entirely hypothetical' hardware with the following structure
> for all my responses - it's about as flexible as I think we'll see in the
> near future - though I'm sure someone has something more complex out there
> :)
> 
> N hardware queues feeding M processing engines in a scheduler driven
> fashion. Actually we might have P sets of these, but load balancing and
> tracking and transferring contexts between these is a complexity I think we
> can ignore. If you want to use more than one of these P you'll just have to
> handle it yourself in userspace.  Note messages may be shorter than IOCBs
> which raises another question I've been meaning to ask.  Are all crypto
> algorithms obliged to run unlimited length IOCBs?

There are instances where hardware may reject large data chunks. IIRC I have 
seen some limits around 32k. But in this case, the driver must chunk up the 
scatter-gather lists (SGLs) with the data and feed it to the hardware in the 
chunk size necessary.

From the kernel crypto API point of view, the driver must support unlimited 
sized IOCBs / SGLs.
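
To illustrate the chunking idea, here is a small user space model in Python.
It is not driver code. The (offset, length) representation of an SGL, the name
split_sgl and the parameter max_chunk are assumptions made for the sketch, and
the 32k value is just the limit recalled above. A real driver for a chaining
mode would also have to pick a chunk size that is a multiple of the cipher
block size and carry the IV from one chunk to the next.

```python
def split_sgl(sgl, max_chunk):
    """Split (offset, length) segments into chunks of at most max_chunk bytes."""
    chunks, current, room = [], [], max_chunk
    for offset, length in sgl:
        while length > 0:
            take = min(length, room)
            current.append((offset, take))
            offset += take
            length -= take
            room -= take
            if room == 0:
                # Chunk is full: hand it over and start a new one.
                chunks.append(current)
                current, room = [], max_chunk
    if current:
        chunks.append(current)
    return chunks
```

For example, split_sgl([(0, 40000), (100000, 30000)], 32768) yields three
chunks of 32768, 32768 and 4464 bytes, where the second chunk spans the end of
the first segment and the start of the second.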
> 
> If there are M messages in a particular queue and none elsewhere it is
> capable of processing them all at once (and perhaps returning out of order
> but we can fudge them back in order in the driver to avoid that additional
> complexity from an interface point of view).
> 
> So I'm going to look at this from the hardware point of view - you have
> well addressed software management above.
> 
> Three ways context management can be handled (in CBC this is basically just
> the IV).
> 
> 1. Each 'work item' queued on a hardware queue has its IV embedded with the
> data.  This requires external synchronization if we are chaining across
> multiple 'work items' - note the hardware may have restrictions that mean
> it has to split large pieces of data up to encrypt them.  Not all hardware
> may support per 'work item' IVs (I haven't done a survey to find out if
> everyone does...)
> 
> 2. Each queue has a context assigned.  We get a new queue whenever we want
> to have a different context.  Runs out eventually but our hypothetical
> hardware may support a lot of queues.  Note this version could be 'faked'
> by putting a cryptoengine queue on the front of the hardware queues.
> 
> 3. The hardware supports IV dependency tracking in its queues.  That is,
> it can check if the address pointing to the IV is in use by one of the
> processing units which has not yet updated the IV ready for chaining with
> the next message.  Note it might use a magic token rather than the IV
> pointer.  For modes without chaining (including counter modes) the IV
> pointer will inherently always be different.
> The hardware then simply schedules something else until it can safely
> run that particular processing unit.

The kernel crypto API has the following concept:

- a TFM holds the data that is stable for an entire cipher operation, such as 
the key -- one cipher operation may consist of several individual calls

- a request structure holds the volatile data, i.e. the data that is valid for 
one particular call, such as the input plaintext or the IV

Thus, your hardware queue should expect one request and it must be capable of 
handling that one request with the given data. If you want to split up the 
request because you have sufficient hardware resources as you mentioned above, 
your driver/hardware must process the request accordingly.

Coming back to the AF_ALG interface: in order to support the aforementioned 
"stream" mode, the requests for each cipher call invoked by one recvmsg 
syscall point to the same IV buffer.

In case of AIO with multiple IOCBs, user space conceptually calls:

sendmsg
recvmsg
recvmsg
..

where all recvmsg calls execute in parallel. As each recvmsg call has one 
request associated with it, the question is what happens to a buffer that is 
pointed to by multiple request structures in such parallel execution.
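
The effect can be modelled in a few lines of Python. This is a toy model under
loud assumptions: toy_block is a keyed hash standing in for one block cipher
call (it is NOT AES and not invertible), each request handles a single block,
and the "start"/"finish" steps are a simplification of when an implementation
reads and writes the IV. The class and function names are invented for the
sketch.

```python
import hashlib

BLOCK_SIZE = 16


def toy_block(key, block):
    """Stand-in for one block cipher call. NOT AES and not invertible."""
    return hashlib.blake2b(block, key=key, digest_size=BLOCK_SIZE).digest()


class Request:
    """Per-call data. iv_buf is a reference to the shared buffer, not a copy."""

    def __init__(self, iv_buf, plaintext):
        self.iv_buf = iv_buf
        self.plaintext = plaintext
        self.iv_seen = None
        self.ciphertext = None

    def start(self):
        # The cipher implementation reads the IV when processing begins.
        self.iv_seen = bytes(self.iv_buf)

    def finish(self, key):
        xored = bytes(p ^ v for p, v in zip(self.plaintext, self.iv_seen))
        self.ciphertext = toy_block(key, xored)
        # CBC-style write-back: the ciphertext block becomes the next IV.
        self.iv_buf[:] = self.ciphertext


def run(schedule, key, iv, plaintext):
    """Run two single-block requests that share one IV buffer."""
    ctx_iv = bytearray(iv)  # the one IV buffer held in the socket context
    reqs = [Request(ctx_iv, plaintext), Request(ctx_iv, plaintext)]
    for step, idx in schedule:
        if step == "start":
            reqs[idx].start()
        else:
            reqs[idx].finish(key)
    return [r.ciphertext for r in reqs]


SERIALIZED = [("start", 0), ("finish", 0), ("start", 1), ("finish", 1)]
OVERLAPPED = [("start", 0), ("start", 1), ("finish", 0), ("finish", 1)]
```

With the SERIALIZED schedule the second request sees the IV written back by
the first one, so the two output blocks differ. With the OVERLAPPED schedule
both requests read the buffer before either has written to it, so both blocks
come out identical, which matches the two kcapi outputs quoted earlier in
shape (not in value, since this is not AES).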

If your hardware is capable of serializing the recvmsg calls or tracking the 
dependency, the current AF_ALG interface is fully sufficient.

But there may be hardware that cannot/will not track such dependencies. Yet, 
it has multiple hardware queues. Such hardware can still handle parallel 
requests when they are totally independent from each other. For such a case, 
AF_ALG currently has no support, because it lacks the support for setting 
multiple IVs for the multiple concurrent calls.

> > 1. Require that the cipher implementations serialize any AIO requests that
> > have dependencies. I.e. for CBC, requests need to be serialized by the
> > driver. For, say, ECB or XTS no serialization is necessary.
> 
> There is a certain requirement to do this anyway as we may have a streaming
> type situation and we don't want to have to do the chaining in userspace.

Absolutely. If you have proper hardware/driver support, that would be 
beneficial. This would be supported with the current AF_ALG interface.
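
As a rough picture of what such dependency handling amounts to, the following
Python sketch holds back any request whose IV buffer is still owned by an
in-flight request and releases it, in submission order, once that request
completes. It is a bookkeeping model only, not how any particular driver or
hardware does it. The class IvTracker, its methods, and the use of an opaque
iv_key (standing in for the IV buffer address or a token) are assumptions made
for the illustration.

```python
from collections import deque


class IvTracker:
    def __init__(self):
        self.busy = set()       # IV buffers owned by an in-flight request
        self.waiting = deque()  # (req_id, iv_key) held back, in order

    def submit(self, req_id, iv_key):
        """Return True if the request may be dispatched now, else queue it."""
        if iv_key in self.busy:
            self.waiting.append((req_id, iv_key))
            return False
        self.busy.add(iv_key)
        return True

    def complete(self, iv_key):
        """Release iv_key; return ids of requests that may be dispatched now."""
        self.busy.discard(iv_key)
        ready = []
        still_waiting = deque()
        for req_id, key in self.waiting:
            if key not in self.busy:
                # First waiter on a free IV buffer takes ownership of it.
                self.busy.add(key)
                ready.append(req_id)
            else:
                still_waiting.append((req_id, key))
        self.waiting = still_waiting
        return ready
```

Requests that use different IV buffers never wait on each other in this model,
while requests that share one are dispatched strictly one after the other.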

But I guess there are also folks out there who simply want to offer multiple 
hardware queues to allow independent cipher operations to be invoked 
concurrently without any dependency handling. This is currently not supported 
with AF_ALG.
> 
> So we send first X MB block to HW but before it has come back we have more
> data arrive that needs decrypting so we queue that behind it.  The IV
> then needs to be updated automatically (or the code needs to do it on the
> first work item coming back). If you don't have option 3 above, you
> have to do this.  This is what I was planning to implement for our existing
> hardware before you raised this question and I don't think we get around
> it being necessary for performance in any case. Setting up IOMMUs etc is
> costly so we want to be doing everything we can before the IV update is
> ready.
> 
> > 2. Change AF_ALG to require a per-request IV. This could be implemented by
> > moving the IV submission via CMSG from sendmsg to recvmsg. I.e. the recvmsg
> > code path would obtain the IV.
> > 
> > I would tend to favor option 2 as this requires a code change at only one
> > location. If option 2 is considered, I would recommend to still allow
> > setting the IV via sendmsg CMSG (to keep the interface stable). If,
> > however, the caller provides an IV via recvmsg, this takes precedence.
> 
> We definitely want to keep option 1 (which runs on the existing interface
> and does the magic in driver) for those who want it.

Agreed.
> 
> So the only one left is the case 3 above where the hardware is capable
> of doing the dependency tracking.
> 
> We can support that in two ways but one is rather heavyweight in terms of
> resources.
> 
> 1) Whenever we want to allocate a new context we spin up a new socket and
> effectively associate a single IV with that (and its chained updates) much
> like we do in the existing interface.

I would not like that because it is too heavyweight. Moreover, considering the 
kernel crypto API logic, a socket is the user space equivalent of a TFM. I.e. 
for setting an IV, you do not need to re-instantiate a TFM.
> 
> 2) We allow a token based tracking of IVs.  So userspace code maintains
> a counter and tags every message and the initial IV setup with that counter.

I think that with the option I offer in the patch, we have an even more 
lightweight approach.
> 
> As the socket typically belongs to a userspace process tag creation can
> be in userspace and it can ensure it doesn't overlap tags (or it'll get
> the wrong answer).
> 
> Kernel driver can then handle making sure any internal token / addresses
> are correct.  I haven't looked at it in depth but would imagine this one
> would be rather more invasive to support.
> 
> > If there are other options, please allow us to learn about them.
> 
> Glad we are addressing these use cases and that we have AIO support in
> general.  Makes for a better discussion around whether in-kernel support
> for these interfaces is actually as effective as moving to userspace
> drivers...

:-)

I would like to have more code in user space than in kernel space...

Ciao
Stephan