Hi,

On Wed, Jan 31, 2018 at 2:29 PM, Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
> On Tue, 30 Jan 2018 15:51:40 +0000
> Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
>
>> On Tue, 30 Jan 2018 09:27:04 +0100
>> Stephan Müller <smueller@xxxxxxxxxx> wrote:
>
> A few clarifications from me after discussions with Stephan this morning.
>
>> > Hi Harsh,
>> >
>> > may I ask you to try the attached patch on your Chelsio driver? Please
>> > invoke the following command from libkcapi:
>> >
>> > kcapi -d 2 -x 9 -e -c "cbc(aes)" -k
>> > 8d7dd9b0170ce0b5f2f8e1aa768e01e91da8bfc67fd486d081b28254c99eb423 -i
>> > 7fbc02ebf5b93322329df9bfccb635af -p 48981da18e4bb9ef7e2e3162d16b1910
>> >
>> > The result should now be a fully block-chained result.
>> >
>> > Note, the patch goes on top of the inline IV patch.
>> >
>> > Thanks
>> >
>> > ---8<---
>> >
>> > In case of AIO where multiple IOCBs are sent to the kernel and inline IV
>> > is not used, the ctx->iv is used for each IOCB. The ctx->iv is read for
>> > the crypto operation and written back after the crypto operation. Thus,
>> > processing of multiple IOCBs must be serialized.
>> >
>> > As the AF_ALG interface is used by user space, a mutex provides the
>> > serialization which puts the caller to sleep in case a previous IOCB
>> > processing is not yet finished.
>> >
>> > If multiple IOCBs arrive and are all blocked, the mutex's FIFO handling
>> > of waiters ensures that the blocked IOCBs are unblocked in the right
>> > order.
>> >
>> > Signed-off-by: Stephan Mueller <smueller@xxxxxxxxxx>
>
> Firstly, as far as I can tell (not tested it yet) the patch does the
> job and is about the best we can easily do in the AF_ALG code.
>
> I'd suggest that this (or v2 anyway) is stable material as well
> (which, as Stephan observed, will require reordering of the two patches).
>
>> As a reference point, holding outside the kernel (in at least
>> the case of our hardware with 8K CBC) drops us down to around 30%
>> of performance with separate IVs. Putting a queue in the kernel
>> (and hence allowing most setup of descriptors / DMA etc.) gets us
>> up to 50% of raw non-chained performance.
>>
>> So whilst this will work in principle, I suspect anyone who
>> cares about the cost will be looking at adding their own
>> software queuing and dependency handling in the driver. How
>> much that matters will depend on just how quick the hardware
>> is vs. the overheads.
>>
<snip>
>
> The differences are better shown with pictures. To compare the two
> approaches, if we break up the data flow the alternatives are:
>
> 1) Mutex causes queuing in the AF_ALG code
>
> The key thing is that the [BUILD / MAP HW DESCS] step for small packets,
> up to perhaps 32K, is a significant task we can't avoid.
> At 8K it looks like it takes perhaps 20-30% of the time
> (though I only have end-to-end performance numbers so far).
>
> [REQUEST 1 from userspace]
>            |                    [REQUEST 2 from userspace]
>     [AF_ALG/SOCKET]                        |
>            |                        [AF_ALG/SOCKET]
>  NOTHING BLOCKING (lock mut)               |
>            |                        Queued on Mutex
>  [BUILD / MAP HW DESCS]                    |
>            |                               |
>    [Place in HW Queue]                     |
>            |                               |
>        [Process]                           |
>            |                               |
>       [Interrupt]                          |
>            |                               |
> [Respond and update IV]       Nothing Blocking (lock mut)
>            |                  [BUILD / MAP HW DESCS] *AFTER SERIALIZATION*
> Don't care beyond here.                    |
>                                   [Place in HW Queue]
>                                            |
>                                        [Process]
>                                            |
>                                       [Interrupt]
>                                            |
>                                [Respond and update IV]
>                                            |
>                                Don't care beyond here.
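Just to make sure I am reading 1) right: as I understand it, the shape of
Stephan's fix is roughly the below (my own paraphrase for discussion, not
the actual patch; iv_ctx, my_request and do_cipher() are invented stand-ins
for the real skcipher plumbing, and in the real AIO path the unlock would
have to happen from the completion callback rather than inline):

#include <linux/mutex.h>

struct iv_ctx {
	struct mutex ivlock;	/* serializes IOCBs sharing ctx->iv */
	u8 iv[16];		/* chained IV, read then written back */
};

static int handle_one_iocb(struct iv_ctx *ctx, struct my_request *req)
{
	int err;

	/* Sleep until the previous IOCB has written back its output
	 * IV; the patch description relies on FIFO wakeup of the
	 * waiters to keep the IOCBs in order.
	 */
	err = mutex_lock_interruptible(&ctx->ivlock);
	if (err)
		return err;

	err = do_cipher(req, ctx->iv);	/* consumes and updates ctx->iv */

	mutex_unlock(&ctx->ivlock);
	return err;
}

So every IOCB pays the full [BUILD / MAP HW DESCS] cost inside the lock,
which matches the 1) picture above.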
> 2) The queuing approach I did in the driver moves that serialization
> to after the [BUILD / MAP HW DESCS] stage:
>
> [REQUEST 1 from userspace]
>            |                    [REQUEST 2 from userspace]
>     [AF_ALG/SOCKET]                        |
>            |                        [AF_ALG/SOCKET]
>  [BUILD / MAP HW DESCS]                    |
>            |                  [BUILD / MAP HW DESCS] *BEFORE SERIALIZATION*
>  NOTHING BLOCKING                          |
>            |                  IV in flight (enqueue sw queue)
>    [Place in HW Queue]                     |
>            |                               |
>        [Process]                           |
>            |                               |
>       [Interrupt]                          |
>            |                               |
> [Respond and update IV]       Nothing Blocking (dequeue sw queue)
>            |                      [Place in HW Queue]
> Don't care beyond here.                    |
>                                        [Process]
>                                            |
>                                       [Interrupt]
>                                            |
>                                [Respond and update IV]
>                                            |
>                                Don't care beyond here.
>
>> This maintains chaining when needed and gets a good chunk of the lost
>> performance back. I believe (though I have yet to verify it) the
>> remaining lost performance is mostly down to the fact that we can't
>> coalesce interrupts when chaining.
>>
>> Note the driver is deliberately a touch 'simplistic', so there may be
>> other optimization opportunities to get some of the lost performance
>> back, or it may be fundamental due to the fact that the raw processing
>> can't run in parallel.
>
> Quoting a private email from Stephan (at Stephan's suggestion):
>
>> What I however could fathom is that we introduce a flag where a driver
>> implements its own queuing mechanism. In this case, the mutex handling
>> would be disregarded.
>
> Which works well for the sort of optimization I did and for hardware that
> can do IV dependency tracking itself. If hardware dependency tracking were
> available, you would be able to queue up requests with a chained IV
> without taking any special precautions at all. The hardware would
> handle the IV update dependencies.
>
> So in conclusion, Stephan's approach should work and give us a nice
> small patch suitable for stable inclusion.
>
> However, if people know that their setup overhead can be put in parallel
> with previous requests (even when the IV is not yet updated) then they
> will probably want to do something inside their own driver and set the
> flag that Stephan is proposing adding to bypass the mutex approach.

The patches from Stephan look good to me, but I think we can do better as
the long-term approach you are discussing.

Consider that the issue we are dealing with is in no way unique to the
AF_ALG use case; AF_ALG just happens to expose it. The grander issue is,
I believe, that we are missing an API for chained crypto operations, one
that allows submitting multiple concurrent but dependent requests to tfm
providers.

Trying to second-guess whether or not there is a dependency between calls
from the address of the IV is not a clean solution. Why don't we make it
explicit?

For example, a flag that can be set on a tfm could state that the
subsequent series of requests is interdependent. If you need a different
stream, simply allocate another tfm object.

This will let each driver do its best, be it a simple mutex, software
queuing or hardware dependency tracking, as the case may be.

Then, of course, the AF_ALG code (or any other user for that matter) will
not need to handle the interdependence itself, just set the right flag.

Do you think this makes sense?

Thanks,
Gilad

--
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru
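P.S. To make the tfm flag idea a bit more concrete, below is the sort of
usage I am imagining from the caller's side. Everything here is invented
for the sake of discussion (CRYPTO_TFM_REQ_IV_CHAIN does not exist in the
tree today); only crypto_skcipher_set_flags() and
crypto_skcipher_encrypt() are existing APIs:

#include <crypto/skcipher.h>

#define CRYPTO_TFM_REQ_IV_CHAIN	0x00000800	/* hypothetical flag */

static int submit_chained(struct crypto_skcipher *tfm,
			  struct skcipher_request *reqs[], int n)
{
	int i, err;

	/* Declare that the requests which follow form one dependent
	 * stream: each consumes the IV produced by the previous one.
	 * How the provider honours this is its own business: a mutex,
	 * a software queue, or hardware dependency tracking.
	 */
	crypto_skcipher_set_flags(tfm, CRYPTO_TFM_REQ_IV_CHAIN);

	for (i = 0; i < n; i++) {
		err = crypto_skcipher_encrypt(reqs[i]);
		if (err && err != -EINPROGRESS && err != -EBUSY)
			return err;
	}

	return 0;
}

A caller that wants an independent stream simply allocates another tfm
rather than juggling the flag, which keeps the dependency rule attached
to the object it actually applies to.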