> On Sep 26, 2019, at 6:38 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>  - let the caller know what the state size is and allocate the
>    synchronous state in its own data structures
>
>  - let the caller just call a static "decrypt_xyz()" function for xyz
>    decryption.
>
>  - if you end up doing it synchronously, that function just returns
>    "done". No overhead. No extra allocations. No unnecessary stuff. Just
>    do it, using the buffers provided. End of story. Efficient and simple.
>
>  - BUT.
>
>  - any hardware could have registered itself for "I can do xyz", and
>    the decrypt_xyz() function would know about those, and *if* it has a
>    list of accelerators (hopefully sorted by preference etc), it would
>    try to use them. And if they take the job (they might not - maybe
>    their queues are full, maybe they don't have room for new keys at the
>    moment, which might be a separate setup from the queues), the
>    "decrypt_xyz()" function returns a _cookie_ for that job. It's
>    probably a pre-allocated one (the hw accelerator might preallocate a
>    fixed number of in-progress data structures).

To really do this right, I don't think this goes far enough.

Suppose I'm trying to implement send() over a VPN very efficiently.
I want to do, roughly, this:

void __user *buf, etc;

if (crypto api thinks async is good) {
        copy buf to some kernel memory;
        set up a scatterlist;
        do it async with this callback;
} else {
        do the crypto synchronously, from *user* memory, straight to
        kernel memory;
        (or, if that's too complicated, maybe copy in little chunks to a
        little stack buffer.  setting up a scatterlist is a waste of time.)
}

I don't know whether the network code is structured in a way that makes
this easy, and the API would be more complex, but it could be nice and
fast.
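
To make that concrete, here is roughly the shape of interface I read
Linus as describing, as a sketch only -- none of these names exist
anywhere, they're made up for illustration of the "return done or
return a cookie" idea:

/* Hypothetical sketch, kernel-style C; xyz_* names are invented. */

struct xyz_decrypt_state;       /* caller allocates this in its own
                                   structures; size comes from
                                   xyz_state_size() */

enum xyz_status {
        XYZ_DONE,               /* finished synchronously, result in dst */
        XYZ_QUEUED,             /* an accelerator took the job,
                                   *cookie is now valid */
};

typedef void (*xyz_complete_fn)(void *cookie, int err);

size_t xyz_state_size(void);

enum xyz_status decrypt_xyz(struct xyz_decrypt_state *state,
                            const u8 *key, size_t keylen,
                            const u8 *src, u8 *dst, size_t len,
                            xyz_complete_fn complete, void **cookie);

and the caller would do something like:

        switch (decrypt_xyz(state, key, keylen, src, dst, len,
                            my_done_cb, &cookie)) {
        case XYZ_DONE:
                /* no accelerator wanted it: already done, no overhead */
                break;
        case XYZ_QUEUED:
                /* hardware owns the job; my_done_cb() fires when it
                   completes */
                break;
        }

My point above is that the XYZ_DONE path shouldn't even require the
caller to have built a scatterlist or bounced the data through kernel
memory first; the synchronous case should be able to work directly on
whatever buffers the caller already has.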