On Fri, May 03, 2024 at 06:18:32PM +0800, Herbert Xu wrote:
> Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> > For now the API only supports 2-way interleaving, as the usefulness
> > and practicality seem to drop off dramatically after 2. The arm64
> > CPUs I tested don't support more than 2 concurrent SHA-256 hashes.
> > On x86_64, AMD's Zen 4 can do 4 concurrent SHA-256 hashes (at least
> > based on a microbenchmark of the sha256rnds2 instruction), and it's
> > been reported that the highest SHA-256 throughput on Intel processors
> > comes from using AVX512 to compute 16 hashes in parallel. However,
> > higher interleaving factors would involve tradeoffs such as no longer
> > being able to cache the round constants in registers, further
> > increasing the code size (both source and binary), further increasing
> > the amount of state that users need to keep track of, and causing
> > there to be more "leftover" hashes.
>
> I think the lack of extensibility is the biggest problem with this
> API. Now I confess I too have used the magic number 2 in the
> lskcipher patch-set, but there I think at least it was more
> justifiable based on the set of algorithms we currently support.
>
> Here I think the evidence for limiting this to 2 is weak. And the
> amount of work to extend this beyond 2 would mean ripping this API
> out again.
>
> So let's get this right from the start. Rather than shoehorning
> this into shash, how about we add this to ahash instead, where an
> async return is a natural part of the API?
>
> In fact, if we do it there we don't need to make any major changes
> to the API. You could simply add an optional flag to the request
> flags to indicate that more requests will be forthcoming immediately.
>
> The algorithm could then either delay the current request if that
> is supported, or process it immediately as is the case now.

The kernel already had ahash-based multibuffer hashing years ago. It
failed spectacularly, as it was extremely complex, buggy, slow, and
potentially insecure because it mixed requests from different contexts.
Sure, it could have been improved slightly by adding flush support, but
most of the issues would have remained. Synchronous hashing really is
the right model here.

One of the main performance issues we are having with dm-verity and
fs-verity is the scheduling hops associated with the workqueues on
which the dm-verity and fs-verity work runs. If there were another
scheduling hop from the worker task to yet another task to do the
actual hashing, that would be even worse and would defeat the point of
doing multibuffer hashing. With the ahash-based API this would be
difficult to avoid: once an individual request gets submitted and put
on a queue somewhere, it loses the information about the original
submitter, so when it finally gets hashed it might be by another task
(which the original task would then have to wait for). I suppose the
submitter could provide some sort of tag that makes the request go on a
dedicated queue that is eventually processed only by the same task
(which might also be needed for security reasons anyway, due to all the
CPU side channels), but adding tag support to the API and supporting an
arbitrary number of queues would add a lot of complexity.

And then there's the issue of request lengths. With one-at-a-time
submission via 'ahash_request', each request would have its own length.
Having to support multibuffer hashing of different-length requests
would add a massive amount of complexity and edge cases that are
difficult to get correct, as was shown by the old ahash-based code.
This suggests that either the API needs to enforce that all the lengths
are the same, or it needs to provide a clean API (my patch) where the
caller just provides a single length that applies to all the messages.

So the synchronous API really does seem like the right approach,
whereas shoehorning this into the asynchronous hash API would result in
something much more complex that is not actually useful for the
intended use cases.

If you're concerned about the hardcoding to 2x specifically, how about
the following API instead:

    int crypto_shash_finup_mb(struct shash_desc *desc, const u8 *datas[],
                              unsigned int len, u8 *outs[], int num_msgs)

This would allow extension to higher interleaving factors. I do suspect
that anything higher than 2x isn't going to be very practical for
in-kernel use cases, where code size, latency, and per-request memory
usage tend to be very important. Regardless, the API would be able to
support higher factors if they ever became worthwhile.

- Eric
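
[Editor's illustration, not part of the original message: a minimal
caller-side sketch of the finup_mb proposal above. The function and
variable names are invented, and it assumes 'desc' already holds
whatever hash state is common to all the messages.]

    /*
     * Hypothetical caller sketch: hash two equal-length data blocks
     * with one call using the proposed 2-way interleaved finup.
     * hash_two_blocks(), block_a/block_b, and digest_a/digest_b are
     * placeholder names, not existing kernel identifiers.
     */
    static int hash_two_blocks(struct shash_desc *desc,
                               const u8 *block_a, const u8 *block_b,
                               unsigned int block_len,
                               u8 *digest_a, u8 *digest_b)
    {
            const u8 *datas[2] = { block_a, block_b };
            u8 *outs[2] = { digest_a, digest_b };

            /* One 'len' applies to both messages, matching the
             * single-length constraint argued for above. */
            return crypto_shash_finup_mb(desc, datas, block_len, outs, 2);
    }

The single length parameter is what keeps the interface simple: the
caller never has to reason about mismatched message lengths, and the
num_msgs argument is what leaves room for interleaving factors above 2.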