Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

Eric Biggers <ebiggers@xxxxxxxxxx> · Mon, 10 Jun 2024 09:42:58 -0700

Hi Herbert,

On Thu, Jun 06, 2024 at 05:15:16PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2024 at 10:33:15AM +0200, Ard Biesheuvel wrote:
> >
> > Are you suggesting that, e.g., the arm64 sha2 shash implementation
> > that is modified by this series should instead expose both an shash as
> > before, and an ahash built around the same asm code that exposes the
> > multibuffer capability?
> 
> Yes the multi-request handling should be implemented as ahash
> only.
> 

That is not a reasonable way to do it.  It would be much more complex, more
error-prone, and slower than my proposal.  Also your proposal wouldn't actually
bring you much closer to being able to optimize your authenc use case.

First, each algorithm implementation that wants to support your scheme would
need to implement the ahash API alongside shash.  This would be a lot of
duplicated code, and it would need to support all quirks of the ahash API
including scatterlists.

Second, allowing an ahash_request to actually be a list of requests creates an
ambiguity where it will often be unclear when code is supposed to operate on an
ahash_request individually vs. on the whole list.  This is error-prone, as it
invites bugs where a crypto operation is done on only one request of the list.
This is a bad design, especially for a cryptographic API.  We would have
crypto_ahash_*() do some checks to prevent multi-requests from reaching
algorithms that don't support the particular kind of request being made.  But
there will be a lot of cases to consider.

Third, it would be hard to actually implement multibuffer hashing given an
arbitrary list of requests.  For multibuffer hashing to work, the lengths of all
the buffers must be synced up, including the internal buffers in the hash states
as well as every buffer that is produced while walking the scatterlists.  We can
place constraints on what is supported, but those constraints will need to be
clearly specified, and each algorithm will actually need to check for them and
do something sane (error or fallback) when they are not met.  Note that it would
be possible for the messages to get out of sync in the middle of walking the
scatterlists, which would be difficult to handle.  All this would add a lot of
complexity and overhead compared to my proposal, which naturally expresses the
same-length constraints in the API.

Fourth, having the API be ahash-only would also force fsverity to switch back to
the ahash API, which would add complexity and overhead.  The shash API is easier
to use even when working with pages, as the diffstat of commit 8fcd94add6c5
clearly shows.  The ahash API also has more overhead, including in doing the
needed memory allocation, setting up a scatterlist, initializing required but
irrelevant or redundant fields in the ahash_request, and shash_ahash_finup()
running the fully generalized scatterlist walker on the scatterlist just to
finally end up with a pointer which the caller could have provided directly.
All this overhead adds up, even when hashing 4K blocks.  Back when CPU-based
crypto was slow this didn't really matter, but now it's fast and these small
overheads are significant when trying to keep up with storage device speeds
(which are also fast now).  E.g., even disregarding the memory allocation,
hashing a 4K block is about 5% faster with shash than with ahash on x86_64.

Of course, I'd need to support ahash in fsverity anyway if someone were to
actually need support for non-inline hardware offload in fsverity (and if I
decide to accept that despite many hardware drivers being buggy).  But really
the best way to do this would be to support ahash alongside shash, like what I'm
proposing in dm-verity -- perhaps with ahash support limited to the data blocks
only, as that's the most performance critical part.  The ahash code path would
use the async callback to actually properly support offload, which would mean
the code would be substantially different anyway due to the fundamentally
different computational model, especially vs. multibuffer hashing.  A "single"
API, that works for both hardware offload and for the vast majority of users
that simply need very low overhead software crypto, would be nice in theory.
But I'm afraid it's been shown repeatedly that just doesn't work...  The
existence of lib/crypto/, shash, lskcipher, and scomp all reflect this.

I understand that you think the ahash based API would make it easier to add
multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems
to be a very important use case for you (though it isn't relevant to nearly as
many systems as dm-verity and fsverity are).  Regardless, the reality is that it
would be much more difficult to take advantage of multibuffer crypto in the
IPsec authenc use case than in dm-verity and fsverity.  authenc uses multiple
underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use
multibuffer crypto in order to see a significant benefit, seeing as even if the
SHA-256 support could be wired up through HMAC-SHA256, encryption would be
bottlenecked on AES-CBC, especially on Intel CPUs.  It also looks like the IPsec
code would need a lot of updates to support multibuffer crypto.

At the same time, an easier way to optimize "authenc(hmac(sha256),cbc(aes))"
would be to add an implementation of it to arch/x86/crypto/ that interleaves the
AES-CBC and SHA-256 and also avoids the overhead associated with the template
based approach (such as all the extra indirect calls).  Of course, that would
require writing assembly, but so would multibuffer crypto.  It seems unlikely
that someone would step in to do all the work to implement a multibuffer
optimization for this algorithm and its use in IPsec, when no one has ever
bothered to optimize the single-buffer case in the first place, which has been
possible all along and would require no API or IPsec changes...

In any case, any potential multi-request support in ahash, skcipher, or aead
should be separate considerations from the simple function in shash that I'm
proposing.  It makes sense to have the shash support regardless.

Ultimately, I need to have dm-verity and fsverity be properly optimized in the
downstreams that are most relevant to me.  If you're not going to allow the
upstream crypto API to provide the needed functionality in a reasonable way,
then I'll need to shift my focus to getting this patchset into downstream
kernels such as Android and Chrome OS instead.

- Eric