Re: [PATCH v8 0/7] Optimize dm-verity and fsverity using multibuffer hashing

Eric Biggers <ebiggers@xxxxxxxxxx> · Wed, 12 Feb 2025 22:33:04 -0800

On Thu, Feb 13, 2025 at 12:17:42PM +0800, Herbert Xu wrote:
> On Wed, Feb 12, 2025 at 07:47:11AM -0800, Eric Biggers wrote:
> > [ This patchset keeps getting rejected by Herbert, who prefers a
> >   complex, buggy, and slow alternative that shoehorns CPU-based hashing
> >   into the asynchronous hash API which is designed for off-CPU offload:
> >   https://lore.kernel.org/linux-crypto/cover.1730021644.git.herbert@xxxxxxxxxxxxxxxxxxx/
> >   This patchset is a much better way to do it though, and I've already
> >   been maintaining it downstream as it would not be reasonable to go the
> >   asynchronous hash route instead.  Let me know if there are any
> >   objections to me taking this patchset through the fsverity tree, or at
> >   least patches 1-5 as the dm-verity patches could go in separately. ]
> 
> Yes I object.  While I very much like this idea of parallel hashing
> that you're introducing, shoehorning it into shash is restricting
> this to storage-based users.
> 
> Networking is equally able to benefit from paralell hashing, and
> parallel crypto (in particular, AEAD) in general.  In fact, both
> TLS and IPsec can benefit directly from bulk submission instead
> of the current scheme where a single packet is processed at a time.

I've already covered this extensively, but here we go again.  First there are
more users of shash than ahash in the kernel, since shash is much easier to use
and also a bit faster.  There is nothing storage specific about it.  You've
claimed that shash is deprecated, but that reflects a misunderstanding of what
users actually want and need.  Users want simple, fast, easy-to-use APIs.  Not
APIs that are optimized for an obsolete form of hardware offload and have
CPU-based crypto support bolted on as an afterthought.

Second, these days TLS and IPsec usually use AES-GCM, which is inherently
parallelizable so does not benefit from multibuffer crypto.  This is a major
difference between the AEADs and message digest algorithms in common use.  And
it happens that I recently did a lot of work to optimize AES-GCM on x86_64; see
my commits in v6.11 that made AES-GCM 2-3x as fast on VAES-capable CPUs.

So anyone who cares about TLS or IPsec performance should of course be using
AES-GCM, as it's the fastest by far, and it has no need for multibuffer.  But
even for the rare case where someone is still using a legacy algorithm like
"authenc(hmac(sha256),cbc(aes))" for some reason, as I've explained before there
are much lower hanging fruit to optimizing it.  For example x86_64 still has no
implementation of the authenc template, let alone one that interleaves the
encryption with the MAC.  Both could be done today with the current crypto API.
Meanwhile multibuffer crypto would be very hard to apply to that use case (much
harder than the cases where I've applied it) and would not be very effective,
for reasons such as the complexity of that combination of algorithms vs. just
SHA-256.  Again, see
https://lore.kernel.org/linux-crypto/20240605191410.GB1222@sol.localdomain/,
https://lore.kernel.org/linux-crypto/20240606052801.GA324380@sol.localdomain/,
https://lore.kernel.org/linux-crypto/20240610164258.GA3269@sol.localdomain/,
https://lore.kernel.org/linux-crypto/20240611203209.GB128642@sol.localdomain/,
https://lore.kernel.org/linux-crypto/20240611201858.GA128642@sol.localdomain/
where I've already explained this in detail.

You've drawn an analogy to TSO and claimed that submitting multiple requests to
the crypto API at once would significantly improve performance even without
support from the underlying algorithm.  But that is incorrect.  TSO saves a
significant amount of time due to how the networking stack works.  In contrast,
the equivalent in the crypto API would do very little.  It would at best save
one indirect call per message, at the cost of adding the overhead of multi
request support.  Even assuming it was beneficial at all, it would be a very
minor optimization, and not worth it over other optimization opportunities that
would not require making complex and error-prone extensions to the crypto API.

I remain quite puzzled by your position here, as it makes no sense.  TBH, I
think your opinions would be more informed if you had more experience with
actually implementing and optimizing the various crypto algorithms.

> But thanks for the reminder and I will be posting my patches soon.

I am not really interested in using your patches, sorry.  They just seem like
really poor engineering and a continuation of many of the worst practices of the
kernel crypto API that we *know* are not working.  Especially for cryptography
code, we need to do better.  (And I've even already done it!)

- Eric