Re: [PATCH 1/2] crypto: s(h)aving 40+ bytes off arch/x86/crypto/sha256_ni_asm.S

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Apr 09, 2024 at 12:23:13PM +0200, Stefan Kanthak wrote:
> "Eric Biggers" <ebiggers@xxxxxxxxxx> wrote:
> 
> > [+Cc linux-crypto]
> > 
> > Please use reply-all so that the list gets included.
> > 
> > On Mon, Apr 08, 2024 at 04:15:32PM +0200, Stefan Kanthak wrote:
> >> Hi Eric,
> >> 
> >> > On Mon, Apr 08, 2024 at 11:26:52AM +0200, Stefan Kanthak wrote:
> >> >> Use shorter SSE2 instructions instead of some SSE4.1
> >> >> use short displacements into K256
> >> >> 
> >> >> --- -/arch/x86/crypto/sha256_ni_asm.S
> >> >> +++ +/arch/x86/crypto/sha256_ni_asm.S
> >> > 
> >> > Thanks!  I'd like to benchmark this to see how it affects performance,
> >> 
> >> Performance is NOT affected: if CPUs weren't superscalar, the patch might
> >> save 2 to 4 processor cycles as it replaces palignr/pblendw (slow) with
> >> punpck*qdq (fast and shorter)
> >> 
> >> > but unfortunately this patch doesn't apply.  It looks your email client
> >> > corrupted your patch by replacing tabs with spaces.  Can you please use
> >> > 'git send-email' to send patches?
> >> 
> >> I don't use git at all; I'll use cURL instead.
> 
> [...]
> 
> >> > Please make sure to run the crypto self-tests too.
> >> 
> >> I can't, I don't use Linux at all; I just noticed that this function uses
> >> 4-byte displacements and palignr/pblendw instead of punpck?qdq after pshufd 
> >> 
> >> > The above is storing the two halves of the state in the wrong order.
> >> 
> >> ARGH, you are right; I recognized it too, wanted to correct it, but was
> >> interrupted and forgot it after returning to the patch. Sorry.
> > 
> > I'm afraid that if you don't submit a probably formatted and tested patch, your
> > patch can't be accepted.  We can treat it as a suggestion, though since you're
> > sending actual code it would really help if it had your Signed-off-by.
> 
> Treat is as suggestion.

All right.  I'll send out a properly formatted and tested patch then.  I'd also
like to convert the SHA-256 rounds to use macros, which would make the source
150 lines shorter (without changing the output).  I'll probably do that first.

> I but wonder that in the past 9 years since Tim Chen submitted the SHA-NI code
> (which was copied umpteen times by others and included in their own code bases)
> nobody noticed/questioned (or if so, bothered to submit a patch like mine, that
> reduces the code size by 5%, upstream) why he used 16x "pshufd $14, %xmm0, %xmm0"
> instead of the 1 byte shorter "punpckhqdq %xmm0, %xmm0" or "psrldq $8, %xmm0"
> (which both MAY execute on more ports or faster than the shuffle instructions,
> depending on the micro-architecture), why he used 8x a 4-byte instead of a 1-byte
> displacement, or why he used "palignr/pblendw" instead of the shorter "punpck?qdq".

Not many people work on crypto assembly code, and x86 SIMD is especially
difficult because there are often multiple ways to do things that differ in
subtle ways such as instruction length and the execution ports used on different
models of CPU.  I think your suggestions are good, so thanks for them.

> 
> regards
> Stefan
> 
> PS: aaaahhhh, you picked my suggestion up and applied it to the AES routine.

Yes, I realized that a similar optimization can apply to AES round keys, as they
can be indexed them from -112 through 112 instead of 0 through 224.

- Eric




[Index of Archives]     [Kernel]     [Gnu Classpath]     [Gnu Crypto]     [DM Crypt]     [Netfilter]     [Bugtraq]
  Powered by Linux