Re: [PATCH 1/2] crypto: s(h)aving 40+ bytes off arch/x86/crypto/sha256_ni_asm.S

"Stefan Kanthak" <stefan.kanthak@xxxxxxxx> · Tue, 9 Apr 2024 12:23:13 +0200

"Eric Biggers" <ebiggers@xxxxxxxxxx> wrote:

> [+Cc linux-crypto]
> 
> Please use reply-all so that the list gets included.
> 
> On Mon, Apr 08, 2024 at 04:15:32PM +0200, Stefan Kanthak wrote:
>> Hi Eric,
>> 
>> > On Mon, Apr 08, 2024 at 11:26:52AM +0200, Stefan Kanthak wrote:
>> >> Use shorter SSE2 instructions instead of some SSE4.1
>> >> use short displacements into K256
>> >> 
>> >> --- -/arch/x86/crypto/sha256_ni_asm.S
>> >> +++ +/arch/x86/crypto/sha256_ni_asm.S
>> > 
>> > Thanks!  I'd like to benchmark this to see how it affects performance,
>> 
>> Performance is NOT affected: if CPUs weren't superscalar, the patch might
>> save 2 to 4 processor cycles as it replaces palignr/pblendw (slow) with
>> punpck*qdq (fast and shorter)
>> 
>> > but unfortunately this patch doesn't apply.  It looks your email client
>> > corrupted your patch by replacing tabs with spaces.  Can you please use
>> > 'git send-email' to send patches?
>> 
>> I don't use git at all; I'll use cURL instead.

[...]

>> > Please make sure to run the crypto self-tests too.
>> 
>> I can't, I don't use Linux at all; I just noticed that this function uses
>> 4-byte displacements and palignr/pblendw instead of punpck?qdq after pshufd 
>> 
>> > The above is storing the two halves of the state in the wrong order.
>> 
>> ARGH, you are right; I recognized it too, wanted to correct it, but was
>> interrupted and forgot it after returning to the patch. Sorry.
> 
> I'm afraid that if you don't submit a probably formatted and tested patch, your
> patch can't be accepted.  We can treat it as a suggestion, though since you're
> sending actual code it would really help if it had your Signed-off-by.

Treat is as suggestion.
I but wonder that in the past 9 years since Tim Chen submitted the SHA-NI code
(which was copied umpteen times by others and included in their own code bases)
nobody noticed/questioned (or if so, bothered to submit a patch like mine, that
reduces the code size by 5%, upstream) why he used 16x "pshufd $14, %xmm0, %xmm0"
instead of the 1 byte shorter "punpckhqdq %xmm0, %xmm0" or "psrldq $8, %xmm0"
(which both MAY execute on more ports or faster than the shuffle instructions,
depending on the micro-architecture), why he used 8x a 4-byte instead of a 1-byte
displacement, or why he used "palignr/pblendw" instead of the shorter "punpck?qdq".

regards
Stefan

PS: aaaahhhh, you picked my suggestion up and applied it to the AES routine.