"Eric Biggers" <ebiggers@xxxxxxxxxx> wrote: > [+Cc linux-crypto] > > Please use reply-all so that the list gets included. > > On Mon, Apr 08, 2024 at 04:15:32PM +0200, Stefan Kanthak wrote: >> Hi Eric, >> >> > On Mon, Apr 08, 2024 at 11:26:52AM +0200, Stefan Kanthak wrote: >> >> Use shorter SSE2 instructions instead of some SSE4.1 >> >> use short displacements into K256 >> >> >> >> --- -/arch/x86/crypto/sha256_ni_asm.S >> >> +++ +/arch/x86/crypto/sha256_ni_asm.S >> > >> > Thanks! I'd like to benchmark this to see how it affects performance, >> >> Performance is NOT affected: if CPUs weren't superscalar, the patch might >> save 2 to 4 processor cycles as it replaces palignr/pblendw (slow) with >> punpck*qdq (fast and shorter) >> >> > but unfortunately this patch doesn't apply. It looks your email client >> > corrupted your patch by replacing tabs with spaces. Can you please use >> > 'git send-email' to send patches? >> >> I don't use git at all; I'll use cURL instead. [...] >> > Please make sure to run the crypto self-tests too. >> >> I can't, I don't use Linux at all; I just noticed that this function uses >> 4-byte displacements and palignr/pblendw instead of punpck?qdq after pshufd >> >> > The above is storing the two halves of the state in the wrong order. >> >> ARGH, you are right; I recognized it too, wanted to correct it, but was >> interrupted and forgot it after returning to the patch. Sorry. > > I'm afraid that if you don't submit a probably formatted and tested patch, your > patch can't be accepted. We can treat it as a suggestion, though since you're > sending actual code it would really help if it had your Signed-off-by. Treat is as suggestion. 

I do wonder, though, why in the 9 years since Tim Chen submitted the
SHA-NI code (which has been copied umpteen times by others into their own
code bases) nobody noticed or questioned (or, if they did, bothered to
submit upstream a patch like mine, which reduces the code size by 5%)
  * why he used 16x "pshufd $14, %xmm0, %xmm0" instead of the 1 byte
    shorter "punpckhqdq %xmm0, %xmm0" or "psrldq $8, %xmm0" (both of which
    MAY execute on more ports, or faster, than the shuffle instructions,
    depending on the micro-architecture),
  * why he used 8x a 4-byte instead of a 1-byte displacement, or
  * why he used "palignr/pblendw" instead of the shorter "punpck?qdq".

regards
Stefan

PS: aaaahhhh, you picked my suggestion up and applied it to the AES routine.
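
For readers who want to check the first point above, here is a minimal sketch in C with SSE2 intrinsics, assuming an x86-64 compiler. It is not taken from the patch or from sha256_ni_asm.S, and all helper names (make_xmm, low64, high_via_*, check_forms_agree) are hypothetical. It shows that the three forms leave the same value in the low 64 bits of the register, which is the half that sha256rnds2 reads from %xmm0. The upper halves differ, so the substitution is only safe where the upper half is not used afterwards. Note that the compiler is free to emit a different instruction than the one named in each comment.

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2 intrinsics; baseline on x86-64 */
#include <stdint.h>

/* Build an XMM value from its low and high 64-bit halves. */
static __m128i make_xmm(uint64_t lo, uint64_t hi)
{
	uint64_t in[2] = { lo, hi };

	return _mm_loadu_si128((const __m128i *)in);
}

/* Extract the low 64 bits of an XMM value. */
static uint64_t low64(__m128i v)
{
	uint64_t out[2];

	_mm_storeu_si128((__m128i *)out, v);
	return out[0];
}

/* Semantics of: pshufd $14, %xmm0, %xmm0 (dwords 2,3 move to dwords 0,1) */
static uint64_t high_via_pshufd(__m128i v)
{
	return low64(_mm_shuffle_epi32(v, 14));
}

/* Semantics of: punpckhqdq %xmm0, %xmm0 (high qword copied to both halves) */
static uint64_t high_via_punpckhqdq(__m128i v)
{
	return low64(_mm_unpackhi_epi64(v, v));
}

/* Semantics of: psrldq $8, %xmm0 (byte shift right by 8, zero fill) */
static uint64_t high_via_psrldq(__m128i v)
{
	return low64(_mm_srli_si128(v, 8));
}

/* Assert that all three forms move `hi` into the low half. */
static void check_forms_agree(uint64_t lo, uint64_t hi)
{
	__m128i v = make_xmm(lo, hi);

	assert(high_via_pshufd(v) == hi);
	assert(high_via_punpckhqdq(v) == hi);
	assert(high_via_psrldq(v) == hi);
}
```

Calling check_forms_agree() with arbitrary halves, e.g. check_forms_agree(1, 2), should pass silently on an SSE2-capable build.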