On 29 March 2015 at 10:29, Markus Stockhausen <stockhausen@xxxxxxxxxxx> wrote:
>> From: linux-crypto-owner@xxxxxxxxxxxxxxx [linux-crypto-owner@xxxxxxxxxxxxxxx] on behalf of Ard Biesheuvel [ard.biesheuvel@xxxxxxxxxx]
>> Sent: Saturday, 28 March 2015 23:10
>> To: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx; samitolvanen@xxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx; jussi.kivilinna@xxxxxx
>> Cc: Ard Biesheuvel
>> Subject: [RFC PATCH 1/6] crypto: sha512: implement base layer for SHA-512
>>
>> To reduce the number of copies of boilerplate code throughout
>> the tree, this patch implements generic glue for the SHA-512
>> algorithm. This allows a specific arch or hardware implementation
>> to only implement the special handling that it needs.
>
> Hi Ard,
>
> Implementing a common layer is a very good idea - I didn't like having to
> implement the glue code once again for some recently developed
> PPC crypto modules. From my admittedly short crypto experience I was
> surprised that my optimized implementations degraded disproportionately
> for small calculations in the <=256-byte update scenarios compared to
> some very old basic implementations. Below you will find some hints
> that might fit your implementation too, so that all new implementations
> based on your framework could benefit immediately.
>

Thanks for taking a look!

>> ...
>> +int sha384_base_init(struct shash_desc *desc)
>> +{
>> +	struct sha512_state *sctx = shash_desc_ctx(desc);
>> +
>> +	*sctx = (struct sha512_state){
>> +		.state = {
>> +			SHA384_H0, SHA384_H1, SHA384_H2, SHA384_H3,
>> +			SHA384_H4, SHA384_H5, SHA384_H6, SHA384_H7,
>> +		}
>> +	};
>> +	return 0;
>> +}
>
> IIRC the above code will initialize the whole context, including the 64/128
> byte buffer. Direct assignment of the 8 hash words was faster in my case.
>

Ah, I missed that. I will change it.

>> ...
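To make the buffer-zeroing point concrete: the compound-literal assignment `*sctx = (struct sha512_state){ ... }` value-initializes every member, including the 128-byte `buf`, while only `state` and `count` actually need setting. A minimal userspace sketch of the direct-assignment variant follows; the struct layout and the SHA-384 initial hash values (from FIPS 180-4) are reproduced here as stand-ins for illustration, not the kernel code itself:

```c
#include <stdint.h>

#define SHA512_BLOCK_SIZE 128

/* simplified stand-in for the kernel's struct sha512_state */
struct sha512_state {
	uint64_t state[8];
	uint64_t count[2];
	uint8_t buf[SHA512_BLOCK_SIZE];
};

/* SHA-384 initial hash values, per FIPS 180-4 */
static const uint64_t sha384_iv[8] = {
	0xcbbb9d5dc1059ed8ULL, 0x629a292a367cd507ULL,
	0x9159015a3070dd17ULL, 0x152fecd8f70e5939ULL,
	0x67332667ffc00b31ULL, 0x8eb44a8768581511ULL,
	0xdb0c2e0d64f98fa7ULL, 0x47b5481dbefa4fa4ULL,
};

/* Assign only what must be initialized: the eight hash words and the
 * byte counters. The 128-byte buf is deliberately left untouched,
 * avoiding the implicit memset that a whole-struct compound-literal
 * assignment performs. */
static int sha384_base_init(struct sha512_state *sctx)
{
	int i;

	for (i = 0; i < 8; i++)
		sctx->state[i] = sha384_iv[i];
	sctx->count[0] = sctx->count[1] = 0;
	return 0;
}
```

Whether the compiler can already elide the dead stores of the literal assignment depends on the target and optimization level, which is presumably why the difference showed up on PPC at all.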
>> +int sha512_base_do_update(struct shash_desc *desc, const u8 *data,
>> +			  unsigned int len, sha512_block_fn *block_fn, void *p)
>> +{
>> +	struct sha512_state *sctx = shash_desc_ctx(desc);
>> +	unsigned int partial = sctx->count[0] % SHA512_BLOCK_SIZE;
>> +
>> +	sctx->count[0] += len;
>> +	if (sctx->count[0] < len)
>> +		sctx->count[1]++;
>
> You should check whether an early kickout at this point, when the buffer
> won't be filled up, is faster than first taking care of big data. That can
> improve performance for small blocks while large blocks should be unaffected.
>
>> +
>> +	if ((partial + len) >= SHA512_BLOCK_SIZE) {

Isn't this the early kickout? The if is only entered if there is enough data
to run the block function; otherwise it is a straight memcpy. I could add an
unlikely() here to favor the small data case.

>> +		int blocks;
>> +
>> +		if (partial) {
>> +			int p = SHA512_BLOCK_SIZE - partial;
>> +
>> +			memcpy(sctx->buf + partial, data, p);
>> +			data += p;
>> +			len -= p;
>> +		}
>> +
>> +		blocks = len / SHA512_BLOCK_SIZE;
>> +		len %= SHA512_BLOCK_SIZE;
>> +
>> +		block_fn(blocks, data, sctx->state,
>> +			 partial ? sctx->buf : NULL, p);
>> +		data += blocks * SHA512_BLOCK_SIZE;
>> +		partial = 0;
>> +	}
>> +	if (len)
>> +		memcpy(sctx->buf + partial, data, len);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(sha512_base_do_update);
>> +
>> +int sha512_base_do_finalize(struct shash_desc *desc, sha512_block_fn *block_fn,
>> +			    void *p)
>> +{
>> +	static const u8 padding[SHA512_BLOCK_SIZE] = { 0x80, };
>> +
>> +	struct sha512_state *sctx = shash_desc_ctx(desc);
>> +	unsigned int padlen;
>> +	__be64 bits[2];
>> +
>> +	padlen = SHA512_BLOCK_SIZE -
>> +		 (sctx->count[0] + sizeof(bits)) % SHA512_BLOCK_SIZE;
>> +
>> +	bits[0] = cpu_to_be64(sctx->count[1] << 3 |
>> +			      sctx->count[0] >> 61);
>> +	bits[1] = cpu_to_be64(sctx->count[0] << 3);
>> +
>> +	sha512_base_do_update(desc, padding, padlen, block_fn, p);
>
> I know that this is the most intuitive and straightforward implementation for
> handling finalization. Nevertheless, the maybe slightly obscure generic md5
> algorithm gives best-in-class performance for hash finalization of small
> input data.
>

Well, memcpy'ing a buffer consisting almost entirely of zeroes doesn't quite
feel right, indeed. I will instead follow the md5 suggestion.

> For comparison: from the raw numbers, the sha1-ppc-spe assembler module
> written by me is only 10% faster than the old sha1-powerpc assembler module.
> Both are simple assembler algorithms without hardware acceleration. For large
> blocks I gain another 8% by avoiding function calls, because the core module
> may process several blocks. But for small single-block updates the above glue
> code optimizations gave:
>
> 16-byte block single update:  +24%
> 64-byte block single update:  +16%
> 256-byte block single update: +12%
>
> Considering CPU-assisted SHA calculations, that percentage may be even
> higher.
>
> Maybe worth the effort ...
>

Absolutely!
Thanks again.