On 29 March 2015 at 10:29, Markus Stockhausen <stockhausen@xxxxxxxxxxx> wrote:
>> From: linux-crypto-owner@xxxxxxxxxxxxxxx [linux-crypto-owner@xxxxxxxxxxxxxxx] on behalf of Ard Biesheuvel [ard.biesheuvel@xxxxxxxxxx]
>> Sent: Saturday, 28 March 2015 23:10
>> To: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx; samitolvanen@xxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx; jussi.kivilinna@xxxxxx
>> Cc: Ard Biesheuvel
>> Subject: [RFC PATCH 1/6] crypto: sha512: implement base layer for SHA-512
>>
>> To reduce the number of copies of boilerplate code throughout
>> the tree, this patch implements generic glue for the SHA-512
>> algorithm. This allows a specific arch or hardware implementation
>> to only implement the special handling that it needs.
>
> Hi Ard,
>
> Implementing a common layer is a very good idea - I didn't like having to
> implement the glue code once again for some recently developed
> PPC crypto modules. From my admittedly short crypto experience I was
> surprised that my optimized implementations degraded disproportionately
> for small calculations in the <=256-byte update scenarios compared to
> some very old basic implementations. Below you will find some hints
> that might fit your implementation too, so that all new implementations
> based on your framework could benefit immediately.
>

Thanks for taking a look!

>> ...
>> +int sha384_base_init(struct shash_desc *desc)
>> +{
>> +	struct sha512_state *sctx = shash_desc_ctx(desc);
>> +
>> +	*sctx = (struct sha512_state){
>> +		.state = {
>> +			SHA384_H0, SHA384_H1, SHA384_H2, SHA384_H3,
>> +			SHA384_H4, SHA384_H5, SHA384_H6, SHA384_H7,
>> +		}
>> +	};
>> +	return 0;
>> +}
>
> IIRC the above code will initialize the whole context, including the 64/128
> byte buffer. Direct assignment of the 8 hash words was faster in my case.
>

Ah, I missed that. I will change it.

>> ...
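To make the buffer-zeroing point concrete: the compound-literal assignment `*sctx = (struct sha512_state){ ... }` value-initializes every member, including the 128-byte `buf`, while only `state` and `count` actually need setting. A minimal userspace sketch of the direct-assignment variant follows; the struct layout and the SHA-384 initial hash values (from FIPS 180-4) are reproduced here as stand-ins for illustration, not the kernel code itself:

```c
#include <stdint.h>

#define SHA512_BLOCK_SIZE 128

/* simplified stand-in for the kernel's struct sha512_state */
struct sha512_state {
	uint64_t state[8];
	uint64_t count[2];
	uint8_t buf[SHA512_BLOCK_SIZE];
};

/* SHA-384 initial hash values, per FIPS 180-4 */
static const uint64_t sha384_iv[8] = {
	0xcbbb9d5dc1059ed8ULL, 0x629a292a367cd507ULL,
	0x9159015a3070dd17ULL, 0x152fecd8f70e5939ULL,
	0x67332667ffc00b31ULL, 0x8eb44a8768581511ULL,
	0xdb0c2e0d64f98fa7ULL, 0x47b5481dbefa4fa4ULL,
};

/* Assign only what must be initialized: the eight hash words and the
 * byte counters. The 128-byte buf is deliberately left untouched,
 * avoiding the implicit memset that a whole-struct compound-literal
 * assignment performs. */
static int sha384_base_init(struct sha512_state *sctx)
{
	int i;

	for (i = 0; i < 8; i++)
		sctx->state[i] = sha384_iv[i];
	sctx->count[0] = sctx->count[1] = 0;
	return 0;
}
```

Whether the compiler can already elide the dead stores of the literal assignment depends on the target and optimization level, which is presumably why the difference showed up on PPC at all.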
>> +int sha512_base_do_update(struct shash_desc *desc, const u8 *data,
>> +			  unsigned int len, sha512_block_fn *block_fn, void *p)
>> +{
>> +	struct sha512_state *sctx = shash_desc_ctx(desc);
>> +	unsigned int partial = sctx->count[0] % SHA512_BLOCK_SIZE;
>> +
>> +	sctx->count[0] += len;
>> +	if (sctx->count[0] < len)
>> +		sctx->count[1]++;
>
> You should check whether an early kickout at this point, when the buffer
> won't be filled up, is faster than first taking care of big data. That can
> improve performance for small blocks while large blocks should be unaffected.
>
>> +
>> +	if ((partial + len) >= SHA512_BLOCK_SIZE) {

Isn't this the early kickout? The if is only entered if there is enough data
to run the block function; otherwise it is a straight memcpy. I could add an
unlikely() here to favor the small data case.

>> +		int blocks;
>> +
>> +		if (partial) {
>> +			int p = SHA512_BLOCK_SIZE - partial;
>> +
>> +			memcpy(sctx->buf + partial, data, p);
>> +			data += p;
>> +			len -= p;
>> +		}
>> +
>> +		blocks = len / SHA512_BLOCK_SIZE;
>> +		len %= SHA512_BLOCK_SIZE;
>> +
>> +		block_fn(blocks, data, sctx->state,
>> +			 partial ? sctx->buf : NULL, p);
>> +		data += blocks * SHA512_BLOCK_SIZE;
>> +		partial = 0;
>> +	}
>> +	if (len)
>> +		memcpy(sctx->buf + partial, data, len);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(sha512_base_do_update);
>> +
>> +int sha512_base_do_finalize(struct shash_desc *desc, sha512_block_fn *block_fn,
>> +			    void *p)
>> +{
>> +	static const u8 padding[SHA512_BLOCK_SIZE] = { 0x80, };
>> +
>> +	struct sha512_state *sctx = shash_desc_ctx(desc);
>> +	unsigned int padlen;
>> +	__be64 bits[2];
>> +
>> +	padlen = SHA512_BLOCK_SIZE -
>> +		 (sctx->count[0] + sizeof(bits)) % SHA512_BLOCK_SIZE;
>> +
>> +	bits[0] = cpu_to_be64(sctx->count[1] << 3 |
>> +			      sctx->count[0] >> 61);
>> +	bits[1] = cpu_to_be64(sctx->count[0] << 3);
>> +
>> +	sha512_base_do_update(desc, padding, padlen, block_fn, p);
>
> I know that this is the most intuitive and straightforward implementation for
> handling finalization. Nevertheless, the maybe slightly obscure generic md5
> algorithm gives best-in-class performance for hash finalization of small
> input data.
>

Well, memcpy'ing a buffer consisting almost entirely of zeroes doesn't quite
feel right, indeed. I will instead follow the md5 suggestion.

> For comparison: from the raw numbers, the sha1-ppc-spe assembler module
> written by me is only 10% faster than the old sha1-powerpc assembler module.
> Both are simple assembler algorithms without hardware acceleration. For large
> blocks I gain another 8% by avoiding function calls, because the core module
> may process several blocks. But for small single-block updates the above glue
> code optimizations gave:
>
> 16-byte block single update:  +24%
> 64-byte block single update:  +16%
> 256-byte block single update: +12%
>
> Considering CPU-assisted SHA calculations, that percentage may be even
> higher.
>
> Maybe worth the effort ...
>

Absolutely!
Thanks again.