On 7 February 2014 10:23, Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Feb 07, 2014 at 08:30:26AM +0100, Ard Biesheuvel wrote:
>>
>> I agree that it would be trivial for cbc(%s) to probe for ecb(%s)
>> before settling on using plain '%s'.
>> But how to probe for an /accelerated/ ecb(%s), i.e., how to avoid
>> using the generic ecb(%s), which adds nothing but overhead?
>> The other issue is how to find out what the optimal chunk size is
>> for the accelerated ecb(%s) implementation, which would involve
>> adding a struct member that holds the preferred number of blocks to
>> be presented in a single invocation.
>> In fact, that would solve both issues, as the probe could check this
>> struct member for a >1 value (as my current series does, but against
>> a cipher_alg instance).
>
> I'd like to see some numbers on the actual overhead of ecb before
> we get too deeply into optimising it away.

Naturally.

> In any case, one easy solution is to change the driver name of
> generic ecb to ecb_generic (cf. ccm.c) which you could then
> check for in cbc and elsewhere.
>
> Also, how are you determining the optimal number of blocks? For
> the in-place case you're bound by how much memory you can find
> for the temporary buffer. For the not-in-place case wouldn't you
> just go for as much as you can?

Well, the thing with dedicated instructions is that the relation
between chunk size and speedup is not a smooth curve. Of course, it is
always beneficial to amortize, e.g., the loading of the round keys
over as many blocks as possible, but what is especially wasteful are
the pipeline stalls caused by data dependencies between subsequent
rounds. For instance, on arm64:

	// round 1
	aese	v0.16b, v16.16b
	aesmc	v0.16b, v0.16b
	// round 2
	aese	v0.16b, v17.16b
	aesmc	v0.16b, v0.16b
	// etc.

Putting all these instructions back to back makes it difficult for the
pipeline to reach its full speed. However, replacing it with

	// round 1a
	aese	v0.16b, v16.16b
	aesmc	v0.16b, v0.16b
	// round 1b
	aese	v1.16b, v16.16b
	aesmc	v1.16b, v1.16b
	// round 2a
	aese	v0.16b, v17.16b
	aesmc	v0.16b, v0.16b
	// round 2b
	aese	v1.16b, v17.16b
	aesmc	v1.16b, v1.16b

for a 2-way interleave may easily result in a significant speedup,
while further interleaving brings little additional benefit once the
pipeline stalls have all been eliminated.

Another example is bit-sliced AES, like the implementation in
arch/arm/crypto. It is 45% faster than the ordinary ARM asm
implementation, but its natural chunk size is 8 blocks: passing fewer
blocks hurts performance, while passing more blocks gives no
additional benefit at all.

So in many cases, it would be good to know the preferred chunk size of
an algorithm.

Regards,
Ard.
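
[Editor's note: to make the proposal above concrete, here is a minimal
userspace sketch of what such a "preferred chunk size" hint and the
probe described in the thread could look like. The struct name
cipher_alg_hint, the field preferred_blocks, and the helper pick_ecb()
are all hypothetical illustrations, not Ard's actual patch series or
the kernel crypto API of the time.]

	/*
	 * Hypothetical sketch, not actual kernel code: a cipher
	 * descriptor carrying a "preferred number of blocks per call"
	 * hint, plus a probe that treats an implementation as
	 * accelerated only when the hint is > 1, so that a cbc(%s)
	 * template never wraps the do-nothing generic ecb(%s).
	 */
	#include <stdio.h>
	#include <stddef.h>

	struct cipher_alg_hint {
		const char	*driver_name;
		unsigned int	blocksize;		/* bytes */
		unsigned int	preferred_blocks;	/* hypothetical hint */
	};

	/* Generic ECB wrapper: one block at a time, no speedup. */
	static const struct cipher_alg_hint ecb_generic = {
		.driver_name		= "ecb_generic",
		.blocksize		= 16,
		.preferred_blocks	= 1,
	};

	/* Bit-sliced NEON AES: natural chunk size is 8 blocks. */
	static const struct cipher_alg_hint ecb_aes_neonbs = {
		.driver_name		= "ecb-aes-neonbs",
		.blocksize		= 16,
		.preferred_blocks	= 8,
	};

	/*
	 * Probe as described in the thread: prefer a candidate that
	 * advertises a >1 preferred chunk size; otherwise report no
	 * accelerated ecb so the caller falls back to the plain cipher.
	 */
	static const struct cipher_alg_hint *
	pick_ecb(const struct cipher_alg_hint *candidates[], size_t n)
	{
		size_t i;

		for (i = 0; i < n; i++)
			if (candidates[i]->preferred_blocks > 1)
				return candidates[i];
		return NULL;
	}

	int main(void)
	{
		const struct cipher_alg_hint *cands[] = {
			&ecb_generic, &ecb_aes_neonbs,
		};
		const struct cipher_alg_hint *alg = pick_ecb(cands, 2);

		if (alg)
			printf("using %s: %u blocks (%u bytes) per call\n",
			       alg->driver_name, alg->preferred_blocks,
			       alg->preferred_blocks * alg->blocksize);
		else
			printf("no accelerated ecb; using plain cipher\n");
		return 0;
	}

For the bit-sliced AES case above, the hint works out to 8 x 16 = 128
bytes per invocation, matching the "natural chunk size" Ard describes.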