On 7 February 2014 10:23, Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Feb 07, 2014 at 08:30:26AM +0100, Ard Biesheuvel wrote:
>>
>> I agree that it would be trivial for cbc(%s) to probe for ecb(%s)
>> before settling on using plain '%s'.
>> But how to probe for an /accelerated/ ecb(%s), i.e., how to avoid
>> using the generic ecb(%s), which adds nothing but overhead?
>> The other issue is how to find out what the optimal chunk size is
>> for the accelerated ecb(%s) implementation, which would involve
>> adding a struct member that holds the preferred number of blocks to
>> be presented in a single invocation.
>> In fact, that would solve both issues, as the probe could check this
>> struct member for a >1 value (as my current series does, but against
>> a cipher_alg instance).
>
> I'd like to see some numbers on the actual overhead of ecb before
> we get too deeply into optimising it away.

Naturally.

> In any case, one easy solution is to change the driver name of
> generic ecb to ecb_generic (cf. ccm.c) which you could then
> check for in cbc and elsewhere.
>
> Also, how are you determining the optimal number of blocks? For
> the in-place case you're bound by how much memory you can find
> for the temporary buffer. For the not-in-place case wouldn't you
> just go for as much as you can?

Well, the thing with dedicated instructions is that the relation
between chunk size and speedup is not a smooth curve. Of course, it is
always beneficial to amortize, e.g., the loading of the round keys
over as many blocks as possible, but what is especially wasteful are
the pipeline stalls caused by data dependencies between subsequent
rounds. For instance, on arm64:

	// round 1
	aese	v0.16b, v16.16b
	aesmc	v0.16b, v0.16b
	// round 2
	aese	v0.16b, v17.16b
	aesmc	v0.16b, v0.16b
	// etc.

Putting all these instructions back to back makes it difficult for the
pipeline to reach its full speed. However, replacing it with

	// round 1a
	aese	v0.16b, v16.16b
	aesmc	v0.16b, v0.16b
	// round 1b
	aese	v1.16b, v16.16b
	aesmc	v1.16b, v1.16b
	// round 2a
	aese	v0.16b, v17.16b
	aesmc	v0.16b, v0.16b
	// round 2b
	aese	v1.16b, v17.16b
	aesmc	v1.16b, v1.16b

for a 2-way interleave may easily result in a significant speedup,
while further interleaving brings little additional benefit once the
pipeline stalls have all been eliminated.

Another example is bit-sliced AES, like the implementation in
arch/arm/crypto. It is 45% faster than the ordinary ARM asm
implementation, but its natural chunk size is 8 blocks: passing fewer
blocks hurts performance, while passing more blocks gives no
additional benefit at all.

So in many cases, it would be good to know the preferred chunk size of
an algorithm.

Regards,
Ard.
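
[Editor's note: to make the proposal above concrete, here is a minimal
userspace sketch of what such a "preferred chunk size" hint and the
probe described in the thread could look like. The struct name
cipher_alg_hint, the field preferred_blocks, and the helper pick_ecb()
are all hypothetical illustrations, not Ard's actual patch series or
the kernel crypto API of the time.]

	/*
	 * Hypothetical sketch, not actual kernel code: a cipher
	 * descriptor carrying a "preferred number of blocks per call"
	 * hint, plus a probe that treats an implementation as
	 * accelerated only when the hint is > 1, so that a cbc(%s)
	 * template never wraps the do-nothing generic ecb(%s).
	 */
	#include <stdio.h>
	#include <stddef.h>

	struct cipher_alg_hint {
		const char	*driver_name;
		unsigned int	blocksize;		/* bytes */
		unsigned int	preferred_blocks;	/* hypothetical hint */
	};

	/* Generic ECB wrapper: one block at a time, no speedup. */
	static const struct cipher_alg_hint ecb_generic = {
		.driver_name		= "ecb_generic",
		.blocksize		= 16,
		.preferred_blocks	= 1,
	};

	/* Bit-sliced NEON AES: natural chunk size is 8 blocks. */
	static const struct cipher_alg_hint ecb_aes_neonbs = {
		.driver_name		= "ecb-aes-neonbs",
		.blocksize		= 16,
		.preferred_blocks	= 8,
	};

	/*
	 * Probe as described in the thread: prefer a candidate that
	 * advertises a >1 preferred chunk size; otherwise report no
	 * accelerated ecb so the caller falls back to the plain cipher.
	 */
	static const struct cipher_alg_hint *
	pick_ecb(const struct cipher_alg_hint *candidates[], size_t n)
	{
		size_t i;

		for (i = 0; i < n; i++)
			if (candidates[i]->preferred_blocks > 1)
				return candidates[i];
		return NULL;
	}

	int main(void)
	{
		const struct cipher_alg_hint *cands[] = {
			&ecb_generic, &ecb_aes_neonbs,
		};
		const struct cipher_alg_hint *alg = pick_ecb(cands, 2);

		if (alg)
			printf("using %s: %u blocks (%u bytes) per call\n",
			       alg->driver_name, alg->preferred_blocks,
			       alg->preferred_blocks * alg->blocksize);
		else
			printf("no accelerated ecb; using plain cipher\n");
		return 0;
	}

For the bit-sliced AES case above, the hint works out to 8 x 16 = 128
bytes per invocation, matching the "natural chunk size" Ard describes.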