Hi David,

On 1/12/22, David Laight <David.Laight@xxxxxxxxxx> wrote:
> I think you mentioned in another thread that the buffers (eg for IPv6
> addresses) are actually often quite short.
>
> For short buffers the 'rolled-up' loop may be of similar performance
> to the unrolled one because of the time taken to read all the instructions
> into the I-cache and decode them.
> If the loop ends up small enough it will fit into the 'decoded loop
> buffer' of modern Intel x86 cpu and won't even need decoding on
> each iteration.
>
> I really suspect that the heavily unrolled loop is only really fast
> for big buffers and/or when it is already in the I-cache.
> In real life I wonder how often that actually happens?
> Especially for the uses the kernel is making of the code.
>
> You need to benchmark single executions of the function
> (doable on x86 with the performance monitor cycle counter)
> to get typical/best clocks/byte figures rather than a
> big average for repeated operation on a long buffer.
>
> David

This patch has been dropped entirely from future revisions. The latest
as of writing is at:

https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@xxxxxxxxx/

If you'd like to do something with blake2s, by all means submit a
patch and include various rationale and metrics and benchmarks. I do
not intend to do that myself and do not think my particular patch here
should be merged. But if you'd like to do something, feel free to CC
me for a review. However, as mentioned, I don't think much needs to be
done here.

Again, v3 is here:

https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@xxxxxxxxx/

Thanks,
Jason