On Thu, Dec 12, 2019 at 3:26 PM Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
>
> On Thu, 12 Dec 2019 at 14:47, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> >
> > On Thu, Dec 12, 2019 at 2:08 PM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> > >
> > > Hi Martin,
> > >
> > > On Thu, Dec 12, 2019 at 1:03 PM Martin Willi <martin@xxxxxxxxxxxxxx> wrote:
> > > > Can you provide some numbers to testify that? In my tests, the 32-bit
> > > > version gives me exact the same results.
> > >
> > > On 32-bit, if you only call update() once, then the results are the
> > > same. However, as soon as you call it more than once, this new version
> > > has increasing gains. Other than that, they should behave pretty much
> > > identically.
> >
> > Oh, you asked for numbers. I just fired up an Armada 370/XP and am
> > seeing a 8% increase in performance on calls to the update function.
>
> It would help if we could get some actual numbers. I usually try to
> capture the performance delta for a small set of block sizes that are
> significant for the use case at hand, e.g., like so [0], and also
> include blocksizes that are not 2^n. If the change improves the
> general case without causing any significant regressions elsewhere, I
> don't think we need to continue this debate.

I'm not sure I understand why the 32x32 performance discussion is even
happening in the first place. The new 32x32 code most certainly
doesn't make anything worse. It most likely makes some things better
in some places -- 8% on that machine I fired up, maybe more and maybe
less in other places. But who even cares? The principal advantage of
this patchset is the 64x64 code, and I think we gain something else,
immeasurable, by having parallel and comparable implementations.

Please, let's not turn this into another pointless debate.