On Wed, 22 Apr 2020 at 22:17, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote: > > On Wed, Apr 22, 2020 at 1:51 PM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote: > > > > On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote: > > > > > > On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote: > > > > > > > > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote: > > > > > Seems this should just be a 'while' loop? > > > > > > > > > > while (bytes) { > > > > > unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes); > > > > > > > > > > kernel_neon_begin(); > > > > > chacha_doneon(state, dst, src, todo, nrounds); > > > > > kernel_neon_end(); > > > > > > > > > > bytes -= todo; > > > > > src += todo; > > > > > dst += todo; > > > > > } > > > > > > > > The for(;;) is how it's done elsewhere in the kernel (that this patch > > > > doesn't touch), because then we can break out of the loop before > > > > having to increment src and dst unnecessarily. Likely a pointless > > > > optimization as probably the compiler can figure out how to avoid > > > > that. But maybe it can't. If you have a strong preference, I can > > > > reactor everything to use `while (bytes)`, but if you don't care, > > > > let's keep this as-is. Opinion? > > > > > > > > > > Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here, > > > given that bytes is guaranteed to be non-zero before we enter the > > > loop. But in any case, I'd prefer avoiding for(;;) or while(1) where > > > we can. > > > > Okay, will do-while it up for v2. > > I just sent v2 containing do-while, and I'm fine with that going in > that way. But just in the interest of curiosity in the pan-tone > palette, check this out: > > https://godbolt.org/z/VxXien > > It looks like on mine, the compiler avoids unnecessarily calling those > adds on the last iteration, but on the other hand, it results in an > otherwise unnecessary unconditional jump for the < 4096 case. Sort of > interesting. Arm64 code is more or less the same difference too. Yeah, even if shaving off 1 or 2 cycles mattered here (since we've just decided that ugh() may take up to 20,000 cycles), hiding a couple of ALU instructions in the slots between the subs (which sets the zero flag) and the conditional branch that tests it probably comes for free on in-order cores anyway. And even if it didn't, backwards branches are usually statically predicted as taken, in which case their results are actually needed. On out-of-order cores under speculation, none of this matters anyway.