Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

"Jason A. Donenfeld" <Jason@xxxxxxxxx> · Wed, 22 Apr 2020 14:17:28 -0600

On Wed, Apr 22, 2020 at 1:51 PM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
> >
> > On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> > >
> > > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> > > > Seems this should just be a 'while' loop?
> > > >
> > > >         while (bytes) {
> > > >                 unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > > >
> > > >                 kernel_neon_begin();
> > > >                 chacha_doneon(state, dst, src, todo, nrounds);
> > > >                 kernel_neon_end();
> > > >
> > > >                 bytes -= todo;
> > > >                 src += todo;
> > > >                 dst += todo;
> > > >         }
> > >
> > > The for(;;) is how it's done elsewhere in the kernel (that this patch
> > > doesn't touch), because then we can break out of the loop before
> > > having to increment src and dst unnecessarily. Likely a pointless
> > > optimization as probably the compiler can figure out how to avoid
> > > that. But maybe it can't. If you have a strong preference, I can
> > > reactor everything to use `while (bytes)`, but if you don't care,
> > > let's keep this as-is. Opinion?
> > >
> >
> > Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> > given that bytes is guaranteed to be non-zero before we enter the
> > loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> > we can.
>
> Okay, will do-while it up for v2.

I just sent v2 containing do-while, and I'm fine with that going in
that way. But just in the interest of curiosity in the pan-tone
palette, check this out:

https://godbolt.org/z/VxXien

It looks like on mine, the compiler avoids unnecessarily calling those
adds on the last iteration, but on the other hand, it results in an
otherwise unnecessary unconditional jump for the < 4096 case. Sort of
interesting. Arm64 code is more or less the same difference too.