Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption

Ard Biesheuvel <ardb@xxxxxxxxxx> · Fri, 25 Nov 2022 09:59:17 +0100

On Fri, 25 Nov 2022 at 09:41, Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 16, 2022 at 12:13:51PM +0100, Jason A. Donenfeld wrote:
> > On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> > > +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> > > +static const unsigned int bytes_per_fpu = 337 * 1024;
> > > +
> >
> > Use an enum for constants like this:
> >
> >     enum { BYTES_PER_FPU = ... };
> >
> > You can even make it function-local, so it's near the code that uses it,
> > which will better justify its existence.
> >
> > Also, where did you get this number? Seems kind of weird.
>
> These numbers are highly dependent on hardware and I think having
> them hard-coded is wrong.
>
> Perhaps we should try a different approach.  How about just limiting
> the size to 4K, and then depending on need_resched we break out of
> the loop? Something like:
>
>         if (!len)
>                 return 0;
>
>         kernel_fpu_begin();
>         for (;;) {
>                 unsigned int chunk = min(len, 4096);
>
>                 sha1_base_do_update(desc, data, chunk, sha1_xform);
>
>                 len -= chunk;
>                 data += chunk;
>
>                 if (!len)
>                         break;
>
>                 if (need_resched()) {
>                         kernel_fpu_end();
>                         cond_resched();
>                         kernel_fpu_begin();
>                 }
>         }
>         kernel_fpu_end();
>

On arm64, this is implemented in an assembler macro 'cond_yield' so we
don't need to preserve/restore the SIMD state state at all if the
yield is not going to result in a call to schedule(). For example, the
SHA3 code keeps 400 bytes of state in registers, which we don't want
to save and reload unless needed. (5f6cb2e617681 'crypto:
arm64/sha512-ce - simplify NEON yield')

So the asm routines will call cond_yield, and return early if a yield
is required, with the number of blocks or bytes left to process as the
return value. The C wrapper just calls the asm routine in a loop until
the return value becomes 0.

That way, we don't need magic values at all, and the yield will occur
as soon as the asm inner loop observes the yield condition so the
latency should be much lower as well.

Note that it is only used in shash implementations, given that they
are the only ones that may receive unbounded inputs.