RE: [PATCH] crypto: x86: Do not acquire fpu context for too long

"Elliott, Robert (Servers)" <elliott@xxxxxxx> · Tue, 11 Oct 2022 15:16:12 +0000

> > From: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
...
> > On Sat, Oct 08, 2022 at 07:48:07PM +0000, Elliott, Robert (Servers) wrote:
> > >
> > > Perhaps the cycles mode needs to call cond_resched() too?
> >
> > Yes, just make the cond_resched unconditional.  Having a few too many
> > rescheds shouldn't be an issue.
> 
> This looks promising. I was able to trigger a lot of rcu stalls by setting:
>   echo 2 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
>   echo 200 > /sys/module/rcupdate/parameters/rcu_exp_cpu_stall_timeout
> 
> and running these concurrently:
>   watch -n 0 modprobe tcrypt=200
>   watch -n 0 module tcrypt=0 through 999
> 
> I am getting miscompares from the extended self-test for crc32 and
> crct10dif, and will investigate those further.

The assembly functions for those two do not handle small sizes (unlike
crc32c, which handles all sizes), so the new do/while loop needs to
use larger steps and call the generic function for any leftover bytes.
I've corrected that patch.

> After changing tcrypt to call cond_resched in both cases, I don't see any
> more rcu stalls.

This ran cleanly with a few nights of testing with 
   CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y

which compares the accelerated implementation to the generic
implementation under randomly generated test vectors - a very
helpful test suite. It only works with data sizes up to
2 * PAGE_SIZE, though.

With the cond_resched changes, rcu stalls are gone with one exception.
If printing of Call Traces is triggered for any reason, either by a
test failing (e.g., because of the CRC loop bugs) or by using a SysRq
trigger like:
   echo l > /proc/sysrq-trigger

then the very act of generating and printing the Call Traces can
trigger rcu stall messages. Examples:

  Oct 09 14:56:39 kernel: cryptomgr: alg: shash: crct10dif-pclmul test failed (wrong result) on test vector "random: psize=4303 ksize=0", cfg="random: use_final src_divs=[<flush>95.43%@+4, <flush>4.57%@+6]"
  Oct 09 14:56:39 kernel: cryptomgr: alg: self-tests for crct10dif using crct10dif failed (rc=-22)
  Oct 09 14:56:39 kernel: ------------[ cut here ]------------
  Oct 09 14:56:39 kernel: alg: self-tests for crct10dif using crct10dif failed (rc=-22)
  Oct 09 14:56:39 kernel: WARNING: CPU: 55 PID: 10553 at crypto/testmgr.c:5837 alg_test.cold+0x3e/0x124
  ...
  Oct 09 14:56:42 kernel: R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000040000
  Oct 09 14:56:42 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

and:
  Oct 10 19:45:02 kernel: sysrq: Show backtrace of all active CPUs
  ...
  Oct 10 19:45:03 kernel:  </TASK>
  ...
  Oct 10 19:45:06 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

If there are no WARNs or call traces, then there are no rcu stalls.

So, we should be aware that commit 09a5ef9644bc0e1 ("crypto: testmgr - WARN
on test failure") might introduce additional errors.

> BTW, the way tcrypt always refuses to load leads to an ever-growing list in
> the Call Traces:
> 
> kernel: Unloaded tainted modules: tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 t
> crypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
>  tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():
> 1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1 tcrypt():1
> tcrypt():1 tcrypt():1 tcrypt():1 tcrypt()
> :1 tcrypt():1 tcrypt():1 tcrypt():1

That was already noticed and a fix has already been posted for 6.1
as commit 47cc75aa9283 ("module: tracking: Keep a record of tainted unloaded
modules only")