[BUG] More issues with arm/aes-neonbs

"Russell King (Oracle)" <linux@xxxxxxxxxxxxxxx> · Mon, 5 Aug 2024 22:42:06 +0100

Hi,

I see there have been multiple attempts to fix this module, but sadly
the problems persist.

On my i.MX6 platforms, I enabled aes-arm-bs support as of 6.9, and
I've been getting a load of hung tasks at boot ever since. I started
debugging this under 6.10 this evening, which involved hacking the
kernel code to get useful information out of it. I've ended up
dumping the entire state of all threads when the hung task fires.

What I find is this - the aes-arm-bs module is being probed, and
this is its trace:

[   74.803096] task:modprobe        state:D stack:0     pid:613   tgid:613   ppid:37     flags:0x00000000
[   74.812620] Call trace:
[   74.812636] [<c0b784cc>] (__schedule) from [<c0b78bbc>] (schedule+0x50/0x128)
[   74.822586] [<c0b78bbc>] (schedule) from [<c0b82fac>] (schedule_timeout+0xb0/0x1b8)
[   74.830444] [<c0b82fac>] (schedule_timeout) from [<c0b79420>] (__wait_for_common+0x74/0x170)
[   74.839110] [<c0b79420>] (__wait_for_common) from [<c0488b8c>] (crypto_larval_wait+0x14/0x98)
[   74.847852] [<c0488b8c>] (crypto_larval_wait) from [<c0488e14>] (crypto_alg_mod_lookup+0x204/0x20c)
[   74.857118] [<c0488e14>] (crypto_alg_mod_lookup) from [<c0488f5c>] (crypto_alloc_tfm_node+0x48/0xb4)
[   74.866468] [<c0488f5c>] (crypto_alloc_tfm_node) from [<c048c478>] (crypto_alloc_skcipher+0x28/0x30)
[   74.875857] [<c048c478>] (crypto_alloc_skcipher) from [<bf3e88b8>] (cbc_init+0x1c/0x38 [aes_arm_bs])
[   74.885264] [<bf3e88b8>] (cbc_init [aes_arm_bs]) from [<c04889c0>] (crypto_create_tfm_node+0x34/0xd4)
[   74.894736] [<c04889c0>] (crypto_create_tfm_node) from [<c0488f74>] (crypto_alloc_tfm_node+0x60/0xb4)
[   74.894770] [<c0488f74>] (crypto_alloc_tfm_node) from [<c048c478>] (crypto_alloc_skcipher+0x28/0x30)
[   74.894800] [<c048c478>] (crypto_alloc_skcipher) from [<bf3de61c>] (simd_skcipher_create_compat+0x20/0x17c [crypto_simd])
[   74.894849] [<bf3de61c>] (simd_skcipher_create_compat [crypto_simd]) from [<bf3ef06c>] (aes_init+0x6c/0x1000 [aes_arm_bs])
[   74.894896] [<bf3ef06c>] (aes_init [aes_arm_bs]) from [<c0009ffc>] (do_one_initcall+0x60/0x2c0)
[   74.894933] [<c0009ffc>] (do_one_initcall) from [<c00e6640>] (do_init_module+0x54/0x1fc)
[   74.894962] [<c00e6640>] (do_init_module) from [<c00e8644>] (init_module_from_file+0x84/0xa4)
[   74.961860] [<c00e8644>] (init_module_from_file) from [<c00e892c>] (sys_finit_module+0x170/0x21c)
[   74.961897] [<c00e892c>] (sys_finit_module) from [<c0008320>] (ret_fast_syscall+0x0/0x1c)

What seems to be happening here is that we have registered all the
main ciphers using crypto_register_skciphers(), and then we walk the
array of algos, calling simd_skcipher_create_compat() on each.

We get to the __cbc(aes) entry, and this one seems to trigger the
larval_wait thing. With debug in crypto_alg_mod_lookup(), I find
this:

[   25.131852] modprobe:613: crypto_alg_mod_lookup: name=cbc(aes) type=0x5 mask=0x218e ok=32769
...
[   87.015070]   name=cbc(aes) alg=0xffffff92

and 0xffffff92 is an error-pointer for ETIMEDOUT.

i.MX6 does have the CAAM hardware that can do cbc(aes), so thinking
that may be the issue, I decided to try blacklisting the CAAM modules.
This made no difference.

It seems that the issue is centred around the aes-arm-bs module. Even
after boot, and having removed the module, manually reloading it also
causes the same problem:

# time modprobe aes-arm-bs
modprobe: ERROR: could not insert 'aes_arm_bs': Connection timed out

real    1m1.731s
user    0m0.004s
sys     0m0.052s

The interesting thing is... if I blacklist the aes-arm module, then
aes-arm-bs doesn't behave this way and loads successfully. If I pre-
load the aes-arm module, then the hanging behaviour returns.

So... with my debug in place, loading aes-arm-bs with aes-arm
blacklisted gives me:

[ 4289.026431] modprobe:1786: crypto_alg_mod_lookup: name=cbc(aes) type=0x5 mask=0x218e ok=32769
[ 4289.084516] cryptomgr_probe:1788: crypto_alg_mod_lookup: name=aes type=0x20004 mask=0x218f ok=0
[ 4289.084556]   name=aes alg=0xfffffffe
[ 4289.114602] cryptomgr_probe:1788: crypto_alg_mod_lookup: name=ecb(aes) type=0x20004 mask=0x218f ok=32769
[ 4289.163489] cryptomgr_probe:1793: crypto_alg_mod_lookup: name=aes type=0x20004 mask=0x218f ok=0
[ 4289.163530]   name=aes alg=0xfffffffe
[ 4289.165187]   name=ecb(aes) alg=0xc4b318c0
[ 4289.165367]   name=cbc(aes) alg=0xc4b31cc0

Hence, looking up "aes" returns an immediate -ENOENT (and this is the
only "name" that aes-arm provides.) With aes-arm loaded:

[ 3926.164204] modprobe:1691: crypto_alg_mod_lookup: name=cbc(aes) type=0x5 mask=0x218e ok=32769
[ 3926.212563] cryptomgr_probe:1693: crypto_alg_mod_lookup: name=aes type=0x20004 mask=0x218f ok=0
[ 3926.212605]   name=aes alg=0xfffffffe
[ 3988.209746]   name=cbc(aes) alg=0xffffff92
[ 3988.412691] cryptomgr_probe:1693: crypto_alg_mod_lookup: name=ecb(aes) type=0x20004 mask=0x218f ok=32769
[ 3988.462116] cryptomgr_probe:1708: crypto_alg_mod_lookup: name=aes type=0x20004 mask=0x218f ok=0
[ 3988.462159]   name=aes alg=0xfffffffe
[ 3988.462292]   name=ecb(aes) alg=0xc4b320c0

Interestingly, when aes-arm is not loaded, the cbc(aes) lookup only
succeeds _after_ ecb(aes) has, whereas in the failing case we're
clearly waiting for cbc(aes) to time out before proceeding to
ecb(aes).

This is about as far as I've managed to get debugging this, and I'm
starting to hit the maze that is crypto probing/manager code that
isn't easy to understand... at least not on a late Monday evening.
Any suggestions?

Right now, though, from what I can see the aes-arm-bs module is
entirely unusable, and the only way I can get a reasonably bootable
system is to avoid loading this module (either by disabling it in
the kernel build or blacklisting it in modprobe - the latter being
my current workaround for this bug).

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!