On Wed, Mar 03, 2021 at 12:26:32PM +0200, Horia Geantă wrote:
> Adding some people in the loop, maybe they could help in understanding
> why lack of "dma-coherent" property for a HW-coherent device could lead to
> unexpected / strange side effects.
>
> On 3/1/2021 5:22 PM, Sascha Hauer wrote:
> > Hi All,
> >
> > I am on a Layerscape LS1046a using Linux-5.11. The CAAM driver sometimes
> > crashes during the run-time self tests with:
> >
> >> kernel BUG at drivers/crypto/caam/jr.c:247!
> >> Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> >> Modules linked in:
> >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.11.0-20210225-3-00039-g434215968816-dirty #12
> >> Hardware name: TQ TQMLS1046A SoM on Arkona AT1130 (C300) board (DT)
> >> pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> >> pc : caam_jr_dequeue+0x98/0x57c
> >> lr : caam_jr_dequeue+0x98/0x57c
> >> sp : ffff800010003d50
> >> x29: ffff800010003d50 x28: ffff8000118d4000
> >> x27: ffff8000118d4328 x26: 00000000000001f0
> >> x25: ffff0008022be480 x24: ffff0008022c6410
> >> x23: 00000000000001f1 x22: ffff8000118d4329
> >> x21: 0000000000004d80 x20: 00000000000001f1
> >> x19: 0000000000000001 x18: 0000000000000020
> >> x17: 0000000000000000 x16: 0000000000000015
> >> x15: ffff800011690230 x14: 2e2e2e2e2e2e2e2e
> >> x13: 2e2e2e2e2e2e2020 x12: 3030303030303030
> >> x11: ffff800011700a38 x10: 00000000fffff000
> >> x9 : ffff8000100ada30 x8 : ffff8000116a8a38
> >> x7 : 0000000000000001 x6 : 0000000000000000
> >> x5 : 0000000000000000 x4 : 0000000000000000
> >> x3 : 00000000ffffffff x2 : 0000000000000000
> >> x1 : 0000000000000000 x0 : 0000000000001800
> >> Call trace:
> >> caam_jr_dequeue+0x98/0x57c
> >> tasklet_action_common.constprop.0+0x164/0x18c
> >> tasklet_action+0x44/0x54
> >> __do_softirq+0x160/0x454
> >> __irq_exit_rcu+0x164/0x16c
> >> irq_exit+0x1c/0x30
> >> __handle_domain_irq+0xc0/0x13c
> >> gic_handle_irq+0x5c/0xf0
> >> el1_irq+0xb4/0x180
> >> arch_cpu_idle+0x18/0x30
> >> default_idle_call+0x3c/0x1c0
> >> do_idle+0x23c/0x274
> >> cpu_startup_entry+0x34/0x70
> >> rest_init+0xdc/0xec
> >> arch_call_rest_init+0x1c/0x28
> >> start_kernel+0x4ac/0x4e4
> >> Code: 91392021 912c2000 d377d8c6 97f24d96 (d4210000)
> >
> > The driver iterates over the descriptors in the output ring and matches them
> > with the ones it has previously queued. If it doesn't find a matching
> > descriptor it complains with the BUG_ON() seen above. What I see sometimes is
> > that the address in the output ring is 0x0, the job status in this case is
> > 0x40000006 (meaning DECO Invalid KEY command). It seems that the CAAM doesn't
> > write the descriptor address to the output ring at least in some error cases.
> > When we don't have the descriptor address of the failed descriptor we have no
> > way to find it in the list of queued descriptors, thus we also can't find the
> > callback for that descriptor. This looks very unfortunate, anyone else seen
> > this or has an idea what to do about it?
> >
> > I haven't investigated yet which job actually fails and why. Of course that would
> > be my ultimate goal to find that out.
> >
> This looks very similar to an earlier report from Greg.
> He confirmed that adding "dma-coherent" property to the "crypto" DT node
> fixes the issue:
> https://lore.kernel.org/linux-crypto/74f664f5-5433-d322-4789-3c78bdb814d8@xxxxxxxxxx
> Patch rebased on v5.11 is at the bottom. Does it work for you too?

Indeed, this seems to solve it for me as well. You can add my

Tested-by: Sascha Hauer <s.hauer@xxxxxxxxxxxxxx>

However, there seem to be two problems: first, that "DECO Invalid KEY
command" actually occurs, and second, that the dequeue code currently
can't handle a NULL pointer in the output ring. Do you think that the
occurrence of a NULL pointer is also a coherency issue?

Sascha

-- 
Pengutronix e.K.                 |                             |
Steuerwalder Str. 21             | http://www.pengutronix.de/  |
31137 Hildesheim, Germany        | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
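
For illustration, here is a minimal sketch of the lookup described in the quoted report: the descriptor address reported in the output ring is matched against the jobs queued earlier, and an address of 0x0 (or one that matches nothing) is reported as "not found" instead of tripping a BUG_ON(). This is not the code in drivers/crypto/caam/jr.c. RING_DEPTH, struct sw_entry and find_queued_job() are hypothetical names, and the ring layout is assumed purely for the example.

```c
#include <assert.h>
#include <stdint.h>

#define RING_DEPTH 8u /* hypothetical depth; must be a power of two */

/* Hypothetical software-side record of one job handed to the hardware. */
struct sw_entry {
    uint64_t desc_dma; /* bus address of the queued descriptor */
    int in_use;        /* nonzero while the job is outstanding */
};

/*
 * Look up the queued job whose descriptor address the hardware reported
 * in its output ring. Scans the software ring from tail up to (but not
 * including) head. Returns the slot index, or -1 if the address is 0 or
 * matches no outstanding job, so the caller can log and skip the entry
 * instead of halting.
 */
static int find_queued_job(const struct sw_entry *sw, unsigned int head,
                           unsigned int tail, uint64_t desc_dma)
{
    /* Index sanity is a programming invariant, unlike hardware-written data. */
    assert(head < RING_DEPTH && tail < RING_DEPTH);

    if (desc_dma == 0)
        return -1; /* nothing to match against */

    for (unsigned int i = tail; i != head; i = (i + 1) & (RING_DEPTH - 1)) {
        if (sw[i].in_use && sw[i].desc_dma == desc_dma)
            return (int)i;
    }
    return -1;
}
```

With jobs at 0x1000 and 0x2000 outstanding in slots 2 and 3 (tail 2, head 4), a reported address of 0x2000 resolves to slot 3, while 0x0 or an unknown address yields -1. Returning -1 only avoids the crash: without the descriptor address, the callback of the failed job still cannot be found.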