Re: RCU stall and/or host->device broken transition

Ferry Toth <fntoth@xxxxxxxxx> · Sat, 19 Jun 2021 15:35:30 +0200

Hi

Op 18-06-2021 om 09:31 schreef Andy Shevchenko:
On Fri, Jun 18, 2021 at 10:15 AM Ferry Toth <fntoth@xxxxxxxxx> wrote:
Op 16-06-2021 om 20:37 schreef Andy Shevchenko:
There is a bug somewhere and I have already spent too much time
whacking the mole. Need some hints, ideas on how to debug, etc.

So, my setup is that:
platform: Intel Merrifield
USB host and device controller: DWC3
USB host connected device: Diolan DLN-2 IO expander
In device mode: USB EEM
I'd like to try to reproduce, but don't have the Diolan DLN-2. Is that
essential to generating the stall? What if you just plug a USB stick?
It would be actually nice to have a reproducer on the other hardware,
this will eliminate one thing from the equation.

I have the SMSC9514 4p USB + 1 eth hub connected with external power, 
with a USB stick plugged into the hub. But can not reproduce see below.

Also tried with USB stick connected directly (powered from port), same.

The workflow
~~~~~~~~~~
The following steps to reproduce the issue (with some patches applied
/ reverted it may give different results from less than 10% up to
almost 100% of reproducibility):

(Note the board has a manual switch of an ID pin and two connectors
depending on the role)
1. Connect the DLN-2 to USB A
2. Connect the PC machine to USB micro-B
3. Boot the board in device mode
I've tested by booting in host mostly, so may have missed this case. Did
you find it is essential to boot in device mode to generate the stall?
No idea, I haven't tried the other way around.

I now tried booting directly into device mode and then switching to host 
and back multiple time. I can not reproduce the RCU stall up to now.

root@yuna:~# journalctl -k | grep -i rcu
kernel: rcu: Hierarchical RCU implementation.
kernel: rcu:         RCU event tracing is enabled.
kernel: rcu:         RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=2.
kernel:         Tracing variant of Tasks RCU enabled.
kernel: rcu: RCU calculated value of scheduler-enlistment delay is 100 
jiffies.
kernel: rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
kernel: rcu: Hierarchical SRCU implementation.

4. Setup USB EEM gadget via ConfigFS
5. Turn the switch (see above note) to the host and back to the device
mode as many times (usually a very few are needed) and with
(semi-)arbitrary time in between (usually 2-5 seconds) till the issue
occurs.
2 sec is a bit faster then I tested. Reason is that in user space in my
case connman needs a bit of time to actually establish an ethernet
connection to the PC. Are you using connman or other network manager?
Nope, I have a pure shell by Busybox.

So we either have different kernel, difference caused by user space, or 
a difference caused by different hardware plugged.

I think you already looked into the differences between our kernels.

Maybe you could try plugging just USB stick in place of DLN-2, then we 
can eliminate that.

On the PC side I have network manager providing a shared connection and
when all is well I see on the PC side a notification of the connection
being established. Do you have something similar?
On the PC side I have NetworkManager, but it shouldn't matter, because
I have set up the connection manually on the host side and by default
it doesn't establish it automatically.

The issue
~~~~~~~~
When switching from the host to the gadget mode
- the status LED on the DLN-2 board continues blinking (Vbus is on)
This looks like something I had before. Can you try switching very fast
host-gadget-host-gadget ( like in a 1 sec.)?
IIRC I have tried this and have found no correlation between frequency
of switching and failure rate. But I can try again. (Currently I have
switched to another  task, so it might be delayed)
On my end switching quickly back and forth doesn't trigger any errors 
now, that issue seems resolved for good.

- the driver may be still xhci (rarely), but often already switched to
dwc3
- due to above `lsusb` (rarely) still shows hub devices
- the EEM does not appear (PC doesn't see network iface)
- in ~50% cases RCU stall (there are only two CPUs)

RCU stall
~~~~~~~~
cat /proc/interrupts
16: 176129 0 IO-APIC 34-fasteoi xhci-hcd:usb1

[434689.740982] rcu: INFO: rcu_sched self-detected stall on CPU
[434689.746856] rcu: 0-....: (24661 ticks this GP)
idle=ee2/1/0x4000000000000000 softirq=4162/4163 fqs=5238
[434689.756867] (t=21000 jiffies g=8489 q=52)
[434689.761204] NMI backtrace for cpu 0
[434689.764919] CPU: 0 PID: 222 Comm: kworker/0:0 Not tainted
5.13.0-rc5+ #4
[434689.771948] Hardware name: Intel Corporation Merrifield/BODEGA
BAY, BIOS 542 2015.01.21:18.19.48
[434689.781126] Workqueue: usb_hub_wq hub_event
[434689.785590] Call Trace:
[434689.788232] <IRQ>
[434689.790431] dump_stack+0x69/0x8e
[434689.793996] nmi_cpu_backtrace.cold+0x32/0x69
[434689.798612] ? lapic_can_unplug_cpu+0x80/0x80
[434689.803249] nmi_trigger_cpumask_backtrace+0x8a/0xa0
[434689.808508] rcu_dump_cpu_stacks+0xbf/0xed
[434689.812873] rcu_sched_clock_irq.cold+0xc7/0x1e9
[434689.817805] update_process_times+0x8c/0xc0
[434689.822255] tick_sched_handle+0x34/0x50
[434689.826432] tick_sched_timer+0x7a/0xd0
[434689.830513] ? get_cpu_iowait_time_us+0x110/0x110
[434689.835499] __hrtimer_run_queues+0x157/0x350
[434689.840171] hrtimer_interrupt+0x110/0x2c0
[434689.844569] __sysvec_apic_timer_interrupt+0x76/0x150
[434689.849911] sysvec_apic_timer_interrupt+0x2f/0x90
[434689.854990] asm_sysvec_apic_timer_interrupt+0x12/0x20
[434689.860415] RIP: 0010:_raw_spin_unlock_irqrestore+0x37/0x50
[434689.866294] Code: 53 48 89 f3 48 8b 74 24 10 e8 45 fa 3a ff 48 89
ef e8 6d 23 3b ff 9c 58 f6 c4 02 75 10 80 e7 02 74 01 fb 65 ff 0d e9
d3 4f 78 <5b> 5d c3 e8 b1 3b ff ff eb e9 66 66 2e 0f 1f 84 00 00 00 00
00 0f
[434689.885773] RSP: 0000:ffff9ac140003da8 EFLAGS: 00000206
[434689.891294] RAX: 0000000000000046 RBX: 0000000000000202 RCX:
ffff8b7e4ad208d0
[434689.898775] RDX: 0000000062c241c0 RSI: 0000000042fee84d RDI:
ffffffff89460720
[434689.906255] RBP: ffffffff89460720 R08: 00000000897fe2f0 R09:
0000000000000001
[434689.913733] R10: 0000000000000000 R11: ffff8b7e46226db0 R12:
ffff8b7e4b29c000
[434689.921214] R13: 0000000000000000 R14: ffff8b7e4b29c2f8 R15:
ffff8b7e42e65780
[434689.928798] debug_dma_unmap_page+0x7e/0x90
[434689.933324] ? __lock_acquire.constprop.0+0x27d/0x550
[434689.938681] ? find_held_lock+0x2b/0x80
[434689.942787] usb_hcd_unmap_urb_for_dma+0x65/0x100
[434689.947785] __usb_hcd_giveback_urb+0x4b/0x100
[434689.952513] usb_giveback_urb_bh+0xa7/0x100
[434689.956994] tasklet_action_common.constprop.0+0xd0/0x140
[434689.962712] __do_softirq+0xeb/0x2dd
[434689.966565] irq_exit_rcu+0x95/0xc0
[434689.970288] common_interrupt+0x7b/0xa0
[434689.974375] </IRQ>
[434689.976670] asm_common_interrupt+0x1e/0x40
[434689.981109] RIP: 0010:_raw_spin_unlock_irq+0x27/0x30
[434689.986359] Code: 35 ff 90 0f 1f 44 00 00 55 48 8b 74 24 08 48 89
fd 48 8d 7f 18 e8 79 fa 3a ff 48 89 ef e8 a1 23 3b ff fb 65 ff 0d 29
d4 4f 78 <5d> c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 fd 48 83
c7 18
[434690.005836] RSP: 0000:ffff9ac140293b98 EFLAGS: 00000246
[434690.011352] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff8b7e4ad208d0
[434690.018832] RDX: 0000000062c241c0 RSI: 0000000042fee84d RDI:
ffffffff8881ab00
[434690.026309] RBP: ffffffff8881ab00 R08: 00000000897fe2f0 R09:
0000000000000001
[434690.033790] R10: ffff8b7e7e23ac30 R11: 0000000000000000 R12:
0000000000000000
[434690.041268] R13: ffff8b7e4b29c000 R14: ffff8b7e4b40c018 R15:
ffff8b7e4af62950
[434690.048840] ? _raw_spin_unlock_irq+0x1f/0x30
[434690.053465] usb_hcd_submit_urb+0x294/0xbf0
[434690.057948] ? lockdep_init_map_type+0x47/0x220
[434690.062805] usb_start_wait_urb+0x65/0x160
[434690.067230] usb_control_msg+0xdf/0x140
[434690.071355] hub_event+0x8f4/0x1890
[434690.075184] ? lock_acquire+0x93/0x150
[434690.079175] ? process_one_work+0x1bc/0x4b0
[434690.083652] process_one_work+0x24d/0x4b0
[434690.087970] worker_thread+0x55/0x3c0
[434690.091875] ? rescuer_thread+0x390/0x390
[434690.096154] kthread+0x137/0x150
[434690.099612] ? __kthread_bind_mask+0x60/0x60
[434690.104166] ret_from_fork+0x22/0x30

followed by Watchdog one

[434725.296748] Kernel panic - not syncing: Kernel Watchdog
[434725.302247] CPU: 0 PID: 222 Comm: kworker/0:0 Not tainted
5.13.0-rc5+ #4
[434725.309242] Hardware name: Intel Corporation Merrifield/BODEGA
BAY, BIOS 542 2015.01.21:18.19.48
[434725.318367] Workqueue: usb_hub_wq hub_event
[434725.322801] Call Trace:
[434725.325424] <IRQ>
[434725.327607] dump_stack+0x69/0x8e
[434725.331145] panic+0x102/0x2d1
[434725.334460] mid_wdt_irq+0x11/0x11
[434725.338072] __handle_irq_event_percpu+0x65/0x1c0
[434725.343048] handle_irq_event+0x55/0xb0
[434725.347124] handle_fasteoi_irq+0x78/0x1d0
[434725.351461] __common_interrupt+0x3e/0xa0
[434725.355713] common_interrupt+0x3b/0xa0
[434725.359784] asm_common_interrupt+0x1e/0x40
[434725.364192] RIP: 0010:_raw_spin_lock_irqsave+0x0/0x50
[434725.369501] Code: 8d 7f 18 45 31 c9 31 c9 41 b8 01 00 00 00 31 d2
31 f6 e8 03 ff 3a ff 48 89 ef 58 5d e9 99 24 3b ff 66 0f 1f 84 00 00
00 00 00 <0f> 1f 44 00 00 41 54 55 48 89 fd 9c 41 5c fa 65 ff 05 ea d5
4f 78
[434725.388866] RSP: 0000:ffff9ac140003d00 EFLAGS: 00000202
[434725.394347] RAX: 0000000000000028 RBX: 00000000002b5ea8 RCX:
ffff8b7e4ad208d0
[434725.401779] RDX: 0000000062c241c0 RSI: 0000000042fee84d RDI:
ffffffff887579e0
[434725.409211] RBP: 0000000000000282 R08: 00000000897fe2f0 R09:
0000000000000001
[434725.416641] R10: 0000000000000000 R11: ffff8b7e46226db0 R12:
ffff8b7e41282500
[434725.424073] R13: ffff8b7e41281f00 R14: 0000000000000200 R15:
000000000ad7aa00
[434725.431582] add_dma_entry+0xa9/0x1d0
[434725.435487] dma_map_page_attrs+0xd8/0x220
[434725.439847] usb_hcd_map_urb_for_dma+0x3b6/0x4f0
[434725.444734] usb_hcd_submit_urb+0x98/0xbf0
[434725.449059] ? __lock_acquire.constprop.0+0x27d/0x550
[434725.454378] ? find_held_lock+0x2b/0x80
[434725.458493] dln2_rx+0x1d6/0x2b0 [dln2]
[434725.462595] __usb_hcd_giveback_urb+0x91/0x100
[434725.467291] usb_giveback_urb_bh+0xa7/0x100
[434725.471740] tasklet_action_common.constprop.0+0xd0/0x140
[434725.477420] __do_softirq+0xeb/0x2dd
[434725.481240] irq_exit_rcu+0x95/0xc0
[434725.484939] common_interrupt+0x7b/0xa0
[434725.489000] </IRQ>
[434725.491275] asm_common_interrupt+0x1e/0x40
[434725.495682] RIP: 0010:_raw_spin_unlock_irq+0x27/0x30
[434725.500898] Code: 35 ff 90 0f 1f 44 00 00 55 48 8b 74 24 08 48 89
fd 48 8d 7f 18 e8 79 fa 3a ff 48 89 ef e8 a
1 23 3b ff fb 65 ff 0d 29 d4 4f 78 <5d> c3 0f 1f 80 00 00 00 00 0f 1f
44 00 00 55 48 89 fd 48 83 c7 18
[434725.520261] RSP: 0000:ffff9ac140293b98 EFLAGS: 00000246
[434725.525742] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff8b7e4ad208d0
[434725.533174] RDX: 0000000062c241c0 RSI: 0000000042fee84d RDI:
ffffffff8881ab00
[434725.540606] RBP: ffffffff8881ab00 R08: 00000000897fe2f0 R09:
0000000000000001
[434725.548037] R10: ffff8b7e7e23ac30 R11: 0000000000000000 R12:
0000000000000000
[434725.555469] R13: ffff8b7e4b29c000 R14: ffff8b7e4b40c018 R15:
ffff8b7e4af62950
[434725.562975] ? _raw_spin_unlock_irq+0x1f/0x30
[434725.567571] usb_hcd_submit_urb+0x294/0xbf0
[434725.572017] ? lockdep_init_map_type+0x47/0x220
[434725.576831] usb_start_wait_urb+0x65/0x160
[434725.581213] usb_control_msg+0xdf/0x140
[434725.585302] hub_event+0x8f4/0x1890
[434725.589087] ? lock_acquire+0x93/0x150
[434725.593053] ? process_one_work+0x1bc/0x4b0
[434725.597494] process_one_work+0x24d/0x4b0
[434725.601776] worker_thread+0x55/0x3c0
[434725.605657] ? rescuer_thread+0x390/0x390
[434725.609904] kthread+0x137/0x150
[434725.613342] ? __kthread_bind_mask+0x60/0x60
[434725.617862] ret_from_fork+0x22/0x30
[434725.622038] Kernel Offset: 0x5e00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[434725.633177] ---[ end Kernel panic - not syncing: Kernel Watchdog ]---

Any ideas, hints, etc are welcome!