Re: 4.14 dwc3 gadget mode panic

Vincent Pelletier <plr.vincent@xxxxxxxxx> · Sat, 25 Nov 2017 16:39:52 +0000

On Fri, 24 Nov 2017 14:04:35 +0000, Vincent Pelletier
<plr.vincent@xxxxxxxxx> wrote:
> What seem to be the relevant pieces are:
> - at least one AIO transfers submitted for reading from EP2OUT
> - upon receiving data from stdin, a synchronous write happens on EP2IN,
>   which blocks if host did not submit a transfer (normal)
> - SIGQUIT to interrupt the write while it's blocking

I could simplify the issue requirements: it is not required to use
synchronous IO. Submitting AIO and killing functionfs-interfaced
process is sufficient. This time, I kill with signal 15, which makes
the process exit without (in my understanding) returning control to
python (no sighandler set).

For some reason, I could now get kgdb to start with modified module
loaded, and inspect the status a bit more.

To my surprise, the error symptom do not seem to change:

(gdb) lx-dmesg
[...]
[  382.505547] BUG: scheduling while atomic: swapper/1/0/0x00000100
[  382.511767] 3 locks held by swapper/1/0:
[  382.515903]  #0:  (rcu_callback){....}, at: [<c10b4ff0>] rcu_process_callbacks+0x260/0x440
[  382.524572]  #1:  (rcu_read_lock_sched){....}, at: [<c1358ba0>] percpu_ref_switch_to_atomic_rcu+0xb0/0x130
[  382.534650]  #2:  (&(&ctx->ctx_lock)->rlock){....}, at: [<c11f0c73>] free_ioctx_users+0x23/0xd0
[  382.534699] Modules linked in: usb_f_fs libcomposite configfs bnep btsdio bluetooth ecdh_generic brcmfmac brcmutil dwc3 ulpi intel_powerclamp udc_core coretemp kvm_intel kvm irqbypass crc32_pclmul crc32c_intel pcbc dwc3_pci aesni_intel aes_i586 crypto_simd cryptd ehci_pci ehci_hcd basincove_gpadc gpio_keys industrialio usbcore usb_common
[  382.534945] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.0-edison+ #117
[  382.534955] Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542 2015.01.21:18.19.48
[  382.534963] Call Trace:
[  382.534976]  <SOFTIRQ>
[  382.535004]  dump_stack+0x47/0x62
[  382.535034]  __schedule_bug+0x61/0x80
[  382.535057]  __schedule+0x43/0x7a0
[  382.535116]  schedule+0x5f/0x70
[  382.535155]  dwc3_gadget_ep_dequeue+0x14c/0x270 [dwc3]
[  382.535196]  ? do_wait_intr_irq+0x70/0x70
[  382.535248]  usb_ep_dequeue+0x19/0x90 [udc_core]
[  382.535292]  ffs_aio_cancel+0x22/0x30 [usb_f_fs]
[  382.535312]  kiocb_cancel+0x31/0x40
[  382.535333]  free_ioctx_users+0x4d/0xd0
[  382.535367]  percpu_ref_switch_to_atomic_rcu+0x10a/0x130
[  382.535389]  ? percpu_ref_exit+0x40/0x40
[  382.535410]  rcu_process_callbacks+0x2b3/0x440
[  382.535460]  ? __softirqentry_text_start+0x8/0x8
[  382.535476]  __do_softirq+0xf8/0x26b
[  382.535517]  ? __softirqentry_text_start+0x8/0x8
[  382.535538]  do_softirq_own_stack+0x22/0x30
[  382.535547]  </SOFTIRQ>
[  382.535573]  irq_exit+0x45/0xc0
[  382.535591]  smp_apic_timer_interrupt+0x13c/0x150
[  382.535618]  apic_timer_interrupt+0x35/0x3c
[  382.535637] EIP: cpuidle_enter_state+0x26b/0x330
[  382.535647] EFLAGS: 00200202 CPU: 1
[  382.535657] EAX: 00000000 EBX: c1bfd680 ECX: 00000002 EDX: 00000000
[  382.535668] ESI: 00016e40 EDI: 00000059 EBP: f6911f60 ESP: f6911f2c
[  382.535679]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[  382.535792]  cpuidle_enter+0x14/0x20
[  382.535815]  call_cpuidle+0x35/0x40
[  382.535836]  do_idle+0x114/0x180
[  382.535862]  cpu_startup_entry+0x25/0x30
[  382.535884]  start_secondary+0x191/0x1a0
[  382.535915]  startup_32_smp+0x164/0x166
[  382.536327] BUG: unable to handle kernel NULL pointer dereference at   (null)
[  382.536351] IP: 0xf6909f43
[  382.536357] *pde = 00000000 
[  382.536372] Oops: 0000 [#1] SMP
[  393.811964] WARNING: stack recursion on stack type 3

I tried to confirm I'm running the modified kernel:
(gdb) disassemble ffs_aio_cancel
Dump of assembler code for function ffs_aio_cancel:
   0xf819c6f0 <+0>:     lea    %ds:0x0(%esi,%eiz,1),%esi
   0xf819c6f5 <+5>:     mov    0x10(%eax),%edx
   0xf819c6f8 <+8>:     test   %edx,%edx
   0xf819c6fa <+10>:    je     0xf819c718 <ffs_aio_cancel+40>
   0xf819c6fc <+12>:    mov    0x4c(%edx),%eax
   0xf819c6ff <+15>:    test   %eax,%eax
   0xf819c701 <+17>:    je     0xf819c718 <ffs_aio_cancel+40>
   0xf819c703 <+19>:    mov    0x50(%edx),%edx
   0xf819c706 <+22>:    test   %edx,%edx
   0xf819c708 <+24>:    je     0xf819c718 <ffs_aio_cancel+40>
   0xf819c70a <+26>:    push   %ebp
   0xf819c70b <+27>:    mov    %esp,%ebp
   0xf819c70d <+29>:    call   0xf80ba220 <usb_ep_dequeue>
   0xf819c712 <+34>:    pop    %ebp
   0xf819c713 <+35>:    ret
   0xf819c714 <+36>:    lea    0x0(%esi,%eiz,1),%esi
   0xf819c718 <+40>:    mov    $0xffffffea,%eax
   0xf819c71d <+45>:    ret
End of assembler dump.

Although I do not speak x86 fluently, I would expect several more
instructions or calls if IRQs were being disabled, so I believe I am
running the modified version.

I was surprised to see a high-pid process about dwc3:
(gdb) info threads
[...]
  105  Thread 1852 (htop) 0x00000000 in ?? () <- from where I killed python (signal 15)
  106  Thread 1879 (strace) 0x00000000 in ?? () <- strace python [...]
  107  Thread 1881 (python) 0x00000000 in ?? () <- python, interfacing with functionfs
  108  Thread 1882 (irq/34-dwc3) 0x00000000 in ?? ()

So I grabbed the backtraces of both the dwc3 thread and python:

(gdb) t 108
[Switching to thread 108 (Thread 1882)]
#0  0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xf41cd340 in ?? ()
#2  0xc18474af in schedule () at kernel/sched/core.c:3425
#3  0xc10a99c5 in irq_wait_for_interrupt (action=<optimized out>) at kernel/irq/manage.c:768
#4  irq_thread (data=0x0) at kernel/irq/manage.c:972
#5  0xc106d5e1 in kthread (_create=0xf4f52c00) at kernel/kthread.c:231
#6  0xc184cc41 in ret_from_fork () at arch/x86/entry/entry_32.S:293

Normal IRQ handler, I guess. Maybe it only has a higher pid because
gadget became necessarily active after python script opened endpoints,
and handler would only be forked when it becomes needed ?

(gdb) t 107
[Switching to thread 107 (Thread 1881)]
#0  0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xf568d600 in ?? ()
#2  0xc18474af in schedule () at kernel/sched/core.c:3425
#3  0xc184b5ca in schedule_timeout (timeout=<optimized out>) at kernel/time/timer.c:1720
#4  0xc184848b in do_wait_for_common (state=<optimized out>, timeout=2147483647, action=<optimized out>, x=<optimized out>) at kernel/sched/completion.c:91
#5  __wait_for_common (state=<optimized out>, timeout=<optimized out>, action=<optimized out>, x=<optimized out>) at kernel/sched/completion.c:112
#6  wait_for_common (state=<optimized out>, timeout=<optimized out>, x=<optimized out>) at kernel/sched/completion.c:123
#7  wait_for_completion (x=0xf4e25df8) at kernel/sched/completion.c:144
#8  0xc11f0e8a in exit_aio (mm=0xf568d600) at fs/aio.c:903
#9  0xc104e8a8 in __mmput (mm=<optimized out>) at kernel/fork.c:925
#10 0xc1055da2 in exit_mm () at kernel/exit.c:544
#11 do_exit (code=<optimized out>) at kernel/exit.c:852
#12 0xc10564c7 in do_group_exit (exit_code=15) at kernel/exit.c:968
#13 0xc1060463 in get_signal (ksig=<optimized out>) at kernel/signal.c:2334
#14 0xc101888e in do_signal (regs=0xf4e25fb4) at arch/x86/kernel/signal.c:809
#15 0xc1000fd3 in exit_to_usermode_loop (regs=0xf4e25fb4, cached_flags=7) at arch/x86/entry/common.c:158
#16 0xc10018c1 in prepare_exit_to_usermode (regs=<optimized out>) at arch/x86/entry/common.c:197
#17 syscall_return_slowpath (regs=<optimized out>) at arch/x86/entry/common.c:266
#18 do_syscall_32_irqs_on (regs=<optimized out>) at arch/x86/entry/common.c:335
#19 do_fast_syscall_32 (regs=0xf4e25fb4) at arch/x86/entry/common.c:391
#20 0xc184ccb3 in entry_SYSENTER_32 () at arch/x86/entry/entry_32.S:444
#21 0x00000008 in ?? ()
#22 0x01cad558 in ?? ()
#23 0x000003ff in tlv_data_is_valid (hdev=<optimized out>, adv_flags=4294967295, data=0x0, len=<error reading variable: Cannot access memory at address 0xbf80543c>, is_adv_data=8)
    at net/bluetooth/mgmt.c:6092
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

I'll try to continue debugging tomorrow, but pointers are very welcome.

Regards,
-- 
Vincent Pelletier
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html