Hi, On 25 March 2018 at 12:21, Jonathan Liu <net147@xxxxxxxxx> wrote: > On 8 February 2018 at 14:55, Jeffy Chen <jeffy.chen@xxxxxxxxxxxxxx> wrote: >> From: AMAN DEEP <aman.deep@xxxxxxxxxxx> >> >> There is a race condition between finish_unlinks->finish_urb() function >> and usb_kill_urb() in ohci controller case. The finish_urb calls >> spin_unlock(&ohci->lock) before usb_hcd_giveback_urb() function call, >> then if during this time, usb_kill_urb is called for another endpoint, >> then new ed will be added to ed_rm_list at beginning for unlink, and >> ed_rm_list will point to newly added. >> >> When finish_urb() is completed in finish_unlinks() and ed->td_list >> becomes empty as in below code (in finish_unlinks() function): >> >> if (list_empty(&ed->td_list)) { >> *last = ed->ed_next; >> ed->ed_next = NULL; >> } else if (ohci->rh_state == OHCI_RH_RUNNING) { >> *last = ed->ed_next; >> ed->ed_next = NULL; >> ed_schedule(ohci, ed); >> } >> >> The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next >> and previously added ed by usb_kill_urb will be left unreferenced by >> ed_rm_list. This causes usb_kill_urb() hang forever waiting for >> finish_unlink to remove added ed from ed_rm_list. >> >> The main reason for hang in this race condtion is addition and removal >> of ed from ed_rm_list in the beginning during usb_kill_urb and later >> last* is modified in finish_unlinks(). >> >> As suggested by Alan Stern, the solution for proper handling of >> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing >> any URBs. Then at the end, we can add ed back to the list if necessary. >> >> This properly handle the updated ohci->ed_rm_list in usb_kill_urb(). >> >> Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies") >> Acked-by: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> >> CC: <stable@xxxxxxxxxxxxxxx> >> Signed-off-by: Aman Deep <aman.deep@xxxxxxxxxxx> >> Signed-off-by: Jeffy Chen <jeffy.chen@xxxxxxxxxxxxxx> >> --- >> >> Changes in v6: >> This is a resend of Aman Deep's v5 patch [0], which solved the hang we >> hit [1]. (Thanks Aman :) >> >> The v5 has some format issues, so i slightly adjust the commit message. >> >> [0] https://www.spinics.net/lists/linux-usb/msg129010.html >> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749 >> >> drivers/usb/host/ohci-q.c | 17 ++++++++++------- >> 1 file changed, 10 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c >> index b2ec8c399363..4ccb85a67bb3 100644 >> --- a/drivers/usb/host/ohci-q.c >> +++ b/drivers/usb/host/ohci-q.c >> @@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci) >> * have modified this list. normally it's just prepending >> * entries (which we'd ignore), but paranoia won't hurt. >> */ >> + *last = ed->ed_next; >> + ed->ed_next = NULL; >> modified = 0; >> >> /* unlink urbs as requested, but rescan the list after >> @@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci) >> goto rescan_this; >> >> /* >> - * If no TDs are queued, take ED off the ed_rm_list. >> + * If no TDs are queued, ED is now idle. >> * Otherwise, if the HC is running, reschedule. >> - * If not, leave it on the list for further dequeues. >> + * If the HC isn't running, add ED back to the >> + * start of the list for later processing. >> */ >> if (list_empty(&ed->td_list)) { >> - *last = ed->ed_next; >> - ed->ed_next = NULL; >> ed->state = ED_IDLE; >> list_del(&ed->in_use_list); >> } else if (ohci->rh_state == OHCI_RH_RUNNING) { >> - *last = ed->ed_next; >> - ed->ed_next = NULL; >> ed_schedule(ohci, ed); >> } else { >> - last = &ed->ed_next; >> + ed->ed_next = ohci->ed_rm_list; >> + ohci->ed_rm_list = ed; >> + /* Don't loop on the same ED */ >> + if (last == &ohci->ed_rm_list) >> + last = &ed->ed_next; >> } >> >> if (modified) > > I am experiencing a USB function call hang from userspace with OCHI > (full speed USB device) after updating from Linux 4.14.15 to 4.14.24 > and noticed this commit. > > Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w > and amended with addr2line): > [<c05cbd64>] (__schedule) from [<c05cc148>] (schedule+0x50/0xb4) > kernel/sched/core.c:2792 > [<c05cc148>] (schedule) from [<c042789c>] > (usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59 > [<c042789c>] (usb_kill_urb.part.3) from [<c0433a3c>] > (usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690 > [<c0433a3c>] (usbdev_ioctl) from [<c0217db8>] > (do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835 > [<c0217db8>] (do_vfs_ioctl) from [<c021863c>] (SyS_ioctl+0x34/0x5c) > fs/ioctl.c:47 > [<c021863c>] (SyS_ioctl) from [<c0107820>] (ret_fast_syscall+0x0/0x54) > include/linux/file.h:39 > > Afterwards the kernel is unresponsive to disconnect/connect of the > full speed USB device but I can connect/disconnect a high speed USB > device to the same port and communicate to it without problem since it > uses EHCI (OHCI is companion controller). If I try to connect the full > speed USB device again it is still unresponsive. The userspace > application is still hanging after all this. > > Could this commit be causing the issue? There does seem to be something wrong with this commit. I did some more testing with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime. I connected a sun7i-a20-pcduino3b running Linux 4.15.12 that has OTG port with gadget zero kernel module (g_zero) loaded to the USB port and did some testing with usbtest from tools/usb/testusb.c in kernel repository. I had to do the following change to avoid getting flooded with "usbtest 1-1:3.0: unlink retry" kernel messages after test 11 of testusb starts: --- a/drivers/usb/misc/usbtest.c +++ b/drivers/usb/misc/usbtest.c @@ -1473,7 +1473,7 @@ static int unlink1(struct usbtest_dev *dev, int pipe, int size, int async) * resubmission, but since we're testing unlink * paths, we can't. */ - ERROR(dev, "unlink retry\n"); + /*ERROR(dev, "unlink retry\n");*/ continue; case 0: case -EINPROGRESS: Here are the results. # ./testusb -a /dev/bus/usb/001/002 test 0, 0.006822 secs /dev/bus/usb/001/002 test 1, 0.140112 secs /dev/bus/usb/001/002 test 2, 0.381125 secs /dev/bus/usb/001/002 test 3, 0.133564 secs /dev/bus/usb/001/002 test 4, 0.385498 secs /dev/bus/usb/001/002 test 5, 2.138259 secs /dev/bus/usb/001/002 test 6, 1.755144 secs /dev/bus/usb/001/002 test 7, 2.137387 secs /dev/bus/usb/001/002 test 8, 1.752393 secs /dev/bus/usb/001/002 test 9, 1.132559 secs /dev/bus/usb/001/002 test 10, 3.698934 secs /dev/bus/usb/001/002 test 11, 40.061114 secs dmesg (processed through scripts/decode_stacktrace.sh from kernel repository): [ 21.573242] usbtest 1-1:3.0: TEST 0: NOP [ 21.580315] usbtest 1-1:3.0: TEST 1: write 1024 bytes 1000 times [ 21.732138] usbtest 1-1:3.0: TEST 2: read 1024 bytes 1000 times [ 22.132133] usbtest 1-1:3.0: TEST 3: write/1024 0..1024 bytes 1000 times [ 22.282137] usbtest 1-1:3.0: TEST 4: read/1024 0..1024 bytes 1000 times [ 22.682131] usbtest 1-1:3.0: TEST 5: write 1000 sglists 32 entries of 1024 bytes [ 24.832138] usbtest 1-1:3.0: TEST 6: read 1000 sglists 32 entries of 1024 bytes [ 26.602135] usbtest 1-1:3.0: TEST 7: write/1024 1000 sglists 32 entries 0..1024 bytes [ 28.752136] usbtest 1-1:3.0: TEST 8: read/1024 1000 sglists 32 entries 0..1024 bytes [ 30.522139] usbtest 1-1:3.0: TEST 9: ch9 (subset) control tests, 1000 times [ 31.654957] usbtest 1-1:3.0: TEST 10: queue 32 control calls, 1000 times [ 35.354220] usbtest 1-1:3.0: TEST 11: unlink 1000 reads of 1024 [ 75.414099] usbtest 1-1:3.0: TEST 12: unlink 1000 writes of 1024 [ 128.026194] INFO: rcu_sched self-detected stall on CPU [ 128.031367] 0-....: (1 GPs behind) idle=faa/1/1073741826 softirq=2614/2621 fqs=1048 [ 128.039182] (t=2100 jiffies g=227 c=226 q=1961) [ 128.043887] NMI backtrace for cpu 0 [ 128.047377] CPU: 0 PID: 262 Comm: testusb Tainted: G O 4.16.0-rc6 #1 [ 128.054932] Hardware name: Allwinner sun7i (A20) Family [ 128.060181] (unwind_backtrace) from show_stack (arch/arm/kernel/traps.c:249) [ 128.067921] (show_stack) from dump_stack (lib/dump_stack.c:55) [ 128.075142] (dump_stack) from nmi_cpu_backtrace (lib/nmi_backtrace.c:103) [ 128.083141] (nmi_cpu_backtrace) from nmi_trigger_cpumask_backtrace (lib/nmi_backtrace.c:62) [ 128.092704] (nmi_trigger_cpumask_backtrace) from rcu_dump_cpu_stacks (kernel/rcu/tree.c:1373 (discriminator 2)) [ 128.102351] (rcu_dump_cpu_stacks) from rcu_check_callbacks (kernel/rcu/tree.c:1526 kernel/rcu/tree.c:1592 kernel/rcu/tree.c:3361 kernel/rcu/tree.c:3423 kernel/rcu/tree.c:2763) [ 128.111304] (rcu_check_callbacks) from update_process_times (arch/arm/include/asm/thread_info.h:91 include/asm-generic/preempt.h:11 kernel/time/timer.c:1638) [ 128.120173] (update_process_times) from tick_sched_timer (kernel/time/tick-sched.c:1199) [ 128.128777] (tick_sched_timer) from __hrtimer_run_queues (kernel/time/hrtimer.c:1349 kernel/time/hrtimer.c:1411) [ 128.137553] (__hrtimer_run_queues) from hrtimer_interrupt (kernel/time/hrtimer.c:1472) [ 128.146331] (hrtimer_interrupt) from arch_timer_handler_phys (drivers/clocksource/arm_arch_timer.c:589 drivers/clocksource/arm_arch_timer.c:606) [ 128.155285] (arch_timer_handler_phys) from handle_percpu_devid_irq (kernel/irq/chip.c:909) [ 128.164843] (handle_percpu_devid_irq) from generic_handle_irq (kernel/irq/irqdesc.c:607) [ 128.173880] (generic_handle_irq) from __handle_domain_irq (kernel/irq/irqdesc.c:646) [ 128.182574] (__handle_domain_irq) from gic_handle_irq (arch/arm/include/asm/io.h:118 drivers/irqchip/irq-gic.c:360) [ 128.190920] (gic_handle_irq) from __irq_svc (arch/arm/kernel/entry-armv.S:226) [ 128.198382] Exception stack(0xcda1fc88 to 0xcda1fcd0) [ 128.203431] fc80: c0a95bc8 a0060013 0000000a 0000aa75 cd83b480 a0060013 [ 128.211598] fca0: cf7ee000 ffffff98 00000001 00000000 00000400 cec2b800 00000009 cda1fcd8 [ 128.219763] fcc0: c045b328 c0622890 20060013 ffffffff [ 128.224818] (__irq_svc) from _raw_spin_unlock_irqrestore (kernel/locking/spinlock.c:185) [ 128.233428] (_raw_spin_unlock_irqrestore) from usb_hcd_unlink_urb (drivers/usb/core/hcd.c:1720) [ 128.242816] (usb_hcd_unlink_urb) from unlink1 (drivers/usb/core/hcd.c:864 drivers/usb/core/hcd.c:1684) [ 128.250555] (unlink1) from unlink_simple (drivers/usb/misc/usbtest.c:1510) [ 128.257774] (unlink_simple) from usbtest_do_ioctl (drivers/usb/misc/usbtest.c:2360) [ 128.266033] (usbtest_do_ioctl) from usbtest_ioctl (drivers/usb/misc/usbtest.c:2685) [ 128.274206] (usbtest_ioctl) from usbdev_ioctl (drivers/usb/core/devio.c:2175 drivers/usb/core/devio.c:2197 drivers/usb/core/devio.c:2511 drivers/usb/core/devio.c:2555) [ 128.282200] (usbdev_ioctl) from do_vfs_ioctl (fs/ioctl.c:47 fs/ioctl.c:686) [ 128.289854] (do_vfs_ioctl) from SyS_ioctl (include/linux/file.h:39 fs/ioctl.c:702 fs/ioctl.c:692) [ 128.297159] (SyS_ioctl) from ret_fast_syscall (arch/arm/kernel/entry-common.S:52) [ 128.304716] Exception stack(0xcda1ffa8 to 0xcda1fff0) [ 128.309763] ffa0: 000119c4 00000000 00000003 c00c5512 bebbfb34 bebbfb34 [ 128.317931] ffc0: 000119c4 00000000 000109b0 00000036 00000000 00000000 b6fb0fac bebbfb44 [ 128.326096] ffe0: 00022034 bebbfb1c 00011020 b6ee073c If I revert the commit, I can run "./testusb -a" to completion without any issues with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime. I don't get a kernel stall detected when testing on x86_64 but I do still get at least one "unlink retry" error message from kernel in dmesg which can be reproduced using "./testusb -a" or more frequently with "./testusb -a -t 11 -c 5000". Regards, Jonathan