Re: [REGRESSION][BISECTED] 5.5-rc suspend/resume failure caused by patch a4f55d8b8c14 ("usb: hub: Check device descriptor before resuscitation")

On Mon, 20 Jan 2020 13:52:15 -0700 Paul Zimmerman <pauldzim@xxxxxxxxx> wrote:

> On Mon, 20 Jan 2020 10:23:11 -0500 (EST) Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
>   
> > On Sun, 19 Jan 2020, Paul Zimmerman wrote:
> >     
> > > I reported this regression last week (see
> > > https://lore.kernel.org/linux-usb/20200115153714.03d5b3aa@EliteBook/T/#u)
> > > but I got no response to my email. Today I have retested with
> > > 5.5-rc7 and verified that the problem still exists. So I am
> > > resending with a different subject line to see if anyone responds.
> > > 
> > > The $subject patch causes a regression on my HP EliteBook laptop
> > > with a built-in USB bluetooth adapter. About 50% of the time, a
> > > suspend/resume cycle will cause the bluetooth adapter to stop
> > > working.
> > > 
> > > The dmesg log below shows two suspend/resume cycles. At time
> > > 63.928 you can see the bluetooth adapter being successfully
> > > resumed, and at time 140.969 you can see it fail. After reverting
> > > the patch, the bluetooth adapter resumes 100% of the time.
> > > 
> > > I also included below a lsusb -v of the bluetooth adapter. Is
> > > there any other debugging info you'd like me to send?      
> > 
> > It looks like your dmesg log was made without enabling debugging 
> > messages in usbcore.  Can you collect another log with debugging 
> > messages turned on?
> > 
> > 	echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
> > 
> > Also, it might not hurt to collect and post a usbmon trace for a bad
> > suspend-resume cycle.    
> 
> Hi Alan,
> 
> Thanks for responding. The new dmesg log and the usbmon trace are
> below. The dmesg shows a good suspend/resume followed by a bad one.
> The bluetooth device is usb 2-3.2 I believe. The usbmon trace is only
> for the failed suspend/resume case.  

I did some more debugging on this using ftrace; below is an annotated
log that shows what I think is happening on a failed resume. Note that
hub_port_connect_change() is the function that the patch modified to
call usb_get_device_descriptor() in some circumstances.
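
For reference, a filtered function_graph trace like the one below can be
set up roughly as follows (a sketch; on some setups tracefs is mounted at
/sys/kernel/debug/tracing instead):

        cd /sys/kernel/tracing
        echo 0 > tracing_on
        # trace only the functions of interest
        echo hub_port_connect_change usb_get_device_descriptor usb_probe_interface \
             hci_power_on hci_dev_open hci_dev_do_open btusb_setup_intel > set_ftrace_filter
        echo function_graph > current_tracer
        echo 1 > tracing_on
        # ...run the suspend/resume cycle, then save the log:
        cat trace > ftrace.log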

This first call to hub_port_connect_change() is for a different device,
not the failing one. There are multiple other calls to that function in
the trace, but they are not for the device in question:
     kworker/1:6-19987 [001]  7803.175058: funcgraph_entry:                      |  hub_port_connect_change() {
     kworker/1:6-19987 [001]  7803.175069: funcgraph_entry:      ! 11911.151 us  |    usb_get_device_descriptor();
     kworker/2:5-20675 [002]  7803.179333: funcgraph_entry:                      |  usb_probe_interface() {

Starting here is a worker thread that is trying to set up the bluetooth
adapter after the resume:
   kworker/u17:1-3175  [002]  7803.179457: funcgraph_entry:                      |  hci_power_on() {
   kworker/u17:1-3175  [002]  7803.179458: funcgraph_entry:                      |    hci_dev_do_open() {
   kworker/u17:1-3175  [002]  7803.179468: funcgraph_entry:                      |      btusb_setup_intel() {
     kworker/2:5-20675 [002]  7803.179480: funcgraph_exit:       ! 147.596 us    |  }

And here is a worker thread that is handling the connect change on the USB
port with the bluetooth device. This happens while the btusb_setup_intel()
function called by the other thread above is still running:
     kworker/3:1-16790 [003]  7803.181323: funcgraph_entry:                      |  hub_port_connect_change() {
     kworker/3:1-16790 [003]  7803.181330: funcgraph_entry:                      |    usb_get_device_descriptor() {
     kworker/0:2-19962 [000]  7803.184885: funcgraph_entry:        6.016 us      |  hub_port_connect_change();
     kworker/0:2-19962 [000]  7803.187208: funcgraph_entry:        5.462 us      |  hub_port_connect_change();
     kworker/1:6-19987 [001]  7803.187835: funcgraph_exit:       ! 12776.810 us  |  }
     kworker/1:6-19987 [001]  7803.187846: funcgraph_entry:      ! 767445.344 us |  hub_port_connect_change();

By this point the failure has happened ("Bluetooth: hci0: Reading Intel
version information failed (-110)"), and it looks like somebody queues
another work item to try setting up the bluetooth adapter again:
       hciconfig-21074 [000]  7803.193549: funcgraph_entry:                      |  hci_dev_open() {
     kworker/1:6-19987 [001]  7803.955315: funcgraph_entry:      ! 144039.307 us |  hub_port_connect_change();
   kworker/u17:1-3175  [003]  7805.203325: funcgraph_exit:       ! 2023855 us    |      }
   kworker/u17:1-3175  [003]  7805.211728: funcgraph_exit:       ! 2032269 us    |    }
   kworker/u17:1-3175  [003]  7805.211731: funcgraph_exit:       ! 2032273 us    |  }
       hciconfig-21074 [000]  7805.211760: funcgraph_entry:                      |    hci_dev_do_open() {
       hciconfig-21074 [000]  7805.211777: funcgraph_entry:      ! 3069605 us    |      btusb_setup_intel();

And here the usb_get_device_descriptor() call made by the USB worker thread
finally completes. It also fails ("kworker/3:1 timed out on ep0in len=0/18"):
     kworker/3:1-16790 [003]  7808.277224: funcgraph_exit:       ! 5095893 us    |    }
       hciconfig-21074 [001]  7808.283313: funcgraph_exit:       ! 3071554 us    |    }
       hciconfig-21074 [001]  7808.283315: funcgraph_exit:       ! 5089767 us    |  }

So if I'm understanding this correctly, there are two threads trying to
access the USB bluetooth device at the same time. I have no idea whether
that is how it's supposed to work.

Thanks,
Paul


