Re: [PATCH v3] usb: warm-reset ports on hub resume, if requested

Jan-Marek Glogowski <glogow@xxxxxxxxxx> · Thu, 31 Jan 2019 19:52:39 +0100

Am 31. Januar 2019 17:53:13 MEZ schrieb Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>:
>This thread has been hard to follow.  I'll try to answer some of your
>concerns below...
>
>On Thu, 31 Jan 2019, Jan-Marek Glogowski wrote:
>
>> Am 31.01.19 um 15:56 schrieb Mathias Nyman:
>> > On 31.01.2019 12:42, Jan-Marek Glogowski wrote:
>> >> Am 30.01.19 um 17:56 schrieb Jan-Marek Glogowski:
>> >>> Am 30.01.19 um 15:58 schrieb Mathias Nyman:
>> >>>> On 30.01.2019 14:38, Jan-Marek Glogowski wrote:
>> >>>>
>> >>>> The hub->event bits are set when roothub polling timer function
>gets data from
>> >>>> hub_status_data(), writes the data to a status urb, and returns
>the urb.
>
>You have been concentrating on what happens with root hubs 
>(understandable, since that's where your problem occurs).  But in order
>
>to understand the code, you have to realize that it is meant to apply 
>to _all_ hubs, both root and external.
>
>> >>>> Normal operation:
>> >>>> usb_hcd_poll_rh_status()        // roothub polling timer
>function
>> >>>>    hcd->driver->hub_status_data(hcd, buffer)   ->
>xhci_hub_status_data()
>> >>>>      xhci_hub_status_data()      //returns buffer with bits set
>if a port needs attention
>> >>>>    if (buffer && urb)
>> >>>>      memcpy(urb->transfer_buffer, buffer)       // copy buffer
>bits to urb
>> >>>>      usb_hcd_giveback_urb(urb)  -> hub_irq      // calls urb
>complete callback function
>> >>>>        hub_irq()
>> >>>>          hub->event_bits = urb->transfer_buffer //
>> >>>>    mod_timer(usb_hcd_poll_rh_status, future)    // reschedule
>roothub polling timer
>> >>>>
>> >>>> I'm guessing usb_hcd_poll_rh_status() never properly completes a
>URB in your case.
>
>It's not just a question of a timer expiring.  If we were dealing with 
>an external hub, the code would have to wait until an actual USB 
>transfer had completed.  There's no way to avoid that wait.
>
>> >>>> This could be possible if roothub is polled before properly
>resumed, then the URB wouldn't
>> >>>> be available yet, and no hub->event_bits are set.
>> >>>> Because no bits are set the hub starts suspending again after
>resume, probably before roothub
>> >>>> timer polls roothub again.
>> >>>> suspend fails when it manually checks
>hcd->driver->hub_status_data() for activity, and
>> >>>> it start resuming again, and continues this in a loop.
>> >>>>
>> >>>> Either roothub polling is stopped at the suspend attempt, or
>maybe continuously rescheduled
>> >>>> into the future because as a part of suspend attempt/resume
>loop.
>> >>
>> >> So as you expected the status_urb is always NULL.
>> >>
>> >> I tried to understand more of the code by adding some more debug
>message.
>> >>
>> >> I looked at usb_hcd_resume_root_hub, which is called when we see
>the "resume root hub" dmesg.
>> >> This just queues the hcd->wakeup_work, so indeed, if
>usb_hcd_poll_rh_status is later called,
>> >> the hub might not have resumed yet and polling starts. So my
>simple approach was to add a
>> >> function to call flush_work(&hcd->wakeup_work) before
>usb_hcd_poll_rh_status. All variants I
>> >> tested killed the kernel. I also tried to add a udelay to no
>avail.
>> > 
>> > I think the idea is that if usb_hcd_poll_rh_status() finds a port
>that needs attention, but
>> > the is URB missing, then a HCD_FLAG_POLL_PENDING flag is set.
>> > 
>> > This flag is checked in resume when submitting the hub URB, and
>idea is to react on it:
>> > 
>> > hub_activate(HUB_RESUME)
>> >     usb_submit_urb(hub->urb)
>> >      rh_urb_enqueue()
>> >        rh_queue_status()
>> >           hcd->status_urb = urb;
>> >           if (HCD_POLL_PENDING(hcd)
>> >              mod_timer(&hcd->rh_timer, jiffies);
>> > 
>> > It tries to trigger usb_hcd_poll_rh_status() immediately by
>modifying the timer,
>> > but from your logs it seems that it's not fast enough:
>> > 
>> > [  485.301806] usbcore:hub_resume: hub 2-0:1.0: hub_resume
>> > [  485.301819] xhci_hcd:xhci_hub_control: xhci_hcd 0000:00:14.0:
>get port status, actual port 0 status  = 0x2a0
>> >  ...
>> > [  485.301936] usbcore:hub_event: hub 2-0:1.0: state 7 ports 6 chg
>0000 evt 0000
>> > [  485.301952] xhci_hcd:xhci_hub_control: xhci_hcd 0000:00:14.0:
>set port remote wake mask, actual port 0 status  = 0xe0002a0
>> > ...
>> > [  485.302034] usbcore:hub_suspend: hub 2-0:1.0: hub_suspend
>> > [  485.302048] usbcore:hcd_bus_suspend: usb usb2: bus auto-suspend,
>wakeup 1
>> > [  485.302067] usbcore:hcd_bus_suspend: usb usb2: suspend raced
>with wakeup event
>> > [  485.302071] usbcore:hcd_bus_resume: usb usb2: usb auto-resume
>> > [  485.309765] usb usb2: usb_hcd_poll_rh_status, length 1,
>hcd->flags 0x18d, status_urb NULL
>> > 
>> > hcd->flags 0x18d means HCD_FLAG_POLL_PENDING  was set, but
>hub_event() workqueue and hub_suspend()
>> > are run before polling the root hub, and I guess the URB gets
>removed at some point during suspsend.
>> > 
>> > This then goes on in a loop.
>> > 
>> > You could try running usb_hcd_poll_rh_status() direcly instead of
>just kicking the timer and see if
>> > it works:
>> > 
>> > @@ -822,8 +826,11 @@ static int rh_queue_status (struct usb_hcd
>*hcd, struct urb *urb)
>> >                 mod_timer(&hcd->rh_timer, (jiffies/(HZ/4) + 1) *
>(HZ/4));
>> >  
>> >         /* If a status change has already occurred, report it ASAP
>*/
>> > -       else if (HCD_POLL_PENDING(hcd))
>> > -               mod_timer(&hcd->rh_timer, jiffies);
>> > +       else if (HCD_POLL_PENDING(hcd)) {
>> > +               spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
>> > +               usb_hcd_poll_rh_status(hcd);
>> > +               return 0;
>> > +       }
>> >         retval = 0;
>> >   done:
>> >         spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
>> > 
>> > 
>> > I didn't check locking or running contexts if it's ok to do this,
>there's could be a reason
>> > why it wasn't done like this in the first place.
>> > Worth a shot still.
>
>It's not a good idea.  Deferring the operation to a timer keeps things
>decoupled (running in different contexts with different stacks); doing
>a direct call could get into a loop of nested calls that would keep
>going deeper and deeper until it exhausted the kernel stack.
>
>Perhaps you didn't notice that at the end, hub_activate() calls
>kick_hub_wq().  That routine calls
>usb_autopm_get_interface_no_resume(), which will prevent the hub trim
>suspending until hub_event() calls usb_autopm_put_interface().  
>Therefore to make things work correctly, hub_activate() has to ensure
>that the hub->event_bits and hub->change_bits fields are set correctly,
>so that hub_event() will do the right thing when it runs (before the 
>hub is allowed to go back into suspend).
>
>Therefore if a port is in a funny state, hub_activate() should detect
>this and set one of these bit fields appropriately.  That's one reason
>why it contains all those tests for portstatus and portchange.  If a
>missing test needs to be added, that's what you should do.

That would be like the v3 of my patch, which started this thread and detected and handled the warm-reset request of the port in hub_activate() on resume. Maybe we can set some of these bits instead, or its already handled by the hub_port_reset() call; have to check.

I also included a patch in some previous mail after Mathias Nyman commented about the NULL status_urb, which added usb_hcd_poll_rh_status directly after the root hubs usb_remote_wakeup call.

A lot of stuff is deferred via workqueue or polled. As I understand it - in my case - resume, suspend and status race and miss each other in some endless loop.

As things are, I just have the new desktop HWs and a single usb-c device to test the ports of them. I have no idea what would happen with an additional non-root hub, but I can probably organize one, if I have to test it.

Jan-Marek