Re: xHCI bug

Felipe Balbi <balbi@xxxxxx> · Mon, 10 Nov 2014 09:24:35 -0600

Hi,

On Fri, Nov 07, 2014 at 03:40:01PM +0200, Mathias Nyman wrote:
> On 07.11.2014 00:25, Felipe Balbi wrote:
> > On Thu, Nov 06, 2014 at 10:36:30AM -0600, Felipe Balbi wrote:
> >> On Thu, Nov 06, 2014 at 06:31:20PM +0200, Mathias Nyman wrote:
> >>> On 05.11.2014 21:28, Felipe Balbi wrote:
> >>>> Hi,
> >>>>
> >>>> On Tue, Oct 14, 2014 at 04:34:00PM +0300, Mathias Nyman wrote:
> >>>>>>>> Could you try with xhci debugging enabled? (will probably produce a
> >>>>>>>> lot of output)
> >>>>>>>>
> >>>>>>>> echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
> >>>>>>>
> >>>>>>> I'll try, sure.
> >>>>>>
> >>>>>> I used tracing otherwise the problem wouldn't show up. Attached you can
> >>>>>> find output:
> >>>>>>
> >>>>>> 0b7e070de7b65de9f70805f4639b3e58  xhci-timeout-testusb.txt.gz
> >>>>>>
> >>>>>
> >>>>> Thanks, looks like we end up calling cleanup_halted_endpoint()  a lot.
> >>>>> This will (try to) reset the endpoint and move to handle the next TD (URB).
> >>>>>
> >>>>> This is called when we're processing contorl transfers and something out of the ordinary happends (returned STALL, BABBLE, and some other reasons)
> >>>>>
> >>>>> I need to dig a bit deeper to know what actually is going on. 
> >>>>
> >>>> any news here ? It's been almost a month.
> >>>>
> >>>
> >>> While looking at this and other bugs I found races between reset endpoint, reset device, and set dequeue pointer commands. 
> >>> I suspect the loop in your logs is due to starting the endpoint ring too early after reset. It restarts before we move
> >>> past the problematic TD, and start executing it again.
> >>>
> >>> The logs don't show why the TD fails in the first place, but I got another patch fixing other race issues which might help.
> >>>
> >>> Both patches are now in a "reset-rework" topic branch at:
> >>>
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git reset-rework
> >>>
> >>> Its based on 3.18-rc2.
> >>> I haven't still got or set up a usb device with gadget zero to test it out myself
> >>
> >> I'll try to run it today or tomorrow.
> > 
> > seems to be working so far. It has been running for at least a couple
> > hours. I'll leave it running until Monday or Tuesday before giving you a
> > Tested-by, though.
> > 
> 
> Thanks, much appreciated. 
> Sounds promising so far, hope it lasts over the weekend

Alright, it has been running for almost 4 days and failures so far:

[1]+ ./test.sh &
# uptime 
 15:20:15 up 3 days, 20:08,  1 user,  load average: 1.63, 1.84, 1.86

So, for both commits on reset-rework (see below), you can have my:

Tested-by: Felipe Balbi <balbi@xxxxxx>

8<---------------------------------------------------------------------------

commit d0a6a31173ccc49625c3e6e9e8a86e4a6e885ea6
Author: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx>
Date:   Thu Nov 6 18:06:09 2014 +0200

    xhci: don't start a halted endpoint before its new dequeue is set

    A halted endpoint ring must first be reset, then move the ring
    dequeue pointer past the problematic TRB. If we start the ring too
    early after reset, but before moving the dequeue pointer we
    will end up executing the same problematic TRB again.

    As we always issue a set tranfer dequeue command after a reset
    endpoint command we can skip starting endpoint rings at reset endpoint
    command completion.

    Signed-off-by: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx>

commit 117e8a66b86651111a78a1eef1344c05901b1d94
Author: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx>
Date:   Tue Nov 4 19:00:47 2014 +0200

    xhci: Don't set freed transfer dq pointers after device reset

    If an endpoint halts, or we want to cancel a URB, we need to modify
    TRBs on the endpoint ring. Before editing the trbs we need to stop
    the ring, and after editing, then restart the ring from a working place.
    Usually the next TD after the failed or cancelled one.

    If a reset device command is queued in the middle of this it will
    disable all endpoints and and free the rings. The "Set transfer dequeue"
    command queued after device reset will try to set invalid pointers.

    To avoid this we check if a unhandled device reset is queued before
    queuing a "set transfer dequeue" command, and make sure we clear
    endpoint state and stopped_td pointers after device reset.

    Also give back all the cancelled URBs from the old ring after
    device reset.

    Signed-off-by: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx>

-- 
balbi
Attachment:
signature.asc

Description: Digital signature