Re: Control message failures kill entire XHCI stack

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Thu, 05 Mar 2015 17:25:20 +0200

On 04.03.2015 15:27, Alistair Grant wrote:
> Hi Mathias,
> 
> 
> On Tue, Mar 3, 2015 at 8:40 PM, Alistair Grant <akgrant0710@xxxxxxxxx> wrote:
>> Hi Mathias,
>>
>> On Tue, Mar 3, 2015 at 2:21 PM, Mathias Nyman
>> <mathias.nyman@xxxxxxxxxxxxxxx> wrote:
>>> On 28.02.2015 09:16, Alistair Grant wrote:
>>>> ...
>>>> * 3.19.0 with the following patches:
>>>> * xhci: Allocate correct amount of scratchpad buffers
>>>> * xhci: Don't touch TRBs memory if those are no longer on the endpoint ring
>>>> * xhci: fix invalid pointer in reset device debugging
>>>> * xhci: add debugging for reset device and stop endpoint commands
>>>> * xhci: add command ring stop and restart debug messages
>>>>
>>>
>>> Does increasing the TRB count per segment help?
>>
>> Success!
>>
>> Increasing TRBS_PER_SEGMENT from 64 to 256 allowed me to successfully
>> record two 30 second segments of video, i.e. start recording with
>> mythffmpeg, Ctrl-C after 30 seconds, then repeat (this is on top of the
>> patched kernel I reported in my last message).
>>
>> This obviously is good news, it is also better than I typically saw using
>> the ehci driver, as often the second attempt would fail with a "Device or
>> Resource Busy" message (of course a single test is hardly conclusive, and
>> it may still appear).
>>
>> It's getting a bit late here, so hopefully tomorrow I'll try recording for
>> a longer period of time to make sure that succeeds as well.
>>
>> Included below is the syslog from the time I plugged the Live2 in to
>> unplugging it after recording.  There are three types of messages which
>> don't look completely normal to me:
> 
> I was able to record video for 1 hour today, and then stop and start
> recording another 3 times - just a few seconds each, this was more
> about ensuring it could stop and start multiple times.
> 
> I assume that this is a workaround, and that the core problem of ring
> expansion & cancelled URBs is still to be resolved.  Let me know if
> you would like that tested when it is ready.
> 

Hi Alistair,

Yes, this is a workaround.

The latest theory for the cause is that we fill up the event ring. This
would be possible because we pick events from the event ring only on interrupt.

Isoc transfers are set to interrupt only on the last TD. With several isoc
transfers going on simultaneously, and especially with isoc transfers that
contain so many TDs that we need to expand the transfer ring (hence the ring
expansion before the failure in the log), we fill up the event ring and never
receive the stop endpoint event -> timeout -> kill HC.

Increasing TRBS_PER_SEGMENT helps because it also increases the size of the
event ring.
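
To make the theory a bit more concrete, here is a toy model in plain C. It is
not driver code, and every name in it is hypothetical. It assumes that one
event is posted per completed TD, that the ring is drained only when an
interrupt fires, and that TDs of the in-flight URBs complete round-robin, so
all the "last TD" interrupts arrive at the end. The URB and TD counts used
with it are made-up numbers, not values taken from the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy event ring: software consumes events only from the interrupt handler. */
struct toy_event_ring {
	size_t capacity;	/* usable event slots (assumed) */
	size_t pending;		/* events posted but not yet consumed */
	bool overflowed;	/* an event had nowhere to go */
};

/* Post one event; if the ring is full the event and its interrupt are lost. */
static void toy_post_event(struct toy_event_ring *ring, bool interrupt)
{
	if (ring->pending == ring->capacity) {
		ring->overflowed = true;
		return;
	}
	ring->pending++;
	if (interrupt)
		ring->pending = 0;	/* handler drains the whole ring */
}

/*
 * Run num_urbs isoc URBs of tds_per_urb TDs each, completing round-robin.
 * Returns true if the event ring overflowed.
 */
static bool toy_isoc_overflows(size_t ring_capacity, size_t num_urbs,
			       size_t tds_per_urb, bool irq_every_td)
{
	struct toy_event_ring ring = { ring_capacity, 0, false };
	size_t td, urb;

	assert(ring_capacity > 0 && tds_per_urb > 0);

	for (td = 0; td < tds_per_urb; td++) {
		for (urb = 0; urb < num_urbs; urb++) {
			/* Default: interrupt only on the last TD of a URB. */
			bool last_td = (td == tds_per_urb - 1);

			toy_post_event(&ring, irq_every_td || last_td);
		}
	}
	return ring.overflowed;
}
```

In this model, 4 URBs of 32 TDs each overflow a 64-entry ring, while the same
load fits in a 256-entry ring, and it also fits in the 64-entry ring once
every TD raises an interrupt.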

If you have time, could you try forcing interrupts on every isoc TRB with the
following change:

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 151484e..dfad305 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -539,6 +539,8 @@ int xhci_init(struct usb_hcd *hcd)
                xhci_dbg_trace(xhci, trace_xhci_dbg_init,
                                "xHCI doesn't need link TRB QUIRK");
        }
+       xhci->quirks |= XHCI_AVOID_BEI;
+
        retval = xhci_mem_init(xhci, GFP_KERNEL);
        xhci_dbg_trace(xhci, trace_xhci_dbg_init, "Finished xhci_init

Please apply it with the old TRBS_PER_SEGMENT size (64) and see if it helps.
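
For reference, a rough sketch of what the quirk is expected to change when an
isoc TD is queued. This is a standalone illustration of my understanding, not
a copy of the code in xhci-ring.c: the function name, the TOY_ constants and
their bit values, and the exact version check are assumptions, so check the
driver source before relying on any of them.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values for this sketch only; see xhci.h for the real ones. */
#define TOY_TRB_IOC		(1u << 5)	/* interrupt on completion */
#define TOY_TRB_BEI		(1u << 9)	/* block event interrupt */
#define TOY_QUIRK_AVOID_BEI	(1u << 15)

/*
 * Flags for the last TRB of isoc TD number td_index out of num_tds.
 * Every TD still generates an event (IOC); BEI only suppresses the
 * interrupt for that event.
 */
static uint32_t toy_isoc_td_flags(uint16_t hci_version, uint32_t quirks,
				  unsigned int td_index, unsigned int num_tds)
{
	uint32_t field = TOY_TRB_IOC;

	assert(num_tds > 0 && td_index < num_tds);

	/* Without the quirk, block the interrupt on all but the last TD. */
	if (hci_version == 0x100 && !(quirks & TOY_QUIRK_AVOID_BEI) &&
	    td_index < num_tds - 1)
		field |= TOY_TRB_BEI;

	return field;
}
```

With the quirk set, no TD gets BEI, so each completed TD raises an interrupt
and the event ring is drained much more often.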

Thanks

-Mathias