Re: DWC3 Isoc Gadget Missed Interrupts on Start Transfer

Thinh Nguyen <Thinh.Nguyen@xxxxxxxxxxxx> · Fri, 4 Mar 2022 23:54:56 +0000

Michael Grzeschik wrote:
> On Fri, Mar 04, 2022 at 01:43:00AM +0000, Thinh Nguyen wrote:
>> Michael Grzeschik wrote:
>>> On Wed, Mar 02, 2022 at 10:22:55PM +0000, Thinh Nguyen wrote:
>>>> Michael Grzeschik wrote:
>>>>> I am currently debugging a strange error with transferring isoc
>>>>> requests
>>>>> with
>>>>> the current mainline driver.
>>>>>
>>>>> When the pending and started list is empty the complete interrupt
>>>>> handler is
>>>>> sending an End Transfer to the controller. This End Transfer will
>>>>> then get
>>>>> successfully completed.
>>>>>
>>>>>     irq/50-dwc3-305     [000] d..1.   166.661845: dwc3_gadget_ep_cmd:
>>>>> ep5in: cmd 'End Transfer' [b0d08] params 00000000 00000000 00000000
>>>>> -->
>>>>> status: Successful
>>>>>     irq/50-dwc3-305     [000] d..1.   166.661877: dwc3_event: event
>>>>> (080b01d6): ep5in: Endpoint Command Complete
>>>>>     irq/50-dwc3-305     [000] d..1.   166.661979: dwc3_event: event
>>>>> (482a10d6): ep5in: Transfer Not Ready [0000482a] (Not Active)
>>>>>
>>>>>
>>>>> After this the Transfer needs to start again. But unfortunately this
>>>>> sometimes
>>>>> does not succeed. You see in the following trace that the Start of
>>>>> the just
>>>>> prepared trbs did complete with an -18.
>>>>>
>>>>>    kworker/u5:2-347     [001] d..2.   166.691288: dwc3_prepare_trb:
>>>>> ep5in: trb 000000007ff95a12 (E233:D232) buf 0000ffffffb1c2a8 size
>>>>> 1x 12
>>>>> ctrl 00000465:00000000 (HlCs:Sc:isoc-first)
>>>>>    kworker/u5:2-347     [001] d..2.   166.691293: dwc3_prepare_trb:
>>>>> ep5in: trb 00000000f9cd175e (E234:D232) buf 0000ffffffb1d000 size 1x
>>>>> 7156 ctrl 00000471:00000000 (Hlcs:Sc:isoc)
>>>
>>> [snip]
>>>
>>>>>
>>>>> The intermediate Bus Expiry is not always present in that case, but
>>>>> also
>>>>> completely acceptable as we loop
>>>>> over the Start Transfer until this succeeds.
>>>>>
>>>>> Do you have any Idea why this succeeded Start Streaming does come back
>>>>> with an
>>>>> Missed Transfer on the first trb?
>>>>>
>>>>
>>>> Hi Michael,
>>>>
>>>> When sending the trace, please leave it unfiltered. It's missing some
>>>> trace events that I want to check.
>>>
>>> Right, it was a mix of trace_event:dwc3:* and ftrace_filter=dwc*. So I
>>> kind of felt filtering out the interesting part would help.
>>>
>>> Anyway. Here is the same situation in another trace, with all
>>> trace_events from dwc3:
>>>
>>
>> <snip>
>>
>>>
>>>> Please check if the host is actually polling for data at that uframe.
>>>> Each TRB is scheduled for a particular interval. If the host doesn't
>>>> request for data at that particular interval, then it's a missed isoc
>>>> also (err -18). So, if you have a usb traffic analyzer, please verify
>>>> this. Also, maybe try with a different host vendor to see if the
>>>> behavior is the same.
>>>
>>> The behavior seems to differ if I attach different hosts. I have no
>>> traffic analyzer. Would usbmon under linux host work for that? How would
>>
>> usbmon doesn't see usb packet level. So no, usbmon doesn't work.
>>
>>> I synchronize the gadget request and host urbs to know which frame
>>> interval would have been the one that was missed in my logs?
>>
>> It will be difficult. You'd have to check the expected interval the urb
>> was scheduled for from xhci driver, and you may have to check the
>> MFINDEX. You'd also need to determine a point of reference that the data
>> received from device is for a particular interval. Even if you have all
>> this info, you'd have to assume everything the host controller reported
>> is accurate and not missing anything.
>>
>>>
>>> Is there any way the gadget driver could see if for that interval the
>>> host did not poll for data? OR are there bits to get this behaviour
>>> predictable and the driver to react on?
>>>
>>
>> You can check if there any "missed service error" from host, it may give
>> some clues, but not the whole picture.
>>
>>> Otherwise, why should on an ongoing stream the host not poll
>>> for data in any interval. What would be the constraints to
>>> ignore one interval.
>>
>> The host driver may not intentionally does it. There are many reasons
>> where the host controller not poll for data, and I've seen it often. It
>> can be because of the driver handling interrupt latency/delay that
>> prevents the driver from feeding isoc TD/TRBs to the controller in time,
>> causing underflow and starve the controller. Some other time it's the
>> controller that decided the bandwidth is not sufficient for isoc
>> transfer or phy signal instability and cause a missed service error etc.
>>
>> Anyway, in this particular case, from my experience, I don't think it's
>> the host that's missing a poll. It's probably because the device
>> scheduling it too aggressively early. Can you try to schedule on the 2nd
>> interval instead and see if it helps in your case.
>>
>> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
>> index 3e75eaa13abc..b9ec7626b080 100644
>> --- a/drivers/usb/dwc3/gadget.c
>> +++ b/drivers/usb/dwc3/gadget.c
>> @@ -2104,7 +2104,7 @@ static int __dwc3_gadget_start_isoc(struct dwc3_ep
>> *dep)
>>        }
>>
>>        for (i = 0; i < DWC3_ISOC_MAX_RETRIES; i++) {
>> -               dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 1);
>> +               dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 2);
>>
>>                ret = __dwc3_gadget_kick_transfer(dep);
>>                if (ret != -EAGAIN)
>>
>>
> 
> I think I found the case:
> 
> From my datasheet:
> 
> <quote>
>  After a transfer has been started, the hardware will perform the
> following functions for
>  IN endpoints:
> 
>  1. Fetch TX data as early as one interval prior to the beginning of the
> interval (call it A)
>     if the HWO bit is set to one in the TRB.
>  2. Decrement the buffer size of each TRB as packets are transmitted.
>  3. Retire TRBs when their buffer size has reached 0, issuing an
> XferInProgress event if
>     the IOC bit is set.
>  4. If the next interval (B) starts before all the packets have been
> transmitted for interval A:
>     a. Flush the TXFIFO.
>     b. Retire the Buffer Descriptor of interval A with a "Missed
> Isochronous" status.
>     c. Retire the Buffer Descriptor of interval B with a "Missed
> Isochronous" status.
>     d. See Checking interval status for a description of how software
> can determine that
>        an interval ended unexpectedly.
>     e. Go to step 1 to prepare for interval C
>  5. Otherwise, if all the TRBs of interval A are completed, the hardware
> will prepare for
>     interval B.
> </quote>
> 
> Could it be that in the observed case we just run directly into case 4
> after issueing Stream Start. Since the frame interval we programm
> is so close to the current running frame interval.

Yes, that's what the code snippet I sent to confirm the case was for.
Typically, the Start Transfer command status "bus expiry" will tell the
driver of this case, but that may not be enough.

> 
> So actually applying the below patch would totally make sense. We could
> even
> increase that to an higher number, just to make sure that we would not
> run into
> an partially started frame interval when issueing the Start Transfer.
> 

See my other email to make a patch.

Thanks,
Thinh