Re: DWC3 Isoc Gadget Missed Interrupts on Start Transfer

Thinh Nguyen <Thinh.Nguyen@xxxxxxxxxxxx> · Fri, 4 Mar 2022 23:48:20 +0000

Michael Grzeschik wrote:
> On Fri, Mar 04, 2022 at 01:43:00AM +0000, Thinh Nguyen wrote:
>> Michael Grzeschik wrote:
>>> On Wed, Mar 02, 2022 at 10:22:55PM +0000, Thinh Nguyen wrote:
>>>> Michael Grzeschik wrote:
>>>>> I am currently debugging a strange error with transferring isoc
>>>>> requests
>>>>> with
>>>>> the current mainline driver.
>>>>>
>>>>> When the pending and started list is empty the complete interrupt
>>>>> handler is
>>>>> sending an End Transfer to the controller. This End Transfer will
>>>>> then get
>>>>> successfully completed.
>>>>>
>>>>>     irq/50-dwc3-305     [000] d..1.   166.661845: dwc3_gadget_ep_cmd:
>>>>> ep5in: cmd 'End Transfer' [b0d08] params 00000000 00000000 00000000
>>>>> -->
>>>>> status: Successful
>>>>>     irq/50-dwc3-305     [000] d..1.   166.661877: dwc3_event: event
>>>>> (080b01d6): ep5in: Endpoint Command Complete
>>>>>     irq/50-dwc3-305     [000] d..1.   166.661979: dwc3_event: event
>>>>> (482a10d6): ep5in: Transfer Not Ready [0000482a] (Not Active)
>>>>>
>>>>>
>>>>> After this the Transfer needs to start again. But unfortunately this
>>>>> sometimes
>>>>> does not succeed. You see in the following trace that the Start of
>>>>> the just
>>>>> prepared trbs did complete with an -18.
>>>>>
>>>>>    kworker/u5:2-347     [001] d..2.   166.691288: dwc3_prepare_trb:
>>>>> ep5in: trb 000000007ff95a12 (E233:D232) buf 0000ffffffb1c2a8 size
>>>>> 1x 12
>>>>> ctrl 00000465:00000000 (HlCs:Sc:isoc-first)
>>>>>    kworker/u5:2-347     [001] d..2.   166.691293: dwc3_prepare_trb:
>>>>> ep5in: trb 00000000f9cd175e (E234:D232) buf 0000ffffffb1d000 size 1x
>>>>> 7156 ctrl 00000471:00000000 (Hlcs:Sc:isoc)
>>>
>>> [snip]
>>>
>>>>>
>>>>> The intermediate Bus Expiry is not always present in that case, but
>>>>> also
>>>>> completely acceptable as we loop
>>>>> over the Start Transfer until this succeeds.
>>>>>
>>>>> Do you have any Idea why this succeeded Start Streaming does come back
>>>>> with an
>>>>> Missed Transfer on the first trb?
>>>>>
>>>>
>>>> Hi Michael,
>>>>
>>>> When sending the trace, please leave it unfiltered. It's missing some
>>>> trace events that I want to check.
>>>
>>> Right, it was a mix of trace_event:dwc3:* and ftrace_filter=dwc*. So I
>>> kind of felt filtering out the interesting part would help.
>>>
>>> Anyway. Here is the same situation in another trace, with all
>>> trace_events from dwc3:
>>>
>>
>> <snip>
>>
>>>
>>>> Please check if the host is actually polling for data at that uframe.
>>>> Each TRB is scheduled for a particular interval. If the host doesn't
>>>> request for data at that particular interval, then it's a missed isoc
>>>> also (err -18). So, if you have a usb traffic analyzer, please verify
>>>> this. Also, maybe try with a different host vendor to see if the
>>>> behavior is the same.
>>>
>>> The behavior seems to differ if I attach different hosts. I have no
>>> traffic analyzer. Would usbmon under linux host work for that? How would
>>
>> usbmon doesn't see usb packet level. So no, usbmon doesn't work.
>>
>>> I synchronize the gadget request and host urbs to know which frame
>>> interval would have been the one that was missed in my logs?
>>
>> It will be difficult. You'd have to check the expected interval the urb
>> was scheduled for from xhci driver, and you may have to check the
>> MFINDEX. You'd also need to determine a point of reference that the data
>> received from device is for a particular interval. Even if you have all
>> this info, you'd have to assume everything the host controller reported
>> is accurate and not missing anything.
> 
> Right. But as I have no analyzer, this option is out of scope for now.
> And as you further describe the likelihood for the host being the cause
> is also very low.
> 
> I also was looking into another direction and was suspecting the
> frame_length not being properly configured.
> 
> In my system the ref_clk is 100MHz so the period is 10ns. That is
> also what is configured in dts with snps,ref-clock-period-ns = <10>;
> 
> The default for xilinx is <50> so 25MHz. It seemed that changing this
> had completely no effect on the transfers. I had expected that this
> would change at least the stability.
> 
> The fladj is also 0 in all cases. Do I miss something here?

The controller turns off some clocks and uses ref_clk to track the current
uframe when it's in low power. If the device is not in low power, then there's
no need for it to use the ref_clk to track uframe. The UVC is running isoc at
1 uframe interval, the device definitely won't go in low power at this rate.

Also, I'm not sure if your setup is configured to take advantage of that feature
either.

> 
>>> Is there any way the gadget driver could see if for that interval the
>>> host did not poll for data? OR are there bits to get this behaviour
>>> predictable and the driver to react on?
>>>
>>
>> You can check if there any "missed service error" from host, it may give
>> some clues, but not the whole picture.
> 
> Alright, I will check for that option.
> 
>>> Otherwise, why should on an ongoing stream the host not poll
>>> for data in any interval. What would be the constraints to
>>> ignore one interval.
>>
>> The host driver may not intentionally does it. There are many reasons
>> where the host controller not poll for data, and I've seen it often. It
>> can be because of the driver handling interrupt latency/delay that
>> prevents the driver from feeding isoc TD/TRBs to the controller in time,
>> causing underflow and starve the controller. Some other time it's the
>> controller that decided the bandwidth is not sufficient for isoc
>> transfer or phy signal instability and cause a missed service error etc.
> 
> I also think latency is the point here. I had a patch laying around that
> was counting the queued requests but only kicked the start transfer
> after one defined threshold was reached. This seemed to help with the
> missed interrupts. But unfortunately this approach introduced other
> issues. For being absolute save I had to put this threshold very high.
> But when in rare cases the Frame that was ment to be enqueued was very
> small the full amount of data could reach under that threshold and
> stall the whole pipeline, since this frame was then never kicked. I
> could sprinkle some timeout handling around it. To ensure that after a
> minimum time the last queue will definitely start. But this seemed to be
> a very hacky approach.

Yeah, that doesn't sound right.

> 
>> Anyway, in this particular case, from my experience, I don't think it's
>> the host that's missing a poll. It's probably because the device
>> scheduling it too aggressively early. Can you try to schedule on the 2nd
>> interval instead and see if it helps in your case.
>>
>> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
>> index 3e75eaa13abc..b9ec7626b080 100644
>> --- a/drivers/usb/dwc3/gadget.c
>> +++ b/drivers/usb/dwc3/gadget.c
>> @@ -2104,7 +2104,7 @@ static int __dwc3_gadget_start_isoc(struct dwc3_ep
>> *dep)
>>        }
>>
>>        for (i = 0; i < DWC3_ISOC_MAX_RETRIES; i++) {
>> -               dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 1);
>> +               dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 2);
>>
>>                ret = __dwc3_gadget_kick_transfer(dep);
>>                if (ret != -EAGAIN)
> 
> I also had this patch applied. I just tested it jet with + 3 instead
> and it did stop showing the missed interrupts. Is this a valid patch
> for mainline? I had my doubts, so missed to send it yet.
> 

Not by itself like this. I just wanted to test this and confirm my
suspicion, which it did.

For a real patch, it should look something like this:

diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 3e75eaa13abc..fb070a3237a8 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -2104,7 +2104,13 @@ static int __dwc3_gadget_start_isoc(struct dwc3_ep *dep)
        }
 
        for (i = 0; i < DWC3_ISOC_MAX_RETRIES; i++) {
-               dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 1);
+               int future_interval = i + 1;
+
+               /* Give the controller at least 500us to schedule transfer */
+               if (desc->bInterval < 3)
+                       future_interval += 3 - desc->bInterval;
+
+               dep->frame_number = DWC3_ALIGN_FRAME(dep, future_interval);
 
                ret = __dwc3_gadget_kick_transfer(dep);
                if (ret != -EAGAIN)


>> Also, there's another issue in the dwc3 driver. Currently it checks the
>> event status and reported it to all the requests. I notice that you (or
>> the device UVC) don't set interrupt for every request but about every
>> 10th request? Not all of the completed requests failed with missed isoc,
>> but the interrupt event status incorrectly set -18 for all of them. The
>> dwc3 driver should check the TRB status for each individual TRB and
>> return the proper status to the associated request.
> 
> Oh! This is where the multiple -18 are coming from. I did not have time
> to look for that. But it totally makes sense.
> 

If you're up to creating a patch to fix it, I'll be glad to review it.
If not, it'll be on my to-do list (which may take awhile).

Thanks,
Thinh