Re: DWC3 Isoc Gadget Missed Interrupts on Start Transfer

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Mar 04, 2022 at 11:48:20PM +0000, Thinh Nguyen wrote:
Michael Grzeschik wrote:
On Fri, Mar 04, 2022 at 01:43:00AM +0000, Thinh Nguyen wrote:
Michael Grzeschik wrote:
On Wed, Mar 02, 2022 at 10:22:55PM +0000, Thinh Nguyen wrote:
Michael Grzeschik wrote:
I am currently debugging a strange error with transferring isoc
requests
with
the current mainline driver.

When the pending and started list is empty the complete interrupt
handler is
sending an End Transfer to the controller. This End Transfer will
then get
successfully completed.

    irq/50-dwc3-305     [000] d..1.   166.661845: dwc3_gadget_ep_cmd:
ep5in: cmd 'End Transfer' [b0d08] params 00000000 00000000 00000000
-->
status: Successful
    irq/50-dwc3-305     [000] d..1.   166.661877: dwc3_event: event
(080b01d6): ep5in: Endpoint Command Complete
    irq/50-dwc3-305     [000] d..1.   166.661979: dwc3_event: event
(482a10d6): ep5in: Transfer Not Ready [0000482a] (Not Active)


After this the Transfer needs to start again. But unfortunately this
sometimes
does not succeed. You see in the following trace that the Start of
the just
prepared trbs did complete with an -18.

   kworker/u5:2-347     [001] d..2.   166.691288: dwc3_prepare_trb:
ep5in: trb 000000007ff95a12 (E233:D232) buf 0000ffffffb1c2a8 size
1x 12
ctrl 00000465:00000000 (HlCs:Sc:isoc-first)
   kworker/u5:2-347     [001] d..2.   166.691293: dwc3_prepare_trb:
ep5in: trb 00000000f9cd175e (E234:D232) buf 0000ffffffb1d000 size 1x
7156 ctrl 00000471:00000000 (Hlcs:Sc:isoc)

[snip]


The intermediate Bus Expiry is not always present in that case, but
also
completely acceptable as we loop
over the Start Transfer until this succeeds.

Do you have any Idea why this succeeded Start Streaming does come back
with an
Missed Transfer on the first trb?


Hi Michael,

When sending the trace, please leave it unfiltered. It's missing some
trace events that I want to check.

Right, it was a mix of trace_event:dwc3:* and ftrace_filter=dwc*. So I
kind of felt filtering out the interesting part would help.

Anyway. Here is the same situation in another trace, with all
trace_events from dwc3:


<snip>


Please check if the host is actually polling for data at that uframe.
Each TRB is scheduled for a particular interval. If the host doesn't
request for data at that particular interval, then it's a missed isoc
also (err -18). So, if you have a usb traffic analyzer, please verify
this. Also, maybe try with a different host vendor to see if the
behavior is the same.

The behavior seems to differ if I attach different hosts. I have no
traffic analyzer. Would usbmon under linux host work for that? How would

usbmon doesn't see usb packet level. So no, usbmon doesn't work.

I synchronize the gadget request and host urbs to know which frame
interval would have been the one that was missed in my logs?

It will be difficult. You'd have to check the expected interval the urb
was scheduled for from xhci driver, and you may have to check the
MFINDEX. You'd also need to determine a point of reference that the data
received from device is for a particular interval. Even if you have all
this info, you'd have to assume everything the host controller reported
is accurate and not missing anything.

Right. But as I have no analyzer, this option is out of scope for now.
And as you further describe the likelihood for the host being the cause
is also very low.

I also was looking into another direction and was suspecting the
frame_length not being properly configured.

In my system the ref_clk is 100MHz so the period is 10ns. That is
also what is configured in dts with snps,ref-clock-period-ns = <10>;

The default for xilinx is <50> so 25MHz. It seemed that changing this
had completely no effect on the transfers. I had expected that this
would change at least the stability.

The fladj is also 0 in all cases. Do I miss something here?

The controller turns off some clocks and uses ref_clk to track the current
uframe when it's in low power. If the device is not in low power, then there's
no need for it to use the ref_clk to track uframe. The UVC is running isoc at
1 uframe interval, the device definitely won't go in low power at this rate.

Also, I'm not sure if your setup is configured to take advantage of that feature
either.


I see.


Is there any way the gadget driver could see if for that interval the
host did not poll for data? OR are there bits to get this behaviour
predictable and the driver to react on?


You can check if there any "missed service error" from host, it may give
some clues, but not the whole picture.

Alright, I will check for that option.

Otherwise, why should on an ongoing stream the host not poll
for data in any interval. What would be the constraints to
ignore one interval.

The host driver may not intentionally does it. There are many reasons
where the host controller not poll for data, and I've seen it often. It
can be because of the driver handling interrupt latency/delay that
prevents the driver from feeding isoc TD/TRBs to the controller in time,
causing underflow and starve the controller. Some other time it's the
controller that decided the bandwidth is not sufficient for isoc
transfer or phy signal instability and cause a missed service error etc.

I also think latency is the point here. I had a patch laying around that
was counting the queued requests but only kicked the start transfer
after one defined threshold was reached. This seemed to help with the
missed interrupts. But unfortunately this approach introduced other
issues. For being absolute save I had to put this threshold very high.
But when in rare cases the Frame that was ment to be enqueued was very
small the full amount of data could reach under that threshold and
stall the whole pipeline, since this frame was then never kicked. I
could sprinkle some timeout handling around it. To ensure that after a
minimum time the last queue will definitely start. But this seemed to be
a very hacky approach.

Yeah, that doesn't sound right.


Anyway, in this particular case, from my experience, I don't think it's
the host that's missing a poll. It's probably because the device
scheduling it too aggressively early. Can you try to schedule on the 2nd
interval instead and see if it helps in your case.

diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 3e75eaa13abc..b9ec7626b080 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -2104,7 +2104,7 @@ static int __dwc3_gadget_start_isoc(struct dwc3_ep
*dep)
       }

       for (i = 0; i < DWC3_ISOC_MAX_RETRIES; i++) {
-               dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 1);
+               dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 2);

               ret = __dwc3_gadget_kick_transfer(dep);
               if (ret != -EAGAIN)

I also had this patch applied. I just tested it jet with + 3 instead
and it did stop showing the missed interrupts. Is this a valid patch
for mainline? I had my doubts, so missed to send it yet.


Not by itself like this. I just wanted to test this and confirm my
suspicion, which it did.

For a real patch, it should look something like this:

diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 3e75eaa13abc..fb070a3237a8 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -2104,7 +2104,13 @@ static int __dwc3_gadget_start_isoc(struct dwc3_ep *dep)
       }

       for (i = 0; i < DWC3_ISOC_MAX_RETRIES; i++) {
-               dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 1);
+               int future_interval = i + 1;
+
+               /* Give the controller at least 500us to schedule transfer */
+               if (desc->bInterval < 3)
+                       future_interval += 3 - desc->bInterval;
+
+               dep->frame_number = DWC3_ALIGN_FRAME(dep, future_interval);

               ret = __dwc3_gadget_kick_transfer(dep);
               if (ret != -EAGAIN)


That looks much better! Will you send this?

Also, there's another issue in the dwc3 driver. Currently it checks the
event status and reported it to all the requests. I notice that you (or
the device UVC) don't set interrupt for every request but about every
10th request? Not all of the completed requests failed with missed isoc,
but the interrupt event status incorrectly set -18 for all of them. The
dwc3 driver should check the TRB status for each individual TRB and
return the proper status to the associated request.

Oh! This is where the multiple -18 are coming from. I did not have time
to look for that. But it totally makes sense.


If you're up to creating a patch to fix it, I'll be glad to review it.
If not, it'll be on my to-do list (which may take awhile).

Just did that:

https://lore.kernel.org/linux-usb/20220307214639.4164547-1-m.grzeschik@xxxxxxxxxxxxxx/T/#u

Thanks,
Michael

--
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux