On Fri, Mar 04, 2022 at 11:48:20PM +0000, Thinh Nguyen wrote:
Michael Grzeschik wrote:On Fri, Mar 04, 2022 at 01:43:00AM +0000, Thinh Nguyen wrote:Michael Grzeschik wrote:On Wed, Mar 02, 2022 at 10:22:55PM +0000, Thinh Nguyen wrote:Michael Grzeschik wrote:I am currently debugging a strange error with transferring isoc requests with the current mainline driver. When the pending and started list is empty the complete interrupt handler is sending an End Transfer to the controller. This End Transfer will then get successfully completed. irq/50-dwc3-305 [000] d..1. 166.661845: dwc3_gadget_ep_cmd: ep5in: cmd 'End Transfer' [b0d08] params 00000000 00000000 00000000 --> status: Successful irq/50-dwc3-305 [000] d..1. 166.661877: dwc3_event: event (080b01d6): ep5in: Endpoint Command Complete irq/50-dwc3-305 [000] d..1. 166.661979: dwc3_event: event (482a10d6): ep5in: Transfer Not Ready [0000482a] (Not Active) After this the Transfer needs to start again. But unfortunately this sometimes does not succeed. You see in the following trace that the Start of the just prepared trbs did complete with an -18. kworker/u5:2-347 [001] d..2. 166.691288: dwc3_prepare_trb: ep5in: trb 000000007ff95a12 (E233:D232) buf 0000ffffffb1c2a8 size 1x 12 ctrl 00000465:00000000 (HlCs:Sc:isoc-first) kworker/u5:2-347 [001] d..2. 166.691293: dwc3_prepare_trb: ep5in: trb 00000000f9cd175e (E234:D232) buf 0000ffffffb1d000 size 1x 7156 ctrl 00000471:00000000 (Hlcs:Sc:isoc)[snip]The intermediate Bus Expiry is not always present in that case, but also completely acceptable as we loop over the Start Transfer until this succeeds. Do you have any Idea why this succeeded Start Streaming does come back with an Missed Transfer on the first trb?Hi Michael, When sending the trace, please leave it unfiltered. It's missing some trace events that I want to check.Right, it was a mix of trace_event:dwc3:* and ftrace_filter=dwc*. So I kind of felt filtering out the interesting part would help. Anyway. Here is the same situation in another trace, with all trace_events from dwc3:<snip>Please check if the host is actually polling for data at that uframe. Each TRB is scheduled for a particular interval. If the host doesn't request for data at that particular interval, then it's a missed isoc also (err -18). So, if you have a usb traffic analyzer, please verify this. Also, maybe try with a different host vendor to see if the behavior is the same.The behavior seems to differ if I attach different hosts. I have no traffic analyzer. Would usbmon under linux host work for that? How wouldusbmon doesn't see usb packet level. So no, usbmon doesn't work.I synchronize the gadget request and host urbs to know which frame interval would have been the one that was missed in my logs?It will be difficult. You'd have to check the expected interval the urb was scheduled for from xhci driver, and you may have to check the MFINDEX. You'd also need to determine a point of reference that the data received from device is for a particular interval. Even if you have all this info, you'd have to assume everything the host controller reported is accurate and not missing anything.Right. But as I have no analyzer, this option is out of scope for now. And as you further describe the likelihood for the host being the cause is also very low. I also was looking into another direction and was suspecting the frame_length not being properly configured. In my system the ref_clk is 100MHz so the period is 10ns. That is also what is configured in dts with snps,ref-clock-period-ns = <10>; The default for xilinx is <50> so 25MHz. It seemed that changing this had completely no effect on the transfers. I had expected that this would change at least the stability. The fladj is also 0 in all cases. Do I miss something here?The controller turns off some clocks and uses ref_clk to track the current uframe when it's in low power. If the device is not in low power, then there's no need for it to use the ref_clk to track uframe. The UVC is running isoc at 1 uframe interval, the device definitely won't go in low power at this rate. Also, I'm not sure if your setup is configured to take advantage of that feature either.
I see.
Is there any way the gadget driver could see if for that interval the host did not poll for data? OR are there bits to get this behaviour predictable and the driver to react on?You can check if there any "missed service error" from host, it may give some clues, but not the whole picture.Alright, I will check for that option.Otherwise, why should on an ongoing stream the host not poll for data in any interval. What would be the constraints to ignore one interval.The host driver may not intentionally does it. There are many reasons where the host controller not poll for data, and I've seen it often. It can be because of the driver handling interrupt latency/delay that prevents the driver from feeding isoc TD/TRBs to the controller in time, causing underflow and starve the controller. Some other time it's the controller that decided the bandwidth is not sufficient for isoc transfer or phy signal instability and cause a missed service error etc.I also think latency is the point here. I had a patch laying around that was counting the queued requests but only kicked the start transfer after one defined threshold was reached. This seemed to help with the missed interrupts. But unfortunately this approach introduced other issues. For being absolute save I had to put this threshold very high. But when in rare cases the Frame that was ment to be enqueued was very small the full amount of data could reach under that threshold and stall the whole pipeline, since this frame was then never kicked. I could sprinkle some timeout handling around it. To ensure that after a minimum time the last queue will definitely start. But this seemed to be a very hacky approach.Yeah, that doesn't sound right.Anyway, in this particular case, from my experience, I don't think it's the host that's missing a poll. It's probably because the device scheduling it too aggressively early. Can you try to schedule on the 2nd interval instead and see if it helps in your case. diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 3e75eaa13abc..b9ec7626b080 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -2104,7 +2104,7 @@ static int __dwc3_gadget_start_isoc(struct dwc3_ep *dep) } for (i = 0; i < DWC3_ISOC_MAX_RETRIES; i++) { - dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 1); + dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 2); ret = __dwc3_gadget_kick_transfer(dep); if (ret != -EAGAIN)I also had this patch applied. I just tested it jet with + 3 instead and it did stop showing the missed interrupts. Is this a valid patch for mainline? I had my doubts, so missed to send it yet.Not by itself like this. I just wanted to test this and confirm my suspicion, which it did. For a real patch, it should look something like this: diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 3e75eaa13abc..fb070a3237a8 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -2104,7 +2104,13 @@ static int __dwc3_gadget_start_isoc(struct dwc3_ep *dep) } for (i = 0; i < DWC3_ISOC_MAX_RETRIES; i++) { - dep->frame_number = DWC3_ALIGN_FRAME(dep, i + 1); + int future_interval = i + 1; + + /* Give the controller at least 500us to schedule transfer */ + if (desc->bInterval < 3) + future_interval += 3 - desc->bInterval; + + dep->frame_number = DWC3_ALIGN_FRAME(dep, future_interval); ret = __dwc3_gadget_kick_transfer(dep); if (ret != -EAGAIN)
That looks much better! Will you send this?
Also, there's another issue in the dwc3 driver. Currently it checks the event status and reported it to all the requests. I notice that you (or the device UVC) don't set interrupt for every request but about every 10th request? Not all of the completed requests failed with missed isoc, but the interrupt event status incorrectly set -18 for all of them. The dwc3 driver should check the TRB status for each individual TRB and return the proper status to the associated request.Oh! This is where the multiple -18 are coming from. I did not have time to look for that. But it totally makes sense.If you're up to creating a patch to fix it, I'll be glad to review it. If not, it'll be on my to-do list (which may take awhile).
Just did that: https://lore.kernel.org/linux-usb/20220307214639.4164547-1-m.grzeschik@xxxxxxxxxxxxxx/T/#u Thanks, Michael -- Pengutronix e.K. | | Steuerwalder Str. 21 | http://www.pengutronix.de/ | 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Attachment:
signature.asc
Description: PGP signature