On Tue, Oct 18, 2022, Jeff Vanhoof wrote:
> Hi Thinh,
>
> On Tue, Oct 18, 2022 at 10:35:30PM +0000, Thinh Nguyen wrote:
> > On Tue, Oct 18, 2022, Jeffrey Vanhoof wrote:
> > > Hi Thinh,
> > >
> > > On Tue, Oct 18, 2022 at 06:45:40PM +0000, Thinh Nguyen wrote:
> > > > Hi Dan,
> > > >
> > > > On Mon, Oct 17, 2022, Dan Vacura wrote:
> > > > > Hi Thinh,
> > > > >
> > > > > On Mon, Oct 17, 2022 at 09:30:38PM +0000, Thinh Nguyen wrote:
> > > > > > On Mon, Oct 17, 2022, Dan Vacura wrote:
> > > > > > > From: Jeff Vanhoof <qjv001@xxxxxxxxxxxx>
> > > > > > >
> > > > > > > arm-smmu related crashes seen after a Missed ISOC interrupt when
> > > > > > > no_interrupt=1 is used. This can happen if the hardware is still using
> > > > > > > the data associated with a TRB after the usb_request's ->complete call
> > > > > > > has been made. Instead of immediately releasing a request when a Missed
> > > > > > > ISOC interrupt has occurred, this change will add logic to cancel the
> > > > > > > request instead where it will eventually be released when the
> > > > > > > END_TRANSFER command has completed. This logic is similar to some of the
> > > > > > > cleanup done in dwc3_gadget_ep_dequeue.
> > > > > >
> > > > > > This doesn't sound right. How did you determine that the hardware is
> > > > > > still using the data associated with the TRB? Did you check the TRB's
> > > > > > HWO bit?
> > > > >
> > > > > The problem we're seeing was mentioned in the summary of this patch
> > > > > series, issue #1.
> > > > > Basically, with the following patch
> > > > > https://patchwork.kernel.org/project/linux-usb/patch/20210628155311.16762-6-m.grzeschik@xxxxxxxxxxxxxx/
> > > > > integrated a smmu panic is occurring on our Android device with the 5.15
> > > > > kernel which is:
> > > > >
> > > > > <3>[ 718.314900][ T803] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault from a600000.dwc3!
> > > > >
> > > > > The uvc gadget driver appears to be the first (and only) gadget that
> > > > > uses the no_interrupt=1 logic, so this seems to be a new condition for
> > > > > the dwc3 driver. In our configuration, we have up to 64 requests and the
> > > > > no_interrupt=1 for up to 15 requests. The list size of dep->started_list
> > > > > would get up to that amount when looping through to clean up the
> > > > > completed requests. From testing and debugging the smmu panic occurs
> > > > > when a -EXDEV status shows up and right after
> > > > > dwc3_gadget_ep_cleanup_completed_request() was visited. The conclusion
> > > > > we had was the requests were getting returned to the gadget too early.
> > > >
> > > > As I mentioned, if the status is updated to missed isoc, that means that
> > > > the controller returned ownership of the TRB to the driver. At least for
> > > > the particular request with -EXDEV, its TRBs are completed. I'm not
> > > > clear on your conclusion.
> > > >
> > > > Do we know where the crash occurred? Is it from dwc3 driver or from uvc
> > > > driver, and at what line? It'd be great if we can see the driver log.
> > > >
> > >
> > > To interject, what should happen in dwc3_gadget_ep_reclaim_completed_trb if the
> > > IOC bit is not set (but the IMI bit is) and -EXDEV status is passed into it?
> >
> > Hm... we may have overlooked this case for the no_interrupt scenario.
> > If IMI is set, then there will be an interrupt when there's missed isoc
> > regardless of whether no_interrupt is set by the gadget driver.
> >
> > > If the function returns 0, another attempt to reclaim may occur. If this
> > > happens and the next request did have the HWO bit set, the function would
> > > return 1 but dwc3_gadget_ep_cleanup_completed_request would still call
> > > dwc3_gadget_giveback.
> > >
> > > As a test (without this patch), I added a check to see if HWO bit was set in
> > > dwc3_gadget_ep_cleanup_completed_requests(). If the usecase was ISOC and the
> > > HWO bit was set I avoided calling dwc3_gadget_ep_cleanup_completed_request().
> > > This seemed to also avoid the iommu related crash being seen.
> > >
> > > Is there an issue in this area that needs to be corrected instead? Not having
> > > interrupts set for each request may be causing some new issues to be uncovered.
> > >
> > > As far as the crash seen without this patch, no good stacktrace is given. Line
> > > provided for crash varied a bit, but tended to appear towards the end of
> > > dwc3_stop_active_transfer() or dwc3_gadget_endpoint_trbs_complete().
> > >
> > > Since dwc3_gadget_endpoint_trbs_complete() can be called from multiple
> > > locations, I duplicated the function to help identify which path it was likely
> > > being called from. At the time of the crashes seen,
> > > dwc3_gadget_endpoint_transfer_in_progress() appeared to be the caller.
> > >
> > > dwc3_gadget_endpoint_transfer_in_progress()
> > >   ->dwc3_gadget_endpoint_trbs_complete() (crashed towards end of here)
> > >     ->dwc3_stop_active_transfer() (sometimes crashed towards end of here)
> > >
> > > I hope this clarifies things a bit.
> > >
> >
> > Can we try this? Let me know if it resolves your issue.
> >
> > diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
> > index 61fba2b7389b..8352f4b5dd9f 100644
> > --- a/drivers/usb/dwc3/gadget.c
> > +++ b/drivers/usb/dwc3/gadget.c
> > @@ -3657,6 +3657,10 @@ static int dwc3_gadget_ep_reclaim_completed_trb(struct dwc3_ep *dep,
> >  	if (event->status & DEPEVT_STATUS_SHORT && !chain)
> >  		return 1;
> >
> > +	if (usb_endpoint_xfer_isoc(dep->endpoint.desc) &&
> > +	    (event->status & DEPEVT_STATUS_MISSED_ISOC) && !chain)
> > +		return 1;
> > +
> >  	if ((trb->ctrl & DWC3_TRB_CTRL_IOC) ||
> >  	    (trb->ctrl & DWC3_TRB_CTRL_LST))
> >  		return 1;
> >
>
> With this change it doesn't seem to crash but unfortunately the output
> completely hangs after the first missed isoc. At the moment I do not understand
> why this might happen.
>

Can you capture the driver tracepoints with the change above?

>
> Note that I haven't quite learned how to reply correctly to the mailing
> list. I apologize for messing up the thread a bit.
>

Seems fine to me. As long as I can read and understand, I've no issue. :)

Thanks,
Thinh