Re: More dwc3 gadget issues with adb

John Stultz <john.stultz@xxxxxxxxxx> · Wed, 22 Apr 2020 11:32:40 -0700

On Tue, Apr 21, 2020 at 11:00 PM Felipe Balbi <balbi@xxxxxxxxxx> wrote:
> John Stultz <john.stultz@xxxxxxxxxx> writes:
> > On Thu, Apr 16, 2020 at 1:19 AM Felipe Balbi <balbi@xxxxxxxxxx> wrote:
> >> One thing I noticed is that we're missing a giveback on ep1out. Here's a
> >> working case:
> >>
> >
> > Hey Felipe,
> >   So I found some time to dig around on this today and I started
> > trying to understand this issue that you've pointed out about missing
> > the giveback.
> >
> > It seems part of the issue is we get to a point where we have some req
> > where pending_sgs is more than one.
> >
> > We call dwc3_prepare_one_trb_sg() on it:
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n1068
> >
> > And we process the sg list incrementing req->num_queued_sgs for each one.
> >
> > then later, dwc3_gadget_ep_cleanup_completed_request() is called on the request:
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2522
> >
> > We call dwc3_gadget_ep_reclaim_trb_sg()
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470
> >
> > Where we iterate over the req->sg, ideally decrementing
> > num_pending_sgs each time and return.
> >
> > But back in dwc3_gadget_ep_cleanup_completed_request()  and there
> > we're hitting the:
> >   if (!dwc3_gadget_ep_request_completed(req) ||
> >       req->num_pending_sgs) {
> > case which causes us to skip the call to dwc3_gadget_giveback().
>
> This logic was modified recently. Can you check if today's linus/master
> works for you?

I was testing this against 5.7-rc2, but I updated to linus/master and
I'm not seeing any change.

> > Looking as to why the num_pending_sgs is non zero, that's because in
> > dwc3_gadget_ep_reclaim_trb_sg we're hitting the case where the trb has
> > the DWC3_TRB_CTRL_HWO ctrl flag set, which breaks us out of the loop
> > early before we decrement num_pending_sgs.
> >
> > For that trb, we're setting the HWO flag in __dwc3_prepare_one_trb()
> > (called from dwc3_prepare_one_trb_sg() back at the beginning):
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n921
> >
> > I added logic showing every time we set or clear that flag, and it
> > seems like we're always setting it but never clearing it. And often
>
> HW clears HWO flag. We only have to manually clear in one or two cases.

I guess the bit that worries me with the current code is in
dwc3_gadget_ep_reclaim_completed_trb():
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2406

There is the logic:
  if (chain && (trb->ctrl & DWC3_TRB_CTRL_HWO))
           trb->ctrl &= ~DWC3_TRB_CTRL_HWO;

But that will never trip, as the only time we call
dwc3_gadget_ep_reclaim_completed_trb() with chain==true is from
dwc3_gadget_ep_reclaim_trb_sg():
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/dwc3/gadget.c?h=v5.7-rc2#n2470

Where if the trb->ctrl has the HWO flag set, we break out before
calling  dwc3_gadget_ep_reclaim_completed_trb().

So the logic quoted above seems to be effectively dead code (despite
the big comment as to why we need it)

That's why I suspected the HWO check in
dwc3_gadget_ep_reclaim_trb_sg() is problematic.

> > that's not an issue as we only have one sg entry. But if its set on a
> > trb in a request with multiple sgs, that's where it seems to be
> > causing the issue.
>
> The issue is completing with HWO set, which should never happen. Can you
> collect tracepoints with linus/master of this particular situation?

Will do, though again, its a little tough as often we hit the stall
state pretty quickly at bootup, before I can even try to enable
tracing, so it may take a few tries.

> >> One interesting thing is that TRB addresses are "odd". I can't find a
> >> proper lifetime for these TRBs. Do you have IOMMU enabled? Can you run
> >> without it? For example, nowhere in the log can I find the place where
> >> trb 0000000092deef41 was first enqueue. I'm assuming the log to be
> >> ordered, which means that trb is the same as 00000000f3db4076. But why
> >> are the addresses different?
> >>
> >> Another weird thing is that even though we CHN bit being set in
> >> 0000000092deef41, we don't see where the second trb (the one its chained
> >> to) was prepared. It seems like it was *never* prepared, what gives?
> >
> > I suspect these bits were due to the tracing happening after some
> > minor amount of initial adb traffic began at bootup? So the trace
> > isn't capturing all the events.
>
> No, no. That's more likely to be IOMMU mucking up the addresses. ADB is
> very sequential in its behavior and USB transfers requests in
> order. Please run linus/master without IOMMU.

So I didn't have any IOMMU support enabled in the config I was testing
with. I went through and added IOMMU options in my config with
linus/master as well and that didn't seem to change the behavior
either.

I'll get back to you with further trace logs.

thanks
-john