Re: Mass Storage Gadget Device Falls from SuperSpeed to High Speed

Rob Weber <rob@xxxxxxxxxxx> · Wed, 27 Mar 2019 22:26:44 -0700

Hello,

On Wed, Mar 27, 2019 at 11:02:28AM +0200, Felipe Balbi wrote:
> >> >> > dmesg output is short (4 lines) and is probably helpful to find the
> >> >> > meaningful timestamps of the dwc3 traces. The sequence of events on the
> >> >> > device seems to start around 1346 seconds and ends around 1436 seconds.
> >> >> 
> >> >> right, from device perspective I see things behaving normally, then a
> >> >> "random" reset shows up.
> >> >
> >> > In what way does this show up? Does the host send a reset command, does
> >> > the device request the host port to reset, or does the device appear to
> >> > reset itself?
> >> 
> >> this IRQ is generated based on Reset Signaling driven on the bus by the
> >> host.
> >> 
> >> >> > Please note the "dwc3.1.auto: failed to enable ep0out" errors that are
> >> >> > occurring when we connect our device to a host. These errors occur
> >> >> > pretty regularly and we are not sure what is causing them.
> >> >> 
> >> >> this shouldn't happen, actually; where do you see those?
> >> >
> >> > They are in the dmesg output on the device. I was assuming they line up
> >> 
> >> huh? Here's what I have on device side dmesg:
> >> 
> >> [ 1345.662238] dwc3 dwc3.1.auto: request ffff880177cc4e40 was not queued to ep1out
> >> [ 1346.581055] g_mass_storage gadget: super-speed config #1: Linux File-Backed Storage
> >> [ 1435.982257] dwc3 dwc3.1.auto: request ffff880177cc4d80 was not queued to ep1out
> >> [ 1436.264087] g_mass_storage gadget: high-speed config #1: Linux File-Backed Storage
> >> 
> >> That's the entire thing.
> >> 
> >> > with tracepoints in the dwc traces, but I wasn't sure what to look for.
> >> > This error occurs approximately 75% of the time while the device is
> >> > enumerating.
> >> 
> >> odd. Really shouldn't happen.
> >
> > After digging through my notes, I realized I am confusing us with two
> > different errors my team has seen. Sorry about that. We experience the
> > "request not queued" error part of the time. We've also seen the
> > "failed to enable ep0out" error which was not observed during this
> > particular test case. I'll see if I can re-create it and gather logs.
> 
> Sure thing, thanks
> 
> >> >> > I would appreciate any help or guidance anyone could provide in
> >> >> > understanding more about this issue. We really want to understand why
> >> >> > we're transitioning from superspeed to high speed and whether or not
> >> >> > this issue is caused by software or hardware. I understand it might be
> >> >> > difficult for someone to help with a custom design and an old kernel,
> >> >> > but any pointers for further debugging would be incredibly helpful.
> >> >> > Thank you in advance for taking a look at all this data!
> >> >> 
> >> >> Sure thing, I'll ask Mathias (XHCI maintainer) for his help on this too
> >> >> since it's unclear what's causing the switch to HS when exitting U3.
> >> >
> >> > Thanks in advance for taking a look, Mathias! If there is any more
> >> > information you need from me, please let me know.
> >> 
> >> XHCI tracepoints (together with dynamic printk of xhci) would be
> >> great. Another thing to try is disabling LPM. It could be that your
> >> redriver is one of those in the set of "can't handle LPM" redrivers.
> >> 
> >> That's an easy thing to try. I suppose below is possibly enough:
> >> 
> >> modified   drivers/usb/host/xhci-pci.c
> >> @@ -154,7 +154,7 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
> >>  		xhci->quirks |= XHCI_U2_DISABLE_WAKE;
> >>  
> >>  	if (pdev->vendor == PCI_VENDOR_ID_INTEL) {
> >> -		xhci->quirks |= XHCI_LPM_SUPPORT;
> >> +		/* xhci->quirks |= XHCI_LPM_SUPPORT; */
> >>  		xhci->quirks |= XHCI_INTEL_HOST;
> >>  		xhci->quirks |= XHCI_AVOID_BEI;
> >>  	}
> >> 
> >
> > I compiled a new kernel with these changes applied to the xhci driver
> > and will test this tomorrow. This is my first time building and replacing
> > the kernel on my host system, so this has been quite an experience.
> > Now I have so many questions about the workflow of a kernel developer,
> > but that is worth a whole new thread. Anyways...
> 
> Just ask away :-) If you want to ask about this in private, no problems
> either.
> 
> > Is it possible to disable LPM support in the gadget? I was taking a look
> > in drivers/usb/gadget/composite.c and found some references to LPM in
> > the population of the ss_cap struct and the bmAttributes property. I
> > wasn't sure of the cleanest way to disable it, so I would appreciate a
> > pointer on where to make that change. It is generally faster for me to
> > iterate on the target device's kernel.
> 
> I'd have to verify with specification again, IIRC U1/U2 is mandatory for
> a USB SS device, but it's not mandatory for the host. Perhaps you could,
> for testing purposes, just stall U1 Enable and U2 Enable requests, but I
> don't know if that would work. Here's a patch for testing:
> 
> modified   drivers/usb/dwc3/ep0.c
> @@ -456,9 +456,11 @@ static int dwc3_ep0_handle_device(struct dwc3 *dwc,
>  	 * default control pipe
>  	 */
>  	case USB_DEVICE_U1_ENABLE:
> +		return -EINVAL;
>  		ret = dwc3_ep0_handle_u1(dwc, state, set);
>  		break;
>  	case USB_DEVICE_U2_ENABLE:
> +		return -EINVAL;
>  		ret = dwc3_ep0_handle_u2(dwc, state, set);
>  		break;
>  	case USB_DEVICE_LTM_ENABLE:
> 
> >> Felipe Balbi <felipe.balbi@xxxxxxxxxxxxxxx> writes:
> >> >>> Sure, that would be great. How can I download it? Any link which I can
> >> >>> curl? You can send it to me only, if you want; no problem.
> >> >>
> >> >> You (and anyone reading) can download it here:
> >> >> https://drive.google.com/file/d/1WSJ-222bguXTsRZ-mI5u2M7IP9FsZdj7/view?usp=sharing
> >> >
> >> > I'll download it and have a look, thanks
> >> 
> >> Alright, the LeCroy traces made everything a lot more clear. So here's
> >> what happens. A few assumptions first:
> >> 
> >> R is the right port where the Host is connected
> >> L is the left port where the device is connected
> >> 
> >> 
> >> We can start from LeCroy timestamp 61.271965376, that correlates with
> >> DWC3 tracepoint 1422.092840, iow:
> >> 
> >>      irq/23-dwc3-995   [000] d..1  1422.092840: dwc3_event: event (00120301): Link Change [U2]
> >> 
> >> From there, we see a change to Recovery.Active in timestamp
> >> 61.272120896. We don't see a link state change interrupt for this, so
> >> it's not reported here. We can see on LeCroy traces that host sends
> >> 372025 TS1 data packets for 12ms. From USB specification section
> >> 7.5.10.3.1 we read that once entering Recovery.Active a 12ms timer is
> >> started and (now on section 7.5.10.3.2) port should transition to
> >> SS.Inactive when this timer expires. Hence:
> >> 
> >>      irq/23-dwc3-995   [000] d..1  1427.800204: dwc3_event: event (00160301): Link Change [SS.Inactive]
> >> 
> >> From that point, on LeCroy timestamp 61.284317360, we see a transition
> >> to Rx.Detect which matches with:
> >> 
> >>      irq/23-dwc3-995   [000] d..1  1427.813205: dwc3_event: event (00150301): Link Change [RX.Detect]
> >> 
> >> From RX.Detect we see a series of transitions between Polling.LFPS and
> >> back to RX.Detect which eventually time out, meaning that RX.Detect
> >> failed and link transitions to SS.Disabled:
> >> 
> >>      irq/23-dwc3-995   [000] d..1  1427.911837: dwc3_event: event (00040301): Link Change [SS.Disabled]
> >> 
> >> From there, XHCI puts the SS link to U3 and, in order to do that it must
> >> go through RX.Detect, U0 and finally U3 (note that here U0 comes before
> >> RX.Detect, but I suppose that's some side effect of tracing since that's
> >> not a valid transition per figure 7-14):
> >> 
> >>      irq/23-dwc3-995   [000] d..1  1427.911849: dwc3_event: event (00000301): Link Change [U0]
> >>      irq/23-dwc3-995   [000] d..1  1427.914818: dwc3_event: event (00050301): Link Change [RX.Detect]
> >>      irq/23-dwc3-995   [000] d..1  1427.917746: dwc3_event: event (00030301): Link Change [U3]
> >> 
> >> Now, we get a reset which is actually an SE0 signal on the bus:
> >> 
> >>      irq/23-dwc3-995   [000] d..1  1435.982020: dwc3_event: event (00000101): Reset [U0]
> >> 
> >> For whatever reason that shows as Full Speed J on LeCroy traces. Go
> >> figure.
> >> 
> >> In any case, the problem here seems to be that device side was unable to
> >> transmit TS1 when exiting U2. Then it was, consequently, unable to
> >> perform RX.Detect and link was disabled and switched to full-speed and
> >> negotiated high-speed later on by means of chirp sequences.
> >> 
> >> I would really suggest running without LPM to get some more information
> >> since this seems to be rather clearly related to U2 exit. At the point
> >> of failure, it may also be a good idea to check the state of your PHYs
> >> it could be that the PHY got stuck in some weird state. Another thing to
> >> look at would be super speed eye diagram measurements, known errata for
> >> PHYs and redriver.
> >
> > Our team spent most of the day working with a high-speed signal
> > specialist to understand this issue more at the electrical level and we
> > have quite a few new findings.
> >
> > First, I'e attached Three screenshots of eye diagrams. The two images
> 
> that's an odd eye diagram. Usually, we have a picture which overlays
> several waveforms and produces something that looks like an eye. Like
> this image:
> 
> http://www.testusb.com/images/near%20end.png

I asked the high-speed signal specialist about this and he said that our
eye diagram does not properly resemble an eye because we are unable to
get a trigger lock while monitoring the SS signals. We do not have an
external trigger source connected to the oscilloscope, nor is the
oscilloscope acting as the USB host during these tests. So without a
proper mechanism for synchronizing the data with the scope, we won't be
able to get a proper eye diagram. We just have to infer based on the
dark and light patches visible in the scope.
> 
> > titled USB3_otg_rx0_data.jpg and USB3_otg_tx0_data1.jpg Are from our
> > production revision of our design. They show the differential input
> > and output on the SoC side of the redriver/mux (between SoC and
> > redriver/mux). Notice the "dashes" in the midle of the eye area. We've
> > observed these dashes on the Tx path between the redriver and the
> > connector as well but we do not have a screenshot. We are unable to
> > probe the Rx path between the redriver and the connector because the
> > signals are routed on inner layers of the board.
> 
> okay. BTW, is this eye diagram on the HS pair or SS pairs? Just to be
> clear, we should look at signal quality of the SS pairs, since the
> problem happens there. HS is behaving fine according to logs.

Eye diagrams are definitely on SS pairs.

> Electrical quality measurement with SS pairs is done by placing link in
> compliance mode. With dwc3 you can either force link to compliace via
> debugfs (see below) or you can have a scope with certification tools
> installed and that will move the link through a pre-specified pattern of
> link states and force the DUT to enter compliance.
> 
> To force dwc3 to compliance, here's what you do on device side as root:
> 
> # mkdir -p /d
> # mount -t debugfs none /d
> # cd /d/dwc3.0.auto # make sure this is the correct name for the directory
> # echo Compliance > link_state
> 
> There's some information about our debugfs interface here:
> 
> https://www.kernel.org/doc/html/latest/driver-api/usb/dwc3.html#debugfs
> 
> And implementation is under drivers/usb/dwc3/debugfs.c

I've tried to enable "Compliance" mode using the mechanism described
above (for dwc3.1.auto) but no luck. The echo command does not return an
error, but reading the link_state file immediately afterwards shows that
the link_state never changed to compliance mode. I tried entering
compliance mode while connected and disconnected to a host, but no luck.

Could you please provide more information here? Does the board have to
be connected to any host for this to work? or should it be connected to
a special USB test/certification instrument? We would ideally like to
get the board into compliance mode and have our high-speed signal
specialist take a look at the eye diagrams.

Just to clarify, when in compliance mode, do we have to send it any
specific commands? or will our device just start sending USB data and
all we have to do is listen and monitor with a scope?

> > The third screenshot titled "beta_USB3_otg_rx0.jpg" shows a cleaner eye
> > diagram without dashes in the middle. This sample is from a different
> > board from out "beta" design revision. The beta design also experiences
> > these mass storage mode problems.
> >
> > The redriver we are using is the TUSB542 from Texas Instruments:
> > http://www.ti.com/lit/ds/symlink/tusb542.pdf 
> >
> > I could not find any errata documents for this component. The datasheet
> > referenced above mentions a the U0, U2, and U3 link states briefly in
> > section 7.4 when providing an overview of features:
> >
> >> The TUSB542 deploys RX detect, LFPS signal detection and signal monitoring
> >> to implement an automatic power management scheme to provide active, U2/U3
> >> and disconnect modes. The automatic power management is driven by an
> >> advanced state machine, which is implemented to manage the device such that
> >> the re-driveroperatessmoothly in the links.
> >
> > U2 and U3 are also mentioned in the timing specifications. You mentioned
> > the device might not have exited U2 properly, so if the timing of the
> > redriver is a factor to consider here.
> >
> > Lastly, one big discovery we made was that we need AC coupling
> > capacitors on both sides of the redriver Tx paths, but we never added
> 
> it also has a guideline to route all differential pairs on the same
> layer and avoid vias, bends and test points. Also suggests keeping the
> distance of pairs at least 3 times the trace width (section 10.1)

So I just confirmed with our EE that the short answer is yes. We have
run several post-routing simulations with HyperLynx (stack-up included)
on our design and are aware of some signal attenuation (about 3dB loss
between the SoC and redriver). This takes all vias, bends, etc., into
account. Our design is pretty dense so we had to compromise a little bit
of signal strength. That's why we decided to add the redriver to our
design between alpha and beta revisions.

We also verified our length matching and all SS
pairs match within a couple of mils.

> > them to the Tx path between the SoC and the redriver. We only have these
> > caps on the Tx path between the redriver and the connector. Our EE will
> > be running a test to add coupling caps to see if this helps the issue.
> >
> > After further analysis of this issue, we think this might cause the
> > differential pairs to get out of sync. We also thing this might cause
> 
> It could be, sure. Please re-verify your length matching and
> differential impedance (should be 90-ohm +- 15%)

We received a report from the raw PCB manufacturer that shows the SS
traces are well within our specified tolerances. We specified 85-ohm +-
10% based on Intel's recommendation, and the average differential
impedances from several samples is somewhere around 85-86 ohms. We are
acqiuring a couple of raw PCBs now to confirm, but so far the data is
looking good.

> > the host controller in the SoC to be sinking a bit of current, causing
> > the link to fail in some way.
> >
> > I'll follow up with some more detail tomorrow as well as the results of
> > the disabled LPM test. If there's a way to disable LPM from the device,
> > please let me know :)
> 
> See above, but I don't know whether that will work :-)

So I used your approach to disable LPM from the host and it seems to
prevent the fall from SS to HS. Thank you for that suggestion. We also
reached out to Cypress and Intel about the issue and they noticed the
same issue and asked about the same recommendation. The unfortunate part
about this is that it doesn't seem to be a valid solution for devices in
the field. We won't be able to control the host computers the devices
are connected to and won't be able to disable LPM from the device.

Do you have any thoughts on why our board might be experiencing issues
with exiting U2? Does it seem like a timeout issue? The possibility of a
timeout is all I can come up with at the moment.

Intel also provided several other recommendations (this list is a direct quote):

   - Check if CHT client system use BIOS FRC 1.2.0 version or the latest
   BIOS FRC (as I remembered Modphy power gating should be disable in FRC
   1.2.0 or later, however it is better to confirm it again.)
   - Check if CHT(client) system can pass USB3.0 TX/RX test to make sure no
   EV issue on CHT board
   - Check if the host system(Win, or others) customer used can pass USB3.0
   TX/RX test to make sure no EV issue on host system
   - Check if Modphy power gating if disable as default in customer's CHT
   system, not sure if host(XHCI) and client(XDCI) mode are same setting for
   Modphy from BIOS setting. It might check with BIOS team for this.

I've asked this same question to our BIOS vendor, but do you know what
"modphy power gating" is and how it might be related to this issue?

Cheers,
Rob Weber