Re: Mass Storage Gadget Device Falls from SuperSpeed to High Speed

Felipe Balbi <felipe.balbi@xxxxxxxxxxxxxxx> · Wed, 27 Mar 2019 11:02:28 +0200




Hi,

Rob Weber <rob@xxxxxxxxxxx> writes:
> On Tue, Mar 26, 2019 at 11:14:47AM +0200, Felipe Balbi wrote:
>> >> > dmesg output is short (4 lines) and is probably helpful to find the
>> >> > meaningful timestamps of the dwc3 traces. The sequence of events on the
>> >> > device seems to start around 1346 seconds and ends around 1436 seconds.
>> >> 
>> >> right, from device perspective I see things behaving normally, then a
>> >> "random" reset shows up.
>> >
>> > In what way does this show up? Does the host send a reset command, does
>> > the device request the host port to reset, or does the device appear to
>> > reset itself?
>> 
>> this IRQ is generated based on Reset Signaling driven on the bus by the
>> host.
>> 
>> >> > Please note the "dwc3.1.auto: failed to enable ep0out" errors that are
>> >> > occurring when we connect our device to a host. These errors occur
>> >> > pretty regularly and we are not sure what is causing them.
>> >> 
>> >> this shouldn't happen, actually; where do you see those?
>> >
>> > They are in the dmesg output on the device. I was assuming they line up
>> 
>> huh? Here's what I have on device side dmesg:
>> 
>> [ 1345.662238] dwc3 dwc3.1.auto: request ffff880177cc4e40 was not queued to ep1out
>> [ 1346.581055] g_mass_storage gadget: super-speed config #1: Linux File-Backed Storage
>> [ 1435.982257] dwc3 dwc3.1.auto: request ffff880177cc4d80 was not queued to ep1out
>> [ 1436.264087] g_mass_storage gadget: high-speed config #1: Linux File-Backed Storage
>> 
>> That's the entire thing.
>> 
>> > with tracepoints in the dwc traces, but I wasn't sure what to look for.
>> > This error occurs approximately 75% of the time while the device is
>> > enumerating.
>> 
>> odd. Really shouldn't happen.
>
> After digging through my notes, I realized I am confusing us with two
> different errors my team has seen. Sorry about that. We experience the
> "request not queued" error part of the time. We've also seen the
> "failed to enable ep0out" error which was not observed during this
> particular test case. I'll see if I can re-create it and gather logs.

Sure thing, thanks

>> >> > I would appreciate any help or guidance anyone could provide in
>> >> > understanding more about this issue. We really want to understand why
>> >> > we're transitioning from superspeed to high speed and whether or not
>> >> > this issue is caused by software or hardware. I understand it might be
>> >> > difficult for someone to help with a custom design and an old kernel,
>> >> > but any pointers for further debugging would be incredibly helpful.
>> >> > Thank you in advance for taking a look at all this data!
>> >> 
>> >> Sure thing, I'll ask Mathias (XHCI maintainer) for his help on this too
>> >> since it's unclear what's causing the switch to HS when exitting U3.
>> >
>> > Thanks in advance for taking a look, Mathias! If there is any more
>> > information you need from me, please let me know.
>> 
>> XHCI tracepoints (together with dynamic printk of xhci) would be
>> great. Another thing to try is disabling LPM. It could be that your
>> redriver is one of those in the set of "can't handle LPM" redrivers.
>> 
>> That's an easy thing to try. I suppose below is possibly enough:
>> 
>> modified   drivers/usb/host/xhci-pci.c
>> @@ -154,7 +154,7 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
>>  		xhci->quirks |= XHCI_U2_DISABLE_WAKE;
>>  
>>  	if (pdev->vendor == PCI_VENDOR_ID_INTEL) {
>> -		xhci->quirks |= XHCI_LPM_SUPPORT;
>> +		/* xhci->quirks |= XHCI_LPM_SUPPORT; */
>>  		xhci->quirks |= XHCI_INTEL_HOST;
>>  		xhci->quirks |= XHCI_AVOID_BEI;
>>  	}
>> 
>
> I compiled a new kernel with these changes applied to the xhci driver
> and will test this tomorrow. This is my first time building and replacing
> the kernel on my host system, so this has been quite an experience.
> Now I have so many questions about the workflow of a kernel developer,
> but that is worth a whole new thread. Anyways...

Just ask away :-) If you want to ask about this in private, no problems
either.

> Is it possible to disable LPM support in the gadget? I was taking a look
> in drivers/usb/gadget/composite.c and found some references to LPM in
> the population of the ss_cap struct and the bmAttributes property. I
> wasn't sure of the cleanest way to disable it, so I would appreciate a
> pointer on where to make that change. It is generally faster for me to
> iterate on the target device's kernel.

I'd have to verify with specification again, IIRC U1/U2 is mandatory for
a USB SS device, but it's not mandatory for the host. Perhaps you could,
for testing purposes, just stall U1 Enable and U2 Enable requests, but I
don't know if that would work. Here's a patch for testing:

modified   drivers/usb/dwc3/ep0.c
@@ -456,9 +456,11 @@ static int dwc3_ep0_handle_device(struct dwc3 *dwc,
 	 * default control pipe
 	 */
 	case USB_DEVICE_U1_ENABLE:
+		return -EINVAL;
 		ret = dwc3_ep0_handle_u1(dwc, state, set);
 		break;
 	case USB_DEVICE_U2_ENABLE:
+		return -EINVAL;
 		ret = dwc3_ep0_handle_u2(dwc, state, set);
 		break;
 	case USB_DEVICE_LTM_ENABLE:

>> Felipe Balbi <felipe.balbi@xxxxxxxxxxxxxxx> writes:
>> >>> Sure, that would be great. How can I download it? Any link which I can
>> >>> curl? You can send it to me only, if you want; no problem.
>> >>
>> >> You (and anyone reading) can download it here:
>> >> https://drive.google.com/file/d/1WSJ-222bguXTsRZ-mI5u2M7IP9FsZdj7/view?usp=sharing
>> >
>> > I'll download it and have a look, thanks
>> 
>> Alright, the LeCroy traces made everything a lot more clear. So here's
>> what happens. A few assumptions first:
>> 
>> R is the right port where the Host is connected
>> L is the left port where the device is connected
>> 
>> 
>> We can start from LeCroy timestamp 61.271965376, that correlates with
>> DWC3 tracepoint 1422.092840, iow:
>> 
>>      irq/23-dwc3-995   [000] d..1  1422.092840: dwc3_event: event (00120301): Link Change [U2]
>> 
>> From there, we see a change to Recovery.Active in timestamp
>> 61.272120896. We don't see a link state change interrupt for this, so
>> it's not reported here. We can see on LeCroy traces that host sends
>> 372025 TS1 data packets for 12ms. From USB specification section
>> 7.5.10.3.1 we read that once entering Recovery.Active a 12ms timer is
>> started and (now on section 7.5.10.3.2) port should transition to
>> SS.Inactive when this timer expires. Hence:
>> 
>>      irq/23-dwc3-995   [000] d..1  1427.800204: dwc3_event: event (00160301): Link Change [SS.Inactive]
>> 
>> From that point, on LeCroy timestamp 61.284317360, we see a transition
>> to Rx.Detect which matches with:
>> 
>>      irq/23-dwc3-995   [000] d..1  1427.813205: dwc3_event: event (00150301): Link Change [RX.Detect]
>> 
>> From RX.Detect we see a series of transitions between Polling.LFPS and
>> back to RX.Detect which eventually time out, meaning that RX.Detect
>> failed and link transitions to SS.Disabled:
>> 
>>      irq/23-dwc3-995   [000] d..1  1427.911837: dwc3_event: event (00040301): Link Change [SS.Disabled]
>> 
>> From there, XHCI puts the SS link to U3 and, in order to do that it must
>> go through RX.Detect, U0 and finally U3 (note that here U0 comes before
>> RX.Detect, but I suppose that's some side effect of tracing since that's
>> not a valid transition per figure 7-14):
>> 
>>      irq/23-dwc3-995   [000] d..1  1427.911849: dwc3_event: event (00000301): Link Change [U0]
>>      irq/23-dwc3-995   [000] d..1  1427.914818: dwc3_event: event (00050301): Link Change [RX.Detect]
>>      irq/23-dwc3-995   [000] d..1  1427.917746: dwc3_event: event (00030301): Link Change [U3]
>> 
>> Now, we get a reset which is actually an SE0 signal on the bus:
>> 
>>      irq/23-dwc3-995   [000] d..1  1435.982020: dwc3_event: event (00000101): Reset [U0]
>> 
>> For whatever reason that shows as Full Speed J on LeCroy traces. Go
>> figure.
>> 
>> In any case, the problem here seems to be that device side was unable to
>> transmit TS1 when exiting U2. Then it was, consequently, unable to
>> perform RX.Detect and link was disabled and switched to full-speed and
>> negotiated high-speed later on by means of chirp sequences.
>> 
>> I would really suggest running without LPM to get some more information
>> since this seems to be rather clearly related to U2 exit. At the point
>> of failure, it may also be a good idea to check the state of your PHYs
>> it could be that the PHY got stuck in some weird state. Another thing to
>> look at would be super speed eye diagram measurements, known errata for
>> PHYs and redriver.
>
> Our team spent most of the day working with a high-speed signal
> specialist to understand this issue more at the electrical level and we
> have quite a few new findings.
>
> First, I'e attached Three screenshots of eye diagrams. The two images

that's an odd eye diagram. Usually, we have a picture which overlays
several waveforms and produces something that looks like an eye. Like
this image:

http://www.testusb.com/images/near%20end.png

> titled USB3_otg_rx0_data.jpg and USB3_otg_tx0_data1.jpg Are from our
> production revision of our design. They show the differential input
> and output on the SoC side of the redriver/mux (between SoC and
> redriver/mux). Notice the "dashes" in the midle of the eye area. We've
> observed these dashes on the Tx path between the redriver and the
> connector as well but we do not have a screenshot. We are unable to
> probe the Rx path between the redriver and the connector because the
> signals are routed on inner layers of the board.

okay. BTW, is this eye diagram on the HS pair or SS pairs? Just to be
clear, we should look at signal quality of the SS pairs, since the
problem happens there. HS is behaving fine according to logs.

Electrical quality measurement with SS pairs is done by placing link in
compliance mode. With dwc3 you can either force link to compliace via
debugfs (see below) or you can have a scope with certification tools
installed and that will move the link through a pre-specified pattern of
link states and force the DUT to enter compliance.

To force dwc3 to compliance, here's what you do on device side as root:

# mkdir -p /d
# mount -t debugfs none /d
# cd /d/dwc3.0.auto # make sure this is the correct name for the directory
# echo Compliance > link_state

There's some information about our debugfs interface here:

https://www.kernel.org/doc/html/latest/driver-api/usb/dwc3.html#debugfs

And implementation is under drivers/usb/dwc3/debugfs.c

> The third screenshot titled "beta_USB3_otg_rx0.jpg" shows a cleaner eye
> diagram without dashes in the middle. This sample is from a different
> board from out "beta" design revision. The beta design also experiences
> these mass storage mode problems.
>
> The redriver we are using is the TUSB542 from Texas Instruments:
> http://www.ti.com/lit/ds/symlink/tusb542.pdf 
>
> I could not find any errata documents for this component. The datasheet
> referenced above mentions a the U0, U2, and U3 link states briefly in
> section 7.4 when providing an overview of features:
>
>> The TUSB542 deploys RX detect, LFPS signal detection and signal monitoring
>> to implement an automatic power management scheme to provide active, U2/U3
>> and disconnect modes. The automatic power management is driven by an
>> advanced state machine, which is implemented to manage the device such that
>> the re-driveroperatessmoothly in the links.
>
> U2 and U3 are also mentioned in the timing specifications. You mentioned
> the device might not have exited U2 properly, so if the timing of the
> redriver is a factor to consider here.
>
> Lastly, one big discovery we made was that we need AC coupling
> capacitors on both sides of the redriver Tx paths, but we never added

it also has a guideline to route all differential pairs on the same
layer and avoid vias, bends and test points. Also suggests keeping the
distance of pairs at least 3 times the trace width (section 10.1)

> them to the Tx path between the SoC and the redriver. We only have these
> caps on the Tx path between the redriver and the connector. Our EE will
> be running a test to add coupling caps to see if this helps the issue.
>
> After further analysis of this issue, we think this might cause the
> differential pairs to get out of sync. We also thing this might cause

It could be, sure. Please re-verify your length matching and
differential impedance (should be 90-ohm +- 15%)

> the host controller in the SoC to be sinking a bit of current, causing
> the link to fail in some way.
>
> I'll follow up with some more detail tomorrow as well as the results of
> the disabled LPM test. If there's a way to disable LPM from the device,
> please let me know :)

See above, but I don't know whether that will work :-)

-- 
balbi
Attachment:
signature.asc

Description: PGP signature