Re: Mass Storage Gadget Device Falls from SuperSpeed to High Speed

Rob Weber <rob@xxxxxxxxxxx> · Thu, 28 Mar 2019 01:53:43 -0700

Hello Felipe,

On Thu, Mar 28, 2019 at 08:36:12AM +0200, Felipe Balbi wrote:
> Rob Weber <rob@xxxxxxxxxxx> writes:
> >> >> Felipe Balbi <felipe.balbi@xxxxxxxxxxxxxxx> writes:
> >> >> >>> Sure, that would be great. How can I download it? Any link which I can
> >> >> >>> curl? You can send it to me only, if you want; no problem.
> >> >> >>
> >> >> >> You (and anyone reading) can download it here:
> >> >> >> https://drive.google.com/file/d/1WSJ-222bguXTsRZ-mI5u2M7IP9FsZdj7/view?usp=sharing
> >> >> >
> >> >> > I'll download it and have a look, thanks
> >> >> 
> >> >> Alright, the LeCroy traces made everything a lot more clear. So here's
> >> >> what happens. A few assumptions first:
> >> >> 
> >> >> R is the right port where the Host is connected
> >> >> L is the left port where the device is connected
> >> >> 
> >> >> 
> >> >> We can start from LeCroy timestamp 61.271965376, that correlates with
> >> >> DWC3 tracepoint 1422.092840, iow:
> >> >> 
> >> >>      irq/23-dwc3-995   [000] d..1  1422.092840: dwc3_event: event (00120301): Link Change [U2]
> >> >> 
> >> >> From there, we see a change to Recovery.Active in timestamp
> >> >> 61.272120896. We don't see a link state change interrupt for this, so
> >> >> it's not reported here. We can see on LeCroy traces that host sends
> >> >> 372025 TS1 data packets for 12ms. From USB specification section
> >> >> 7.5.10.3.1 we read that once entering Recovery.Active a 12ms timer is
> >> >> started and (now on section 7.5.10.3.2) port should transition to
> >> >> SS.Inactive when this timer expires. Hence:
> >> >> 
> >> >>      irq/23-dwc3-995   [000] d..1  1427.800204: dwc3_event: event (00160301): Link Change [SS.Inactive]
> >> >> 
> >> >> From that point, on LeCroy timestamp 61.284317360, we see a transition
> >> >> to Rx.Detect which matches with:
> >> >> 
> >> >>      irq/23-dwc3-995   [000] d..1  1427.813205: dwc3_event: event (00150301): Link Change [RX.Detect]
> >> >> 
> >> >> From RX.Detect we see a series of transitions between Polling.LFPS and
> >> >> back to RX.Detect which eventually time out, meaning that RX.Detect
> >> >> failed and link transitions to SS.Disabled:
> >> >> 
> >> >>      irq/23-dwc3-995   [000] d..1  1427.911837: dwc3_event: event (00040301): Link Change [SS.Disabled]
> >> >> 
> >> >> From there, XHCI puts the SS link to U3 and, in order to do that it must
> >> >> go through RX.Detect, U0 and finally U3 (note that here U0 comes before
> >> >> RX.Detect, but I suppose that's some side effect of tracing since that's
> >> >> not a valid transition per figure 7-14):
> >> >> 
> >> >>      irq/23-dwc3-995   [000] d..1  1427.911849: dwc3_event: event (00000301): Link Change [U0]
> >> >>      irq/23-dwc3-995   [000] d..1  1427.914818: dwc3_event: event (00050301): Link Change [RX.Detect]
> >> >>      irq/23-dwc3-995   [000] d..1  1427.917746: dwc3_event: event (00030301): Link Change [U3]
> >> >> 
> >> >> Now, we get a reset which is actually an SE0 signal on the bus:
> >> >> 
> >> >>      irq/23-dwc3-995   [000] d..1  1435.982020: dwc3_event: event (00000101): Reset [U0]
> >> >> 
> >> >> For whatever reason that shows as Full Speed J on LeCroy traces. Go
> >> >> figure.
> >> >> 
> >> >> In any case, the problem here seems to be that device side was unable to
> >> >> transmit TS1 when exiting U2. Then it was, consequently, unable to
> >> >> perform RX.Detect and link was disabled and switched to full-speed and
> >> >> negotiated high-speed later on by means of chirp sequences.
> >> >> 
> >> >> I would really suggest running without LPM to get some more information
> >> >> since this seems to be rather clearly related to U2 exit. At the point
> >> >> of failure, it may also be a good idea to check the state of your PHYs;
> >> >> it could be that the PHY got stuck in some weird state. Another thing to
> >> >> look at would be super speed eye diagram measurements, known errata for
> >> >> PHYs and redriver.
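
(Inline note on the trace excerpts above: the hex word in each dwc3_event
line can be decoded by hand, which helped me follow the sequence. Below is
a minimal Python sketch. It assumes the field layout I read out of struct
dwc3_event_devt in drivers/usb/dwc3/core.h: bit 0 set for a device event,
event type in bits 8-11, event info in bits 16-24, and the link state in
the low four bits of the event info. The function name and the tables are
my own, and the type table only covers the events I am sure about.)

```python
# Assumed layout, taken from my reading of drivers/usb/dwc3/core.h;
# not checked against the databook.
DEVICE_EVENT_TYPES = {
    0: "Disconnect",
    1: "Reset",
    2: "Connection Done",
    3: "Link Change",
}

LINK_STATES = {
    0x0: "U0", 0x1: "U1", 0x2: "U2", 0x3: "U3",
    0x4: "SS.Disabled", 0x5: "RX.Detect", 0x6: "SS.Inactive",
    0x7: "Polling", 0x8: "Recovery", 0x9: "Hot Reset",
    0xA: "Compliance", 0xB: "Loopback", 0xE: "Reset", 0xF: "Resume",
}


def decode_devt(raw):
    """Split a raw dwc3 device event word into (event type, link state)."""
    if not raw & 0x1:
        # Bit 0 clear would mean an endpoint event, which has another layout.
        raise ValueError("not a device event: %#010x" % raw)
    event_type = (raw >> 8) & 0xF       # bits 8-11
    event_info = (raw >> 16) & 0x1FF    # bits 16-24
    link_state = event_info & 0xF       # low nibble of the event info
    return (
        DEVICE_EVENT_TYPES.get(event_type, "type %d" % event_type),
        LINK_STATES.get(link_state, "state %d" % link_state),
    )


# Events quoted in this thread, in the order they appear.
for word in (0x00120301, 0x00160301, 0x00150301, 0x00040301, 0x00000101):
    print("%08x" % word, decode_devt(word))
```

With that layout, 00120301 comes out as a Link Change to U2 and 00000101
as a Reset in U0, which matches the strings the tracepoint printed.
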
> >> >
> >> > Our team spent most of the day working with a high-speed signal
> >> > specialist to understand this issue more at the electrical level and we
> >> > have quite a few new findings.
> >> >
> >> > First, I've attached three screenshots of eye diagrams. The two images
> >> 
> >> that's an odd eye diagram. Usually, we have a picture which overlays
> >> several waveforms and produces something that looks like an eye. Like
> >> this image:
> >> 
> >> http://www.testusb.com/images/near%20end.png
> >
> > I asked the high-speed signal specialist about this and he said that our
> > eye diagram does not properly resemble an eye because we are unable to
> > get a trigger lock while monitoring the SS signals. We do not have an
> > external trigger source connected to the oscilloscope, nor is the
> > oscilloscope acting as the USB host during these tests. So without a
> > proper mechanism for synchronizing the data with the scope, we won't be
> > able to get a proper eye diagram. We just have to infer based on the
> > dark and light patches visible in the scope.
> 
> Understood. One way to get a slightly better view is to put your
> display persistency to maximum and use a regular rise or fall trigger
> point. It's not perfect, but I've used that in the past :-)
> 
> Minor detail though, thanks for explaining the situation.

Nice tip! I'll pass this on to our EE and high-speed signal specialist
to see if this helps.
> 
> >> > titled USB3_otg_rx0_data.jpg and USB3_otg_tx0_data1.jpg are from our
> >> > production revision of our design. They show the differential input
> >> > and output on the SoC side of the redriver/mux (between SoC and
> >> > redriver/mux). Notice the "dashes" in the middle of the eye area. We've
> >> > observed these dashes on the Tx path between the redriver and the
> >> > connector as well but we do not have a screenshot. We are unable to
> >> > probe the Rx path between the redriver and the connector because the
> >> > signals are routed on inner layers of the board.
> >> 
> >> okay. BTW, is this eye diagram on the HS pair or SS pairs? Just to be
> >> clear, we should look at signal quality of the SS pairs, since the
> >> problem happens there. HS is behaving fine according to logs.
> >
> > Eye diagrams are definitely on SS pairs.
> 
> Thanks for confirming
> 
> >> Electrical quality measurement with SS pairs is done by placing link in
> >> compliance mode. With dwc3 you can either force link to compliance via
> >> debugfs (see below) or you can have a scope with certification tools
> >> installed and that will move the link through a pre-specified pattern of
> >> link states and force the DUT to enter compliance.
> >> 
> >> To force dwc3 to compliance, here's what you do on device side as root:
> >> 
> >> # mkdir -p /d
> >> # mount -t debugfs none /d
> >> # cd /d/dwc3.0.auto # make sure this is the correct name for the directory
> >> # echo Compliance > link_state
> >> 
> >> There's some information about our debugfs interface here:
> >> 
> >> https://www.kernel.org/doc/html/latest/driver-api/usb/dwc3.html#debugfs
> >> 
> >> And implementation is under drivers/usb/dwc3/debugfs.c
> >
> > I've tried to enable "Compliance" mode using the mechanism described
> > above (for dwc3.1.auto) but no luck. The echo command does not return an
> > error, but reading the link_state file immediately afterwards shows that
> > the link_state never changed to compliance mode. I tried entering
> > compliance mode both connected to and disconnected from a host, but no luck.
> 
> Interesting. That used to work fine.
> 
> > Could you please provide more information here? Does the board have to
> > be connected to any host for this to work? or should it be connected to
> > a special USB test/certification instrument? We would ideally like to
> > get the board into compliance mode and have our high-speed signal
> > specialist take a look at the eye diagrams.
> 
> It shouldn't matter. We would just force link to Compliance. No idea why
> it stopped working. I'll add that to my TODO list.

I was testing on kernel 4.9.115, so it might be very different on a more
recent version. I plan on testing a more recent kernel tomorrow
(finally) so I'll be sure to test compliance link as well.
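
For the retest I plan to script the write-then-read-back check so the
result is unambiguous. A rough Python sketch of what I have in mind,
assuming debugfs is mounted at /d as in your instructions and that the
controller directory is dwc3.1.auto on our board; the helper name is my
own and nothing here comes from the driver itself:

```python
from pathlib import Path


def force_link_state(debugfs_dir, state="Compliance"):
    """Write `state` to link_state, then read the file back.

    Returns (matched, observed) so a silent failure, where the write
    succeeds but the link never changes state, is easy to spot.
    """
    node = Path(debugfs_dir) / "link_state"
    node.write_text(state)
    observed = node.read_text().strip()
    return observed == state, observed


# Intended use on the target, as root (path is an assumption for our board):
#   ok, seen = force_link_state("/d/dwc3.1.auto")
#   print("entered Compliance" if ok else "still in %s" % seen)
```

On 4.9.115 I would expect this to report the mismatch I saw by hand,
i.e. the write returns without error but the read-back is not Compliance.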

> > Just to clarify, when in compliance mode, do we have to send it any
> > specific commands? or will our device just start sending USB data and
> > all we have to do is listen and monitor with a scope?
> 
> When link enters compliance, the USB controller (in this case, dwc3)
> would start sending a pseudo-random sequence of symbols (the compliance
> pattern). All you would have to do is listen.
> 
> >> > The third screenshot titled "beta_USB3_otg_rx0.jpg" shows a cleaner eye
> >> > diagram without dashes in the middle. This sample is from a different
> >> > board from our "beta" design revision. The beta design also experiences
> >> > these mass storage mode problems.
> >> >
> >> > The redriver we are using is the TUSB542 from Texas Instruments:
> >> > http://www.ti.com/lit/ds/symlink/tusb542.pdf 
> >> >
> >> > I could not find any errata documents for this component. The datasheet
> >> > referenced above mentions the U0, U2, and U3 link states briefly in
> >> > section 7.4 when providing an overview of features:
> >> >
> >> >> The TUSB542 deploys RX detect, LFPS signal detection and signal monitoring
> >> >> to implement an automatic power management scheme to provide active, U2/U3
> >> >> and disconnect modes. The automatic power management is driven by an
> >> >> advanced state machine, which is implemented to manage the device such that
> >> >> the re-driver operates smoothly in the links.
> >> >
> >> > U2 and U3 are also mentioned in the timing specifications. You mentioned
> >> > the device might not have exited U2 properly, so the timing of the
> >> > redriver may be a factor to consider here.
> >> >
> >> > Lastly, one big discovery we made was that we need AC coupling
> >> > capacitors on both sides of the redriver Tx paths, but we never added
> >> 
> >> it also has a guideline to route all differential pairs on the same
> >> layer and avoid vias, bends and test points. Also suggests keeping the
> >> distance of pairs at least 3 times the trace width (section 10.1)
> >
> > So I just confirmed with our EE that the short answer is yes. We have
> > run several post-routing simulations with HyperLynx (stack-up included)
> > on our design and are aware of some signal attenuation (about 3dB loss
> > between the SoC and redriver). This takes all vias, bends, etc., into
> > account. Our design is pretty dense so we had to compromise a little bit
> > of signal strength. That's why we decided to add the redriver to our
> > design between alpha and beta revisions.
> >
> > We also verified our length matching and all SS
> > pairs match within a couple of mils.
> 
> That's great, all of those can be ruled out.
> 
> >> > them to the Tx path between the SoC and the redriver. We only have these
> >> > caps on the Tx path between the redriver and the connector. Our EE will
> >> > be running a test to add coupling caps to see if this helps the issue.
> >> >
> >> > After further analysis of this issue, we think this might cause the
> >> > differential pairs to get out of sync. We also think this might cause
> >> 
> >> It could be, sure. Please re-verify your length matching and
> >> differential impedance (should be 90-ohm +- 15%)
> >
> > We received a report from the raw PCB manufacturer that shows the SS
> > traces are well within our specified tolerances. We specified 85-ohm +-
> > 10% based on Intel's recommendation, and the average differential
> > impedance across several samples is somewhere around 85-86 ohms. We are
> > acquiring a couple of raw PCBs now to confirm, but so far the data is
> > looking good.
> 
> That's great too, also can be ruled out.

Yeah, overall we're feeling relatively confident in our design. A nice
eye diagram would be the last piece of evidence we need to rule out
hardware.

> >> > the host controller in the SoC to be sinking a bit of current, causing
> >> > the link to fail in some way.
> >> >
> >> > I'll follow up with some more detail tomorrow as well as the results of
> >> > the disabled LPM test. If there's a way to disable LPM from the device,
> >> > please let me know :)
> >> 
> >> See above, but I don't know whether that will work :-)
> >
> > So I used your approach to disable LPM from the host and it seems to
> > prevent the fall from SS to HS. Thank you for that suggestion. We also
> 
> Great! Now we can confirm that the problem is indeed because of a
> failure in U2 exit.
> 
> > reached out to Cypress and Intel about the issue and they noticed the
> > same issue and asked about the same recommendation. The unfortunate part
> 
> Interesting. Who are you in touch with?

We usually just open up support tickets with our FAE. A different
support engineer usually responds to our questions for each issue.
Jerry, a support specialist, responded to this particular request.
I do not have Jerry's email or other identifier. If you would like
to find out, I could send you our Intel Premier Support case number
in a private thread if that helps you find out who it is.

> > about this is that it doesn't seem to be a valid solution for devices in
> > the field. We won't be able to control the host computers the devices
> > are connected to and won't be able to disable LPM from the device.
> 
> Completely agree. That was just a simple test to pin-point the problem
> area.
> 
> > Do you have any thoughts on why our board might be experiencing issues
> > with exiting U2? Does it seem like a timeout issue? The possibility of a
> > timeout is all I can come up with at the moment.
> 
> If I were to guess, I'd say the PHY is getting confused.

Which PHY are you referring to? The dwc3?

> > Intel also provided several other recommendations (this list is a direct quote):
> >
> >    - Check if CHT client system use BIOS FRC 1.2.0 version or the latest
> >    BIOS FRC (as I remembered Modphy power gating should be disable in FRC
> >    1.2.0 or later, however it is better to confirm it again.)
> >    - Check if CHT(client) system can pass USB3.0 TX/RX test to make sure no
> >    EV issue on CHT board
> 
> This is something you have already done. Your board seems to be fine
> from a signal quality perspective.
> 
> >    - Check if the host system(Win, or others) customer used can pass USB3.0
> >    TX/RX test to make sure no EV issue on host system
> 
> I'd say this is pretty pointless. We can't expect you to provide a list
> of which hosts your product will work against.
> 
> >    - Check if Modphy power gating if disable as default in customer's CHT
> >    system, not sure if host(XHCI) and client(XDCI) mode are same setting for
> >    Modphy from BIOS setting. It might check with BIOS team for this.
> >
> > I've asked this same question to our BIOS vendor, but do you know what
> > "modphy power gating" is and how it might be related to this issue?
> 
> modphy is the USB PHY integrated in your SoC. There's no control for
> that from OS side, only BIOS unfortunately. There is, however, one thing
> we can try. DWC3 has several quirk flags for known quirky PHYs; perhaps
> CHT needs one of those. Can you try with this patch and let me know
> whether it helps?

Sure thing, I will try tomorrow. Could you possibly explain what a quirk
is as it relates to the kernel? I see this all over the source tree but
never knew how it was used. Does the dwc3 also know about "quirks" and
these particular flags? or are these flags just specific to the kernel
and its functionality?

> modified   drivers/usb/dwc3/dwc3-pci.c
> @@ -105,6 +105,8 @@ static int dwc3_byt_enable_ulpi_refclock(struct pci_dev *pci)
>  static const struct property_entry dwc3_pci_intel_properties[] = {
>  	PROPERTY_ENTRY_STRING("dr_mode", "peripheral"),
>  	PROPERTY_ENTRY_BOOL("linux,sysdev_is_parent"),
> +	PROPERTY_ENTRY_BOOL("snps,dis_u3_susphy_quirk"),
> +	PROPERTY_ENTRY_BOOL("snps,dis_u2_susphy_quirk"),
>  	{}
>  };
>  
> These two quirks will disable PHY suspend. There are other relevant quirk flags

What do you mean by PHY suspend? Will it disable U2/U3 for the dwc3? I
see it modified the DWC3_GUSB2PHYCFG_SUSPHY bit in the configuration
register, but I don't have access to the dwc3 databook to dig deeper
into this.
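
To check my understanding, here is how I currently read the effect of the
two properties, written as a small Python model. Everything in it is an
assumption from skimming dwc3_phy_setup() and core.h: the bit positions
(bit 6 of GUSB2PHYCFG, bit 17 of GUSB3PIPECTL) and the idea that each
quirk simply clears the matching suspend-enable bit. The function name is
mine. Please correct me if this is wrong:

```python
# Bit positions as I read them in drivers/usb/dwc3/core.h (unverified
# against the databook, which I do not have).
DWC3_GUSB2PHYCFG_SUSPHY = 1 << 6     # USB2 PHY suspend enable
DWC3_GUSB3PIPECTL_SUSPHY = 1 << 17   # USB3 PHY suspend enable


def apply_susphy_quirks(gusb2phycfg, gusb3pipectl,
                        dis_u2_susphy_quirk, dis_u3_susphy_quirk):
    """Model of what the two quirk flags appear to do to the PHY registers."""
    if dis_u2_susphy_quirk:
        gusb2phycfg &= ~DWC3_GUSB2PHYCFG_SUSPHY
    if dis_u3_susphy_quirk:
        gusb3pipectl &= ~DWC3_GUSB3PIPECTL_SUSPHY
    return gusb2phycfg, gusb3pipectl


# Example with made-up register values that have only the suspend bits set.
print([hex(v) for v in apply_susphy_quirks(0x40, 0x20000, True, True)])
```

If that reading is right, the link can still enter U2/U3 and only the PHY
is kept from suspending, but that is exactly the part I am unsure about.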

> which we can try in case these two don't help. I'd like to figure out
> exactly which quirk flag helps (if any). After that, we would need to
> check if a similar problem happens on any CHT system or just your
> design.
> 
> If it happens on any other system, then I can make sure we add a quirk
> flag to all CHTs.

Sounds good!

Thanks for taking the time to answer my questions! It's definitely
helpful for my understanding of USB. I'm learning quite a bit of
new information with each email and it's pretty awesome.

Cheers,
Rob Weber