Re: Mass Storage Gadget Device Falls from SuperSpeed to High Speed

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

Rob Weber <rob@xxxxxxxxxxx> writes:
> On Wed, Apr 03, 2019 at 10:48:36AM +0300, Felipe Balbi wrote:
>> >> > * We still saw the drop from SS to HS on a Windows laptop. Linux and Mac
>> >> >   seemed to be fine, but there's definitely a chance we did not test
>> >> >   with enough samples to confidently say this quirk is stable with those
>> >> >   hosts.
>> >> 
>> >> okay, so still not good enough. Let's try to prevent the device side
>> >> from ever initiating U1/U2 entry and from ever accepting LGO_U1 and
>> >> LGO_U2 from the host. Here's a (hack) patch for that
>> >> 
>> >> modified   drivers/usb/dwc3/ep0.c
>> >> @@ -382,7 +382,7 @@ static int dwc3_ep0_handle_u1(struct dwc3 *dwc, enum usb_device_state state,
>> >>  
>> >>  	reg = dwc3_readl(dwc->regs, DWC3_DCTL);
>> >>  	if (set)
>> >> -		reg |= DWC3_DCTL_INITU1ENA;
>> >> +		return -EINVAL;
>> >>  	else
>> >>  		reg &= ~DWC3_DCTL_INITU1ENA;
>> >>  	dwc3_writel(dwc->regs, DWC3_DCTL, reg);
>> >> @@ -404,7 +404,7 @@ static int dwc3_ep0_handle_u2(struct dwc3 *dwc, enum usb_device_state state,
>> >>  
>> >>  	reg = dwc3_readl(dwc->regs, DWC3_DCTL);
>> >>  	if (set)
>> >> -		reg |= DWC3_DCTL_INITU2ENA;
>> >> +		return -EINVAL;
>> >>  	else
>> >>  		reg &= ~DWC3_DCTL_INITU2ENA;
>> >>  	dwc3_writel(dwc->regs, DWC3_DCTL, reg);
>> >> @@ -625,9 +625,9 @@ static int dwc3_ep0_set_config(struct dwc3 *dwc, struct usb_ctrlrequest *ctrl)
>> >>  			 * Enable transition to U1/U2 state when
>> >>  			 * nothing is pending from application.
>> >>  			 */
>> >> -			reg = dwc3_readl(dwc->regs, DWC3_DCTL);
>> >> -			reg |= (DWC3_DCTL_ACCEPTU1ENA | DWC3_DCTL_ACCEPTU2ENA);
>> >> -			dwc3_writel(dwc->regs, DWC3_DCTL, reg);
>> >> +			/* reg = dwc3_readl(dwc->regs, DWC3_DCTL); */
>> >> +			/* reg |= (DWC3_DCTL_ACCEPTU1ENA | DWC3_DCTL_ACCEPTU2ENA); */
>> >> +			/* dwc3_writel(dwc->regs, DWC3_DCTL, reg); */
>> >>  		}
>> >>  		break;
>> >> 
>> >> 
>> >> I don't think it will apply cleanly to your kernel, let me know if you
>> >> need help porting it.
>> >
>> > Thank you for the offer. It wasn't too bad after reading the ep0.c file
>> > for both 4.9.115 and master. I've included the patch in the attached
>> > tarball if you would like to see how I ported it.
>> 
>> Thank you. BTW, it may very well be that you only need to disable U2 and
>> leave U1 enabled. That may help save some power, but I don't know
>> whether you have requirements for this :-)
>
> Ahh, good point. We might come back to this after we finish testing with
> the current patch. Thanks!
>
>> >> > So tomorrow I hope to gather more concrete data to back up these
>> >> > findings. Our USB protocol analyzer is also out-of-order at the moment,
>> >> > so we might not be able to get analyzer data for a day or two either.
>> >> >
>> >> > Are there some other quirks you would like us to try out?
>> >> 
>> >> See above, not a quirk flag, but just a complete removal of U1/U2 entry
>> >> from the device side.
>> >
>> > Thank you for providing the above patch. It seems quite stable for a
>> > variety of hosts! We want to continue testing this on several samples
>> > with a bunch of different host computers and cables to make sure we're
>> > in the clear, but this patch is feeling pretty good.
>> 
>> Good to hear, please update me once you're happy with the results. Also,
>> if you notice this problem happening again in the future, don't hesitate
>> to contact me.
>
> The results are very promising, and our team is feeling very confident
> after hundreds of connections and disconnections :)

Good. Just thought I'd mention that we use a nice tool to help testing
such long connect/disconnect cycles.

http://www.cleware-shop.de/epages/63698188.sf/en_US/?ObjectPath=/Shops/63698188/Products/1001/SubProducts/1001-1

I'm not endorsing the company in any way, but we that little tool helps
us quite a bit. I wrote a tool to use that many years ago and I still
use it. It's available as part of another project where I keep testing
tools:

https://github.com/felipebalbi/usb-tools

The name of the binary will be "cleware".

> We've identified a couple of auxiliary issues that seem to be unrelated,
> but we are trying to gather more data to be sure. One issue we are
> experiencing is particularly clear: We seem to have a race condition
> between the switching of channels on the SS mux on our platform and the
> enumeration of our device. We believe our device gets enumerated in high
> speed before we switch the channel of the SS mux. This can be clearly
> observed while watching the logs of the MCU on our platform while
> simultaneously watching dmesg. Below is an example of this test setup.
> Please note the MCU logs begin with [general][DBG]:
>
> root@gnarbox-2:~# echo device > /sys/class/usb_role/intel_xhci_usb_sw-role-switch/role
> root@gnarbox-2:~# ./bin/usb-trace 1
> echo 1 > /t/events/dwc3/enable
> echo 1 > /t/events/xhci-hcd/enable
> echo 1 > /t/events/gadget/enable
> root@gnarbox-2:~# modprobe g_mass_storage file=/dev/nvme0n1p7 iSerialNumber=90405
> root@gnarbox-2:~# stty -F /dev/ttyS2 115200 cs8 cstopb -parenb -icrnl
> root@gnarbox-2:~# cat /dev/ttyS2 &
> [1] 1733
> root@gnarbox-2:~# dmesg -w
>
> Okay, setup complete, aaand we're rolling...
>
> [general] [DBG] handle_intr: (otg) event: Rp Change Detected()
> [general] [DBG] apply_pdo: setting limit based on Rp val 0
> [general] [DBG] set_current_limit: IIN_HOST: 100 0x0100
> [general] [DBG] set_current_limit: success (tried 1 times)
> [general] [DBG] handle_intr: (otg) event: Type-C Port Connect Detected()
> [general] [DBG] usb_mux_sel: (otg) polarity: 2
>
> Here's where the PD controller detects the connection. Note the polarity
> of the cable that we detect.
>
> [general] [DBG] handle_intr: (otg) Nothing to do.
> [general] [DBG] handle_intr: (otg) event: Hard Reset Sent to Port Partner
> [  179.905835] g_mass_storage gadget: super-speed config #1: Linux File-Backed Storage
>
> shortly after, the device is enumerated and the link speed is set to 5Gbps
>
> [general] [DBG] handle_intr: (otg) event: Hard Reset Sent to Port Partner
> [general] [DBG] handle_intr: (otg) event: Hard Reset Sent to Port Partner
> [general] [DBG] handle_intr: (otg) event: Source Disabled State Entered
> [general] [DBG] enable_charging: incorrect current limit(3300), setting(100)
> [general] [DBG] set_current_limit: IIN_HOST: 100 0x0100
> [general] [DBG] set_current_limit: success (tried 1 times)
> ...
> ...
> [general] [DBG] handle_intr: (otg) event: Type-C Port Disconnect Detected()
> [general] [DBG] apply_pdo: setting limit based on Rp val 0
> [general] [DBG] set_current_limit: IIN_HOST: 100 0x0100
> [general] [DBG] set_current_limit: success (tried 1 times)
> [general] [DBG] disable_charging:
>
> Here we disconnect the cable from our device.
>
> [  226.759843] g_mass_storage gadget: high-speed config #1: Linux File-Backed Storage
> [general] [DBG] handle_intr: (otg) event: Type-C Port Connect Detected()
> [general] [DBG] usb_mux_sel: (otg) polarity: 0
>
> Here we reconnected our device to the host but with the cable in a
> different polarity than before. This means the SS mux is not properly
> configured for the current cable orientation and must be switched. Also
> note here that linux seems to report the enumeration in high speed
> before our MCU had a chance to react to the new connection.
>
> [general] [DBG] handle_intr: (otg) event: Hard Reset Sent to Port Partner
> [general] [DBG] handle_intr: (otg) event: Hard Reset Sent to Port Partner
> [general] [DBG] handle_intr: (otg) event: Hard Reset Sent to Port Partner
> [general] [DBG] handle_intr: (otg) event: Source Disabled State Entered
> [general] [DBG] enable_charging: incorrect current limit(3300), setting(100)
> [general] [DBG] set_current_limit: IIN_HOST: 100 0x0100
> [general] [DBG] set_current_limit: success (tried 1 times)
> ...
> ...
> [general] [DBG] handle_intr: (otg) event: Type-C Port Disconnect Detected()
> [general] [DBG] apply_pdo: setting limit based on Rp val 0
> [general] [DBG] set_current_limit: IIN_HOST: 100 0x0100
> [general] [DBG] set_current_limit: success (tried 1 times)
> [general] [DBG] disable_charging:
>
> Here we disconnect our device from the host by removing the cable from
> our device's type-c connector.
>
> [  262.783318] dwc3 dwc3.1.auto: request ffff880168ae3240 was not queued to ep1out
> [  262.920461] g_mass_storage gadget: super-speed config #1: Linux File-Backed Storage
> [general] [DBG] handle_intr: (otg) event: Type-C Port Connect Detected()
> [general] [DBG] usb_mux_sel: (otg) polarity: 0
>
> Here we re-connect our device to the host, but we made sure to leave the
> cable in the same orientation as before. That way the SS mux is already
> configured for the current cable orientation. We still see the USB
> device enumeration occur before the MCU has a chance to respond, but it
> gets enumerated in SuperSpeed because the SS mux was already configured
> for this cable orientation.

This looks like an old bug on xhci which was fixed long ago. Mathias, do
you remember that bug? Which kernel fixed that issue?

> We've come up with a few solutions that are specific to our platform
> in order to reduce the chances of this race condition occuring, but I
> wanted to ask you for your thoughts about how we might be able to work
> around this from Linux. In the event that we cannot completely eliminate
> this race condition, we would like to find some way to "upgrade" the
> connection from high speed to SuperSpeed. Is something like this
> possible?

Unfortunately not. Does this continue to happen if you change the host
machine to e.g. Windows or a more recent linux kernel?

> My naiive understanding wsas that the host would continuously
> poll the SS lines for a valid connection, but I believe I am mistaken
> here. I noticed in the USB protocol traces that there are a couple of
> Link Management Packets (LPMs) that the host sends. One of them seems to
> request the configuration descriptor, and another seems to request the
> link speed be set to 5 Gbps, to which the device acknowledges and
> complies. Is it possible to force this sort of exchange to happen again? 

AFAICT, that's handled entirely at the HW level without SW intervention.

> I've attached yet another tarball containing the dwc3 traces, USB
> protocol data (from a Teledyne LeCroy analyzer, compatible with the "USB
> Protocol Suite" application), and a raw copy of the logs I annotated
> above. The timestamps in the dwc3 traces should line up with the
> timestamps of the dmesg output in the annotated logs. The USB protocol
> data is from a different session, but still contains an example of the
> first enumeration in high speed, and the second in SS, all in one
> sample. The SS enumeration happens at around 8 seconds into the
> 12-second capture.
>
> Thanks in advance for your input here!

I'll have to spend some more time with the logs, but at least from
traces, I couldn't see anything that would give a hint. I'll try looking
at CATC traces later today tomorrow.

>> > I've attached a tarball containing the dmesg output, regdump, and
>> > tracepoints of a succesful session with the patch applied. I've included
>> > the patch in the tarball if you would like to see how I ported it.
>> > Could you please double-check the traces to make sure they meet your
>> > expectations for disabling the U1/U2_ENABLE features? I looked through
>> > the traces and could not find any link state changes to U1 or U2, so it
>> > seems like it's working.
>> 
>> The host, from that run, didn't enable device-initiated U1/U2 entry. So
>> only host-initiated entry is avaiable. Since we have removed
>> ACCEPTU1/ACCEPTU2 bits from DCTL, dwc3 is denying LGO_U1 and LGO_U2. You
>> can confirm that only with a sniffer.
>> 
>> > I have a couple of follow-up questions about this patch:
>> >
>> > * I added some tracepoints in the patch I'm applying but they don't seem
>> >   to be included in the trace output. This patch applies cleanly and I
>> >   haven't seen any compiler warnings about it. Is the
>> >   dwc3_ep0_handle_feature function not called during
>> 
>> it's called from SetFeature() command. If SetFeature() is not sent by
>> the host, then you won't see your tracepoints. FYI, SetFeature would
>> have a bRequest of 3. Here we can all ctrl requests:
>> 
>>  192:     irq/23-dwc3-1765  [000] d..1   213.007591: dwc3_ctrl_req: bRequestType 00 bRequest 05 wValue 0005 wIndex 0000 wLength 0
>>  208:     irq/23-dwc3-1765  [000] d..1   213.010457: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0100 wIndex 0000 wLength 18
>>  237:     irq/23-dwc3-1765  [000] d..1   213.010871: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0304 wIndex 0409 wLength 2
>>  266:     irq/23-dwc3-1765  [000] d..1   213.011330: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0304 wIndex 0409 wLength 40
>>  295:     irq/23-dwc3-1765  [000] d..1   213.011707: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0303 wIndex 0409 wLength 2
>>  324:     irq/23-dwc3-1765  [000] d..1   213.012137: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0303 wIndex 0409 wLength 92
>>  353:     irq/23-dwc3-1765  [000] d..1   213.012548: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0305 wIndex 0409 wLength 2
>>  382:     irq/23-dwc3-1765  [000] d..1   213.012934: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0305 wIndex 0409 wLength 12
>>  411:     irq/23-dwc3-1765  [000] d..1   213.013331: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0f00 wIndex 0000 wLength 5
>>  440:     irq/23-dwc3-1765  [000] d..1   213.013748: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0f00 wIndex 0000 wLength 22
>>  469:     irq/23-dwc3-1765  [000] d..1   213.016006: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0200 wIndex 0000 wLength 9
>>  498:     irq/23-dwc3-1765  [000] d..1   213.016543: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0200 wIndex 0000 wLength 44
>>  527:     irq/23-dwc3-1765  [000] d..1   213.016951: dwc3_ctrl_req: bRequestType 00 bRequest 09 wValue 0001 wIndex 0000 wLength 0
>>  568:     irq/23-dwc3-1765  [000] d..1   213.017569: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0301 wIndex 0409 wLength 2
>>  597:     irq/23-dwc3-1765  [000] d..1   213.017926: dwc3_ctrl_req: bRequestType 80 bRequest 06 wValue 0301 wIndex 0409 wLength 26
>>  626:     irq/23-dwc3-1765  [000] d..1   213.018367: dwc3_ctrl_req: bRequestType 81 bRequest 00 wValue 0000 wIndex 0000 wLength 2
>>  653:     irq/23-dwc3-1765  [000] d..1   213.026404: dwc3_ctrl_req: bRequestType a1 bRequest fe wValue 0000 wIndex 0000 wLength 1
>>  929:     irq/23-dwc3-1765  [000] d..1   214.138385: dwc3_ctrl_req: bRequestType 82 bRequest 00 wValue 0000 wIndex 0081 wLength 2
>>  956:     irq/23-dwc3-1765  [000] d..1   214.138791: dwc3_ctrl_req: bRequestType 02 bRequest 01 wValue 0000 wIndex 0081 wLength 0
>> 1031:     irq/23-dwc3-1765  [000] d..1   214.140214: dwc3_ctrl_req: bRequestType 82 bRequest 00 wValue 0000 wIndex 0081 wLength 2
>> 1058:     irq/23-dwc3-1765  [000] d..1   214.140618: dwc3_ctrl_req: bRequestType 02 bRequest 01 wValue 0000 wIndex 0081 wLength 0
>> 1133:     irq/23-dwc3-1765  [000] d..1   214.142208: dwc3_ctrl_req: bRequestType 82 bRequest 00 wValue 0000 wIndex 0081 wLength 2
>> 1160:     irq/23-dwc3-1765  [000] d..1   214.142587: dwc3_ctrl_req: bRequestType 02 bRequest 01 wValue 0000 wIndex 0081 wLength 0
>> 
>> Note that we don't have any "bRequest 03" substring.
>
> Is it normal to not see any SetFeature commands? I think I was expecting
> to see them pretty frequently.

Totally normal :-) A host is not required to allow for device-initiated
U1/U2 entry. Nor is it required to halt and endpoint, for example.

>> >> > I've attached the traces and dmesg output in an archive titled
>> >> > "gnarbox-ep0out-failure.tar.xz". Please let me know if you need more
>> >> > information about this particular issue.
>> >> 
>> >> I'll go over the traces, this could be something that has been fixed in
>> >> the past.
>> >
>> > Did you get a chance to take a look at the tracepoints for this error?
>> > No worries if you haven't gotten a chance yet.
>> 
>> I did, nothing screaming wrong with them. That "can't enable ep0out"
>> still puzzles me a bit. What we see on tracepoints is a time out:
>> 
>> 	modprobe-1281  [002] d..1   246.504043: dwc3_gadget_ep_cmd: ep0out: cmd 'Set Endpoint Transfer Resource' [2] params 00000001 00000000 00000000 --> status: Timed Out
>> 
>> But I have no idea why that happens. It could be that this was fixed
>> long ago, but we won't know unless we run a recent kernel on your
>> setup. I, certainly, haven't seen that :-)
>
> One of my colleagues is currently testing kernel 5.0.5 without the patch
> you suggested and it looks pretty good so far! We'll keep you updated if
> this changes. With this issue in particular, it is still unclear how to
> get our device into this state. We'll definitely keep an eye out for it
> on 5.0.5.

If 5.0.5 is working, then you could run a git bisect and find the commit
that fixed it. If we have a failing commit and a working commit, git
bisect will help finding the fix, then you could simply cherry-pick it.

Best regards

-- 
balbi

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux