On 6/14/2021 12:30 PM, Ferry Toth wrote:
>
> On 14-06-2021 20:58, Wesley Cheng wrote:
>>
>> On 6/12/2021 2:27 PM, Ferry Toth wrote:
>>> Hi
>>>
>>> On 11-06-2021 15:21, Andy Shevchenko wrote:
>>>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus <heikki.krogerus@xxxxxxxxxxxxxxx> wrote:
>>>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Wesley Cheng <wcheng@xxxxxxxxxxxxxx> writes:
>>>>>>>>>>>>> to be honest, I don't think these should go in (apart from the build failure) because it's likely to break instantiations of the core with differing FIFO sizes. Some instantiations even have some endpoints with dedicated functionality that requires the default FIFO size configured during coreConsultant instantiation. I know of at least OMAP5 and some Intel implementations which have dedicated endpoints for processor tracing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the available endpoints, which means that if a gadget driver gets loaded and takes over most of the FIFO space because of this resizing, processor tracing will have a hard time running. That being said, processor tracing isn't supported in upstream at this moment.
>>>>>>>>>>>>>
>>>>>>>>>>> I agree that the application of this logic may differ between vendors, hence why I wanted to keep this controllable by the DT property, so that those which do not support this use case can leave it disabled. The logic is there to ensure that for a given USB configuration, each EP would have at least 1 TX FIFO. For USB configurations which don't utilize all available IN EPs, it would allow re-allocation of internal memory to EPs which will actually be in use.
>>>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds like we can be a little nicer in this regard.
>>>>>>>>>>
>>>>>>>>> Don't get me wrong, I think once those features become available upstream, we can improve the logic. From what I remember when looking
>>>>>>>> sure, I support that. But I want to make sure the first cut isn't likely to break things left and right :)
>>>>>>>>
>>>>>>>> Hence, let's at least get more testing.
>>>>>>>>
>>>>>>> Sure, I'd hope that the other users of DWC3 will also see some pretty big improvements on the TX path with this.
>>>>>> fingers crossed
>>>>>>
>>>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes were just to remove physical EP1 and 2 from the DWC3 endpoint list. If that
>>>>>>>> right, that's the reason why we introduced the endpoint feature flags. The end goal was that the UDC would be able to have custom feature flags paired with ->validate_endpoint() or whatever before allowing it to be enabled. Then the UDC driver could tell UDC core to skip that endpoint on that particular platform without interfering with everything else.
>>>>>>>>
>>>>>>>> Of course, we still need to figure out a way to abstract the different dwc3 instantiations.
>>>>>>>>
>>>>>>>>> was the change which ended up upstream for the Intel tracer, then we could improve the logic to avoid re-sizing those particular EPs.
>>>>>>>> The problem then, just as I mentioned in the previous paragraph, will be coming up with a solution that's elegant and works for all different instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>>>
>>>>>>> Well, at least for the TX FIFO resizing logic, we'd only be needing to focus on the DWC3 implementation.
>>>>>>>
>>>>>>> You bring up another good topic that I'll eventually need to take a look at, which is a nice way to handle vendor-specific endpoints and how they can co-exist with other "normal" endpoints. We have a few special HW EPs as well, which we try to maintain separately in our DWC3 vendor driver, but it isn't the most convenient, or most pretty, method :).
>>>>>> Awesome, as mentioned, the endpoint feature flags were added exactly to allow for these vendor-specific features :-)
>>>>>>
>>>>>> I'm more than happy to help with testing now that I finally got our SM8150 Surface Duo device tree accepted by Bjorn ;-)
>>>>>>
>>>>>>>>> However, I'm not sure how the changes would look in the end, so I would like to wait until later down the line to include that :).
>>>>>>>> Fair enough, I agree. Can we get some more testing of $subject, though? Did you test $subject with upstream too? Which gadget drivers did you use? How did you test?
>>>>>>>
>>>>>>> The results that I included in the cover page were tested with the pure upstream kernel on our device. The numbers below are using the ConfigFS gadget w/ a mass-storage-only composition.
>>>>>>>
>>>>>>> Test Parameters:
>>>>>>>  - Platform: Qualcomm SM8150
>>>>>>>  - bMaxBurst = 6
>>>>>>>  - USB req size = 256kB
>>>>>>>  - Num of USB reqs = 16
>>>>>> do you mind testing with the regular request size (16KiB) and 250 requests? I think we can even do 15 bursts in that case.
>>>>>>
>>>>>>>  - USB Speed = Super-Speed
>>>>>>>  - Function Driver: Mass Storage (w/ ramdisk)
>>>>>>>  - Test Application: CrystalDiskMark
>>>>>>>
>>>>>>> Results:
>>>>>>>
>>>>>>> TXFIFO Depth = 3 max packets
>>>>>>>
>>>>>>> Test Case  | Data Size | AVG tput (in MB/s)
>>>>>>> -------------------------------------------
>>>>>>> Sequential | 1 GB x    |
>>>>>>> Read       | 9 loops   | 193.60
>>>>>>>            |           | 195.86
>>>>>>>            |           | 184.77
>>>>>>>            |           | 193.60
>>>>>>> -------------------------------------------
>>>>>>>
>>>>>>> TXFIFO Depth = 6 max packets
>>>>>>>
>>>>>>> Test Case  | Data Size | AVG tput (in MB/s)
>>>>>>> -------------------------------------------
>>>>>>> Sequential | 1 GB x    |
>>>>>>> Read       | 9 loops   | 287.35
>>>>>>>            |           | 304.94
>>>>>>>            |           | 289.64
>>>>>>>            |           | 293.61
>>>>>> I remember getting close to 400MiB/sec with Intel platforms without resizing FIFOs, and I'm sure the FIFO size was set to 2x1024, though my memory could be failing.
>>>>>>
>>>>>> Then again, I never ran with CrystalDiskMark; I was using my own tool (it's somewhere on GitHub; if you care, I can look up the URL).
>>>>>>
>>>>>>> We also have internal numbers which have shown similar improvements as well. Those are over networking/tethering interfaces, so testing IPERF loopback over TCP/UDP.
>>>>>> loopback iperf? That would skip the wire, no?
>>>>>>
>>>>>>>>> size of 2 and TX threshold of 1, this would really not be beneficial to us, because we can only change the TX threshold to 2 at max, and at least in my observations, once we have to go out to system memory to fetch the next data packet, that latency takes enough time for the controller to end the current burst.
>>>>>>>> What I noticed with g_mass_storage is that we can amortize the cost of fetching data from memory with a deeper request queue. Whenever I test(ed) g_mass_storage, I was doing so with 250 requests. And that was enough to give me very good performance. Never had to poke at TX FIFO resizing. Did you try something like this too?
>>>>>>>>
>>>>>>>> I feel that allocating more requests is a far simpler and more generic method than changing FIFO sizes :)
>>>>>>>>
>>>>>>> I wish I had a USB bus trace handy to show you, which would make it very clear how the USB bus is currently utilized with TXFIFO size 2 vs 6. So increasing the number of USB requests will help if there was a bottleneck at the SW level, where the application/function driver utilizing the DWC3 was submitting data much faster than the HW was processing it.
>>>>>>>
>>>>>>> So yes, this method of increasing the # of USB reqs will definitely help with situations such as HSUSB or in SSUSB when EP bursting isn't used. The TXFIFO resize comes into play for SSUSB, which utilizes endpoint bursting.
>>>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a role here too. I have clear memories of testing this very scenario of bursting (using g_mass_storage at the time) because I was curious about it. Back then, my tests showed no difference in behavior.
>>>>>>
>>>>>> It would be nice if Heikki could test Intel parts with and without your changes on g_mass_storage with 250 requests.
>>>>> Andy, you have a system at hand that has the DWC3 block enabled, right? Can you help out here?
>>>> I'm not sure if I will have time soon; I Cc'ed Ferry, who has a few more test cases (I have only one or two) and maybe can help. But I'll keep this in mind.
>>> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches apply. Switching between host/gadget works, no connections dropping, no errors in dmesg.
>>>
>>> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have a composite device created from configfs with gser / eem / mass_storage / uac2.
>>>
>>> Tested iperf3 performance in host (93.6 Mbits/sec) and gadget (207 Mbits/sec) mode. Compared to v5.10.41 without the patches: host (93.4 Mbits/sec) and gadget (198 Mbits/sec).
>>>
>>> Gadget seems to be a little faster with the patches, but that might also be caused by something else; on v5.10.41 I see the bitrate bouncing between 207 and 199.
>>>
>>> I saw a mention to test iperf3 to self (loopback): 3.09 Gbits/sec. With v5.10.41, 3.07 Gbits/sec. Not bad for a 500MHz device.
>>>
>>> With gnome-disks I did a read access benchmark: 35.4 MB/s, with v5.10.41 34.7 MB/s. This might be limited by Edison's internal eMMC speed (when booting, U-Boot reads the kernel at 21.4 MiB/s).
>>>
>> Hi Ferry,
>>
>> Thanks for the testing.
>> Just to double check, did you also enable the property which enables the TXFIFO resize feature on the platform? For example, for the QCOM SM8150 platform, we're adding the following to our device tree node:
>>
>> tx-fifo-resize
>>
>> If not, then your results at least confirm that w/o the property present, the changes won't break anything :). Thanks again for the initial testing!
>
> No I didn't. Afaik we don't have a devicetree property to set.
>
> But I'd be happy to test that as well. But where to set the property?
>
> dwc3_pci_mrfld_properties[] in dwc3-pci?
>
Hi Ferry,

Not too sure which DWC3 driver is used for the Intel platform, but I believe that should be the one (if that's what is normally used). We'd just need to add an entry w/ the below (see the sketch at the very bottom of this mail):

PROPERTY_ENTRY_BOOL("tx-fifo-resize")

Thanks
Wesley Cheng

>> Thanks
>> Wesley Cheng
>>
>>>>>>> Now with endpoint bursting, if the function notifies the host that bursting is supported, when the host sends the ACK for the Data Packet, it should have a NumP value equal to the bMaxBurst reported in the EP
>>>>>> Yes and no. Looking back at the history, we used to configure NUMP based on bMaxBurst, but it was changed later in commit 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a problem reported by John Youn.
>>>>>>
>>>>>> And now we've come full circle. Because even if I believe more requests are enough for bursting, NUMP is limited by the RxFIFO size. This ends up supporting your claim that we need RxFIFO resizing if we want to squeeze more throughput out of the controller.
>>>>>>
>>>>>> However, note that this is about RxFIFO size, not TxFIFO size. In fact, looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP (emphasis is mine):
>>>>>>
>>>>>>         "Number of Packets (NumP). This field is used to indicate the number of Data Packet buffers that the **receiver** can accept. The value in this field shall be less than or equal to the maximum burst size supported by the endpoint as determined by the value in the bMaxBurst field in the Endpoint Companion Descriptor (refer to Section 9.6.7)."
>>>>>>
>>>>>> So, NumP is for the receiver, not the transmitter. Could you clarify what you mean here?
>>>>>>
>>>>>> /me keeps reading
>>>>>>
>>>>>> Hmm, table 8-15 tries to clarify:
>>>>>>
>>>>>>         "Number of Packets (NumP).
>>>>>>
>>>>>>         For an OUT endpoint, refer to Table 8-13 for the description of this field.
>>>>>>
>>>>>>         For an IN endpoint this field is set by the endpoint to the number of packets it can transmit when the host resumes transactions to it. This field shall not have a value greater than the maximum burst size supported by the endpoint as indicated by the value in the bMaxBurst field in the Endpoint Companion Descriptor. Note that the value reported in this field may be treated by the host as informative only."
>>>>>>
>>>>>> However, if I remember correctly (please verify the dwc3 databook), NUMP in DCFG was only for receive buffers. Thinh, John, how does dwc3 compute NumP for TX/IN endpoints? Is that computed as a function of DCFG.NUMP or TxFIFO size?
>>>>>>
>>>>>>> desc. If we have a TXFIFO size of 2, then normally what I have seen is that after 2 data packets, the device issues a NRDY.
>>>>>>> So then we'd need to send an ERDY once data is available within the FIFO, and the same sequence happens until the USB request is complete. With this constant NRDY/ERDY handshake going on, you actually see that the bus is under-utilized. When we increase an EP's FIFO size, then you'll see constant bursts for a request, until the request is done or the host runs out of RXFIFO (i.e. no interruption [on the USB protocol level] during USB request data transfer).
>>>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>>>
>>>>>>>>>>>> Good points.
>>>>>>>>>>>>
>>>>>>>>>>>> Wesley, what kind of testing have you done on this on different devices?
>>>>>>>>>>>>
>>>>>>>>>>> As mentioned above, these changes have been present on end user devices for the past few years, so it's been through a lot of testing :).
>>>>>>>>>> all with the same gadget driver. Also, who uses USB on Android devices these days? Most of the data transfer goes via WiFi or Bluetooth, anyway :-)
>>>>>>>>>>
>>>>>>>>>> I guess only developers are using USB during development to flash dev images heh.
>>>>>>>>>>
>>>>>>>>> I used to be a customer-facing engineer, so honestly I did see some really interesting and crazy designs. Again, we do have non-Android products that use the same code, and it has been working in there for a few years as well. The TXFIFO sizing really has helped with multimedia use cases, which use isoc endpoints, since especially on those lower-end CPU chips where latencies across the system are much larger, a missed ISOC interval leads to a pop in your ear.
>>>>>>>> This is good background information. Thanks for bringing this up. Admittedly, we still have ISOC issues with dwc3. I'm interested in knowing if a deeper request queue would also help here.
>>>>>>>>
>>>>>>>> Remember dwc3 can accommodate 255 requests + link for each endpoint. If our gadget driver uses a low number of requests, we're never really using the TRB ring to our benefit.
>>>>>>>>
>>>>>>> We're actually using both a deeper USB request queue + TX FIFO resizing :).
>>>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX burst behavior.
>>>>> --
>>>>> heikki

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
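
For reference, a rough sketch of the entry Wesley describes above. This is only an illustration, not part of the patch series: it assumes the property would be added to the Merrifield table dwc3_pci_mrfld_properties[] in drivers/usb/dwc3/dwc3-pci.c (the thread has not confirmed that this is the right table for Ferry's board), and it elides that table's existing entries.

	#include <linux/property.h>

	static const struct property_entry dwc3_pci_mrfld_properties[] = {
		/* ... existing Merrifield entries stay unchanged ... */

		/*
		 * Opt in to the TX FIFO resize logic. This mirrors the bare
		 * "tx-fifo-resize" boolean that DT-based platforms (e.g. the
		 * QCOM SM8150 node mentioned above) set in the dwc3 node.
		 */
		PROPERTY_ENTRY_BOOL("tx-fifo-resize"),

		{ }	/* sentinel terminating the property table */
	};

On DT-based platforms the equivalent is simply adding the "tx-fifo-resize" boolean property to the dwc3 controller node, as Wesley describes for SM8150 earlier in this mail.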