Hi,
Wesley Cheng <wcheng@xxxxxxxxxxxxxx> writes:
to be honest, I don't think these should go in (apart from the build
failure) because it's likely to break instantiations of the core with
differing FIFO sizes. Some instantiations even have some endpoints with
dedicated functionality that requires the default FIFO size configured
during coreConsultant instantiation. I know of at OMAP5 and some Intel
implementations which have dedicated endpoints for processor tracing.
With OMAP5, these endpoints are configured at the top of the available
endpoints, which means that if a gadget driver gets loaded and takes
over most of the FIFO space because of this resizing, processor tracing
will have a hard time running. That being said, processor tracing isn't
supported in upstream at this moment.
I agree that the application of this logic may differ between vendors,
hence why I wanted to keep this controllable by the DT property, so that
for those which do not support this use case can leave it disabled. The
logic is there to ensure that for a given USB configuration, for each EP
it would have at least 1 TX FIFO. For USB configurations which don't
utilize all available IN EPs, it would allow re-allocation of internal
memory to EPs which will actually be in use.
The feature ends up being all-or-nothing, then :-) It sounds like we can
be a little nicer in this regard.
Don't get me wrong, I think once those features become available
upstream, we can improve the logic. From what I remember when looking
sure, I support that. But I want to make sure the first cut isn't likely
to break things left and right :)
Hence, let's at least get more testing.
Sure, I'd hope that the other users of DWC3 will also see some pretty
big improvements on the TX path with this.
fingers crossed
at Andy Shevchenko's Github, the Intel tracer downstream changes were
just to remove physical EP1 and 2 from the DWC3 endpoint list. If that
right, that's the reason why we introduced the endpoint feature
flags. The end goal was that the UDC would be able to have custom
feature flags paired with ->validate_endpoint() or whatever before
allowing it to be enabled. Then the UDC driver could tell UDC core to
skip that endpoint on that particular platform without interefering with
everything else.
Of course, we still need to figure out a way to abstract the different
dwc3 instantiations.
was the change which ended up upstream for the Intel tracer then we
could improve the logic to avoid re-sizing those particular EPs.
The problem then, just as I mentioned in the previous paragraph, will be
coming up with a solution that's elegant and works for all different
instantiations of dwc3 (or musb, cdns3, etc).
Well, at least for the TX FIFO resizing logic, we'd only be needing to
focus on the DWC3 implementation.
You bring up another good topic that I'll eventually needing to be
taking a look at, which is a nice way we can handle vendor specific
endpoints and how they can co-exist with other "normal" endpoints. We
have a few special HW eps as well, which we try to maintain separately
in our DWC3 vendor driver, but it isn't the most convenient, or most
pretty method :).
Awesome, as mentioned, the endpoint feature flags were added exactly to
allow for these vendor-specific features :-)
I'm more than happy to help testing now that I finally got our SM8150
Surface Duo device tree accepted by Bjorn ;-)
However, I'm not sure how the changes would look like in the end, so I
would like to wait later down the line to include that :).
Fair enough, I agree. Can we get some more testing of $subject, though?
Did you test $subject with upstream too? Which gadget drivers did you
use? How did you test
The results that I included in the cover page was tested with the pure
upstream kernel on our device. Below was using the ConfigFS gadget w/ a
mass storage only composition.
Test Parameters:
- Platform: Qualcomm SM8150
- bMaxBurst = 6
- USB req size = 256kB
- Num of USB reqs = 16
do you mind testing with the regular request size (16KiB) and 250
requests? I think we can even do 15 bursts in that case.
- USB Speed = Super-Speed
- Function Driver: Mass Storage (w/ ramdisk)
- Test Application: CrystalDiskMark
Results:
TXFIFO Depth = 3 max packets
Test Case | Data Size | AVG tput (in MB/s)
-------------------------------------------
Sequential|1 GB x |
Read |9 loops | 193.60
| | 195.86
| | 184.77
| | 193.60
-------------------------------------------
TXFIFO Depth = 6 max packets
Test Case | Data Size | AVG tput (in MB/s)
-------------------------------------------
Sequential|1 GB x |
Read |9 loops | 287.35
| | 304.94
| | 289.64
| | 293.61
I remember getting close to 400MiB/sec with Intel platforms without
resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
memory could be failing.
Then again, I never ran with CrystalDiskMark, I was using my own tool
(it's somewhere in github. If you care, I can look up the URL).
We also have internal numbers which have shown similar improvements as
well. Those are over networking/tethering interfaces, so testing IPERF
loopback over TCP/UDP.
loopback iperf? That would skip the wire, no?
size of 2 and TX threshold of 1, this would really be not beneficial to
us, because we can only change the TX threshold to 2 at max, and at
least in my observations, once we have to go out to system memory to
fetch the next data packet, that latency takes enough time for the
controller to end the current burst.
What I noticed with g_mass_storage is that we can amortize the cost of
fetching data from memory, with a deeper request queue. Whenever I
test(ed) g_mass_storage, I was doing so with 250 requests. And that was
enough to give me very good performance. Never had to poke at TX FIFO
resizing. Did you try something like this too?
I feel that allocating more requests is a far simpler and more generic
method that changing FIFO sizes :)
I wish I had a USB bus trace handy to show you, which would make it very
clear how the USB bus is currently utilized with TXFIFO size 2 vs 6. So
by increasing the number of USB requests, that will help if there was a
bottleneck at the SW level where the application/function driver
utilizing the DWC3 was submitting data much faster than the HW was
processing them.
So yes, this method of increasing the # of USB reqs will definitely help
with situations such as HSUSB or in SSUSB when EP bursting isn't used.
The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
bursting.
Hmm, that's not what I remember. Perhaps the TRB cache size plays a role
here too. I have clear memories of testing this very scenario of
bursting (using g_mass_storage at the time) because I was curious about
it. Back then, my tests showed no difference in behavior.
It could be nice if Heikki could test Intel parts with and without your
changes on g_mass_storage with 250 requests.