Hi,

Peter Chen <peter.chen@xxxxxxxxxxxxx> writes:
> On Fri, Oct 02, 2015 at 11:05:28AM -0500, Felipe Balbi wrote:
>> Hi Alan,
>>
>> here's a question which I hope you can help me understand :-)
>>
>> Why do we have that kthread for the mass storage gadgets ? I noticed a very
>> interesting behavior.
>>
>> Basically, the MSC class works in a loop like so:
>>
>> CBW
>> Data Transfer
>> CSW
>>
>> In our implemention, what we do is:
>>
>> CBW
>> wake_up_process()
>> Data Transfer
>> wake_up_process()
>> CSW
>> wake_up_process()
>>
>> Now here's the interesting bit. Every time we wake_up_process(), we basically
>> don't do anything until MSC's kthread gets finally scheduled and has a chance of
>> doing its job. This means that the host keeps sending us tokens but the UDC
>> doesn't have any request queued to start a transfer. This happens specially with
>> IN endpoints, not so much on OUT directions. See figure one [1] we can see that
>> host issues over 7 POLLs before UDC has finally started a usb_request, sometimes
>> this goes for even longer (see image [3]).
>>
>> On figure two we can see that on this particular session, I had as much as 15%
>> of the bandwidth wasted on POLLs. With this current setup I'm 34MB/sec and with
>> the added 15% that would get really close to 40MB/sec.
>>
>> So the question is, why do we have to wait for that kthread to get scheduled ?
>> Why couldn't we skip it completely ? Is there really anything left in there that
>> couldn't be done from within usb_request->complete() itself ?
>>
>> I'll spend some time on that today and really dig that thing up, but if you know
>> the answer off the top of your head, I'd be happy to hear.
>>
>
> To get the best performance, you may try to see as least IN-NAKs
> or NYET-PINGs as possible within uFrame, the software should
> always queue the enough requests, and hardware FIFO should always
> be ready before host sends the token.
>
> From my test (usbtest & g_loopback), with the best situations,
> the chipidea hardware can get 10 transactions for OUT and 11
> transactions for IN within uFrame, and chipidea leaves a little more
> space before end of SoF for last transaction, so the other hardware
> may get 11 transactions for OUT and 12 transactions for IN within
> uFrame (I see it at Intel platform), so from what I see, the best
> performance at Linux for bulk is 44MB for OUT and 48MB for IN.

that's all clear :-) The point is that I _do_ get a ton of PINGs because the
gadget doesn't queue requests fast enough and I'm, currently, blaming the
latency of the kthread() itself, though I haven't been successful at proving
that statement thus far.

-- 
balbi
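
For reference, the figures quoted above can be checked with a few lines of arithmetic. This is only a rough sketch, assuming high-speed bulk with 512-byte max packets and 8000 uFrames per second (neither number is stated in the thread; both are the standard USB 2.0 high-speed values). The function names are made up for illustration. The quoted 44MB and 48MB appear to correspond to bytes/1024/1000, and the "close to 40MB/sec" estimate is read here as simply 34 plus 15%.

```python
# Assumed USB 2.0 high-speed constants (not stated in the thread).
UFRAMES_PER_SECOND = 8000   # 8 uFrames per 1 ms frame
BULK_MAX_PACKET = 512       # bytes per high-speed bulk transaction


def bulk_bytes_per_second(transactions_per_uframe):
    """Payload rate if every transaction carries a full 512-byte packet."""
    return transactions_per_uframe * BULK_MAX_PACKET * UFRAMES_PER_SECOND


def with_polls_recovered(current_mb_per_s, wasted_fraction):
    """One simple reading of 'the added 15%': current rate plus that share."""
    return current_mb_per_s * (1.0 + wasted_fraction)


if __name__ == "__main__":
    # Transaction counts per uFrame as reported in the quoted mail.
    for label, n in (("chipidea OUT", 10), ("chipidea IN", 11),
                     ("other OUT", 11), ("other IN", 12)):
        rate = bulk_bytes_per_second(n)
        print(f"{label}: {n}/uFrame -> {rate} B/s "
              f"({rate / 1024 / 1000:.0f} 'MB' as quoted)")

    # Mass storage session: 34 MB/s measured, 15% lost to POLLs.
    print(f"estimate without POLLs: {with_polls_recovered(34, 0.15):.1f} MB/s")
```

If the 15% is instead taken as a share of the total bus time, the estimate becomes 34 / 0.85, which lands at 40 MB/s, so either reading is consistent with the number given.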