Re: Mass Storage Gadget Kthread

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Fri, 2 Oct 2015 14:57:54 -0400 (EDT)

On Fri, 2 Oct 2015, Felipe Balbi wrote:

> > Figure [1] is misleading.  The 7 POLLs you see are at the very start of 
> > a READ.  It's not surprising that the host can poll 7 times before the 
> > gadget manages to read the first block of data from the backing file.  
> > This isn't a case where the kthread could have been doing something 
> > useful instead of waiting around to hear from the host.
> > 
> > Figure [3] is difficult to interpret because it doesn't include the 
> > transfer lengths.  I can't tell what was going on during the 37 + 50 
> > POLLs before the first OUT transfer.  If that was a CDB starting a new
> 
> yeah, sorry about that. The 37 + 50 POLLs out is a CBW (31-byte). Before those
> we had a CSW.

Okay.  I don't know why there was such a long delay -- more than 1 ms
since there are 11 SOF packets.  A context switch doesn't take that
long.  In any case, the kthread submits the usb_request for the next 
CBW without waiting for the previous CSW to complete (although it does 
have to wait for the IN transfer preceding the CSW to complete if 
you're using only two I/O buffers).
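
As a rough check of the "more than 1 ms" estimate, here is a minimal sketch.
It assumes a high-speed link, where an SOF goes out every 125 us microframe,
and it assumes the analyzer really did see 11 separate SOFs. The function
name is only illustrative.

```python
# High-speed USB sends one SOF per 125 us microframe.
UFRAME_US = 125

def min_elapsed_us(sof_count):
    """Lower bound on elapsed time spanned by sof_count consecutive SOFs.

    N observed SOFs bracket at least N - 1 complete microframe intervals.
    """
    if sof_count < 1:
        return 0
    return (sof_count - 1) * UFRAME_US

# 11 SOFs span at least 10 microframes.
print(min_elapsed_us(11), "us")
```

Even counting only the complete intervals between the first and last SOF,
the gap comes out above 1 ms.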

Note that in Figure 1, there was no delay between the CSW and the
following CBW.

How does the throughput change if you increase the num_buffers module 
parameter?

> > > On figure two we can see that on this particular session, I had as much as 15%
> > > of the bandwidth wasted on POLLs. With this current setup I'm at 34MB/sec and with
> > > the added 15% that would get really close to 40MB/sec.
> > 
> > So high speed, right?  Are the numbers in the figure handshake _counts_
> 
> counts
> 
> > or handshake _times_?  A simple NAK doesn't use much bandwidth.  Even
> > if 15% of the handshakes are NAKs, it doesn't mean you're wasting 15%
> > of the bandwidth.
> 
> sure it means. Given a uFrame, I can stuff (theoretically) 13 bulk transactions
> in it.

13 512-byte bulk transactions.

> If I have a token (IN/OUT) which gets a NAK, that's one less transaction
> I can stuff in this uFrame, right ?

No, because a NAKed IN transaction doesn't transfer 512 bytes.  
There's the IN token and the NAK handshake, but no DATAx packet.  You
can fit several of those in the same uframe with 13 512-byte transfers.  
In fact, the second-to-last line in Table 5-10 of the USB-2 spec shows
that with 13 512-byte transfers, there still are 129 bytes of bus time
available in a uframe.
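
The arithmetic behind that table row can be sketched as follows. This is an
illustration only: it assumes the 7500-byte high-speed microframe (480 Mb/s
for 125 us) and the 55-byte per-transaction protocol overhead that, as far
as I recall, Table 5-10 uses for bulk transfers. The names are made up for
the example.

```python
# Bus time in one high-speed microframe, in bytes:
# 480 Mb/s * 125 us = 60000 bits = 7500 bytes.
UFRAME_BYTES = 7500
# Assumed per-transaction bulk protocol overhead (token, data and handshake
# framing plus inter-packet delays), as recalled from Table 5-10.
BULK_OVERHEAD_BYTES = 55
# Maximum packet size for a high-speed bulk endpoint.
PAYLOAD_BYTES = 512

def bulk_budget(payload=PAYLOAD_BYTES):
    """Return (max complete bulk transactions per uframe, bytes left over)."""
    per_txn = payload + BULK_OVERHEAD_BYTES
    max_txns = UFRAME_BYTES // per_txn
    leftover = UFRAME_BYTES - max_txns * per_txn
    return max_txns, leftover

max_txns, leftover = bulk_budget()
print(max_txns, "transactions,", leftover, "bytes of bus time left")
```

The leftover bus time is where NAKed IN transactions, which carry no DATAx
packet, can fit without displacing any of the 512-byte transfers.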

A NAKed bulk-OUT transfer would indeed uselessly transfer 512 data
bytes.  But they (almost) never occur in high-speed connections;  
instead the controllers exchange PING and NYET packets.

> Note that when I have the 37 POLLs, you can see 8 SOFs. This is just the sniffer
> SW trying to aggregate SOFs (I should drop that aggregation, it's really annoying).

Does the software let you do that?  Last time I checked the TotalPhase
Control Center program, there were some things it would not do at all.  
Separating out all the components of a split transaction, for example.

> > > So the question is, why do we have to wait for that kthread to get scheduled ?
> > > Why couldn't we skip it completely ? Is there really anything left in there that
> > > couldn't be done from within usb_request->complete() itself ?
> > 
> > The real answer is the calls to vfs_read() and vfs_write() -- those 
> > have to occur in process context.
> 
> would a threaded IRQ handler be enough ? For the sake of argument, let's assume
> that all UDC drivers use threaded irqs and they are properly masking their IRQs
> in the top half for the bottom half (the IRQ thread) to run.
> 
> This means, unless I'm missing something, that we could switch a chunk of the
> gadget framework to mutexes instead of spin locks (not all of it though, but
> let's ignore that side for now). In that case, we could safely remove spin locks
> from ->complete() and use mutexes and then we wouldn't need this extra
> kthread at all, right ?

Well, for sure you wouldn't need the kthread.  It's not clear whether
this would be an advantage, though.  Now the CPU would have to schedule
the bottom-half IRQ thread instead of scheduling the kthread, so the
total number of context switches would be the same.

In fact, if you're using an SMP system then moving to threaded IRQs is
likely to make things worse.  A bottom-half handler can run on only one
CPU at a time, whereas in the current code the entire IRQ handler can
run in parallel with the kthread on a different CPU.  In other words, 
moving to threaded IRQs would serialize all the bottom-half processing 
in the UDC driver with the work done in the gadget's kthread.

Alan Stern
