Hi,

On Fri, Oct 02, 2015 at 02:57:54PM -0400, Alan Stern wrote:
> On Fri, 2 Oct 2015, Felipe Balbi wrote:
> 
> > > Figure [1] is misleading. The 7 POLLs you see are at the very start of
> > > a READ. It's not surprising that the host can poll 7 times before the
> > > gadget manages to read the first block of data from the backing file.
> > > This isn't a case where the kthread could have been doing something
> > > useful instead of waiting around to hear from the host.
> > > 
> > > Figure [3] is difficult to interpret because it doesn't include the
> > > transfer lengths. I can't tell what was going on during the 37 + 50
> > > POLLs before the first OUT transfer. If that was a CDB starting a new
> > 
> > yeah, sorry about that. The 37 + 50 POLLs out is a CBW (31-byte). Before those
> > we had a CSW.
> 
> Okay. I don't know why there was such a long delay -- more than 1 ms
> since there are 11 SOF packets. A context switch doesn't take that
> long. In any case, the kthread submits the usb_request for the next
> CBW without waiting for the previous CSW to complete (although it does
> have to wait for the IN transfer preceding the CSW to complete if
> you're using only two I/O buffers).
> 
> Note that in Figure 1, there was no delay between the CSW and the
> following CBW.
> 
> How does the throughput change if you increase the num_buffers module
> parameter?

I was using 32, IIRC. I'll try again just to make sure.

> > > > On figure two we can see that on this particular session, I had as much as 15%
> > > > of the bandwidth wasted on POLLs. With this current setup I'm 34MB/sec and with
> > > > the added 15% that would get really close to 40MB/sec.
> > > 
> > > So high speed, right? Are the numbers in the figure handshake _counts_
> > 
> > counts
> > 
> > > or handshake _times_? A simple NAK doesn't use much bandwidth. Even
> > > if 15% of the handshakes are NAKs, it doesn't mean you're wasting 15%
> > > of the bandwidth.
> > 
> > sure it means. Given a uFrame, I can stuff (theoretically) 13 bulk transactions
> > in it.
> 
> 13 512-byte bulk transactions.

I read it as 13 up to 512-byte bulk transactions, no ? I mean, we can
stuff 26 256-byte bulk transactions :-) It's either a full packet or a
short-packet which terminates the transfer.

> > If I have a token (IN/OUT) which gets a NAK, that's one less transaction
> > I can stuff in this uFrame, right ?
> 
> No, because a NAKed IN transaction doesn't transfer 512 bytes.
> There's the IN token and the NAK handshake, but no DATAx packet. You
> can fit several of those in the same uframe with 13 512-byte transfers.
> In fact, the second-to-last line in Table 5-10 of the USB-2 spec shows
> that with 13 512-byte transfers, there still are 129 bytes of bus time
> available in a uframe.
> 
> A NAKed bulk-OUT transfer would indeed uselessly transfer 512 data
> bytes. But they (almost) never occur in high-speed connections;
> instead the controllers exchange PING and NYET packets.

right.
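(For reference, a quick back-of-the-envelope sketch from the numbers
above, assuming 8000 uframes per second at high speed:

	13 transactions * 512 bytes * 8000 uframes/s ~= 53.2 MB/sec

so the 34 MB/sec I'm seeing is only about 64% of the theoretical bulk
ceiling; whatever the NAKs actually cost, there is headroom left.)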
> > Note that when I have the 37 POLLs, you can see 8 SOFs. This is just the sniffer
> > SW trying to aggregate SOFs (I should drop that aggregation, it's really annoying).
> 
> Does the software let you do that? Last time I checked the TotalPhase
> Control Center program, there were some things it would not do at all.
> Separating out all the components of a split transaction, for example.

I'll check. There's a way to tell it you want sequential sniffing, and
never aggregate.

> > > > So the question is, why do we have to wait for that kthread to get scheduled ?
> > > > Why couldn't we skip it completely ? Is there really anything left in there that
> > > > couldn't be done from within usb_request->complete() itself ?
> > > 
> > > The real answer is the calls to vfs_read() and vfs_write() -- those
> > > have to occur in process context.
> > 
> > would a threaded IRQ handler be enough ? For the sake of argument, let's assume
> > that all UDC drivers use threaded irqs and they are properly masking their IRQs
> > in the top half for the bottom half (the IRQ thread) to run.
> > 
> > This means, unless I'm missing something, that we could switch a chunk of the
> > gadget framework to mutexes instead of spin locks (not all of it though, but
> > let's ignore that side for now). In that case, we could safely remove spin locks
> > from ->complete() and use mutexes and then we wouldn't need this extra
> > kthread at all, right ?
> 
> Well, for sure you wouldn't need the kthread. It's not clear whether
> this would be an advantage, though. Now the CPU would have to schedule
> the bottom-half IRQ thread instead of scheduling the kthread, so the
> total number of context switches would be the same.

we're already using threaded IRQs in some UDC drivers.

> In fact, if you're using an SMP system then moving to threaded IRQs is
> likely to make things worse. A bottom-half handler can run on only one
> CPU at a time, whereas in the current code the entire IRQ handler can
> run in parallel with the kthread on a different CPU. In other words,
> moving to threaded IRQs would serialize all the bottom-half processing
> in the UDC driver with the work done in the gadget's kthread.

aaaaa, good point :-)

Guess I'll continue profiling to figure out why we take so long to get
scheduled, but chances are there won't be much we can do, apparently.

-- 
balbi