Hi,

On Fri, Oct 02, 2015 at 02:57:54PM -0400, Alan Stern wrote:
> On Fri, 2 Oct 2015, Felipe Balbi wrote:
> 
> > > Figure [1] is misleading. The 7 POLLs you see are at the very start of
> > > a READ. It's not surprising that the host can poll 7 times before the
> > > gadget manages to read the first block of data from the backing file.
> > > This isn't a case where the kthread could have been doing something
> > > useful instead of waiting around to hear from the host.
> > > 
> > > Figure [3] is difficult to interpret because it doesn't include the
> > > transfer lengths. I can't tell what was going on during the 37 + 50
> > > POLLs before the first OUT transfer. If that was a CDB starting a new
> > 
> > yeah, sorry about that. The 37 + 50 POLLs out is a CBW (31-byte). Before those
> > we had a CSW.
> 
> Okay. I don't know why there was such a long delay -- more than 1 ms
> since there are 11 SOF packets. A context switch doesn't take that
> long. In any case, the kthread submits the usb_request for the next
> CBW without waiting for the previous CSW to complete (although it does
> have to wait for the IN transfer preceding the CSW to complete if
> you're using only two I/O buffers).
> 
> Note that in Figure 1, there was no delay between the CSW and the
> following CBW.
> 
> How does the throughput change if you increase the num_buffers module
> parameter?

I was using 32, IIRC. I'll try again just to make sure.

> > > > On figure two we can see that on this particular session, I had as much as 15%
> > > > of the bandwidth wasted on POLLs. With this current setup I'm 34MB/sec and with
> > > > the added 15% that would get really close to 40MB/sec.
> > > 
> > > So high speed, right? Are the numbers in the figure handshake _counts_
> > 
> > counts
> > 
> > > or handshake _times_? A simple NAK doesn't use much bandwidth. Even
> > > if 15% of the handshakes are NAKs, it doesn't mean you're wasting 15%
> > > of the bandwidth.
> > 
> > sure it means. Given a uFrame, I can stuff (theoretically) 13 bulk transactions
> > in it.
> 
> 13 512-byte bulk transactions.

I read it as 13 up to 512-byte bulk transactions, no ? I mean, we can
stuff 26 256-byte bulk transactions :-) It's either a full packet or a
short-packet which terminates the transfer.

> > If I have a token (IN/OUT) which gets a NAK, that's one less transaction
> > I can stuff in this uFrame, right ?
> 
> No, because a NAKed IN transaction doesn't transfer 512 bytes.
> There's the IN token and the NAK handshake, but no DATAx packet. You
> can fit several of those in the same uframe with 13 512-byte transfers.
> In fact, the second-to-last line in Table 5-10 of the USB-2 spec shows
> that with 13 512-byte transfers, there still are 129 bytes of bus time
> available in a uframe.
> 
> A NAKed bulk-OUT transfer would indeed uselessly transfer 512 data
> bytes. But they (almost) never occur in high-speed connections;
> instead the controllers exchange PING and NYET packets.

right.
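(For reference, a quick back-of-the-envelope sketch from the numbers
above, assuming 8000 uframes per second at high speed:

	13 transactions * 512 bytes * 8000 uframes/s ~= 53.2 MB/sec

so the 34 MB/sec I'm seeing is only about 64% of the theoretical bulk
ceiling; whatever the NAKs actually cost, there is headroom left.)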
> > Note that when I have the 37 POLLs, you can see 8 SOFs. This is just the sniffer
> > SW trying to aggregate SOFs (I should drop that aggregation, it's really annoying).
> 
> Does the software let you do that? Last time I checked the TotalPhase
> Control Center program, there were some things it would not do at all.
> Separating out all the components of a split transaction, for example.

I'll check. There's a way to tell it you want sequential sniffing, and
never aggregate.

> > > > So the question is, why do we have to wait for that kthread to get scheduled ?
> > > > Why couldn't we skip it completely ? Is there really anything left in there that
> > > > couldn't be done from within usb_request->complete() itself ?
> > > 
> > > The real answer is the calls to vfs_read() and vfs_write() -- those
> > > have to occur in process context.
> > 
> > would a threaded IRQ handler be enough ? For the sake of argument, let's assume
> > that all UDC drivers use threaded irqs and they are properly masking their IRQs
> > in the top half for the bottom half (the IRQ thread) to run.
> > 
> > This means, unless I'm missing something, that we could switch a chunk of the
> > gadget framework to mutexes instead of spin locks (not all of it though, but
> > let's ignore that side for now). In that case, we could safely remove spin locks
> > from ->complete() and use mutexes and then we wouldn't need this extra
> > kthread at all, right ?
> 
> Well, for sure you wouldn't need the kthread. It's not clear whether
> this would be an advantage, though. Now the CPU would have to schedule
> the bottom-half IRQ thread instead of scheduling the kthread, so the
> total number of context switches would be the same.

we're already using threaded IRQs in some UDC drivers.

> In fact, if you're using an SMP system then moving to threaded IRQs is
> likely to make things worse. A bottom-half handler can run on only one
> CPU at a time, whereas in the current code the entire IRQ handler can
> run in parallel with the kthread on a different CPU. In other words,
> moving to threaded IRQs would serialize all the bottom-half processing
> in the UDC driver with the work done in the gadget's kthread.

aaaaa, good point :-)

Guess I'll continue profiling to figure out why we take so long to get
scheduled, but chances are there won't be much we can do, apparently.

-- 
balbi