Paul, can you tell your email client to wrap lines after 72 columns or
so? It would make replying a lot easier...

On Fri, 16 Oct 2015, Paul Jones wrote:

> >> The slow queuing of transfers seems to be due to waiting in a spin lock on ep->dev->lock (4+ us).
> >> I'm guessing this is caused by the fact that there sometimes seems to be significant time between when the spin lock is acquired in net2280_queue and when it is released.
> >> Any idea why that may be?
> >
> > There shouldn't be any contention on that spinlock -- at least, not
> > until the driver's interrupt handler needs to run. Everything else in
> > the mass-storage gadget is single-threaded. I don't know why it should
> > take so long to acquire the spinlock when nothing else is happening,
> > unless once again you're seeing interrupts from other devices.

> I am talking about the net2280 lock and not the mass_storage lock.

So was I.

> > One thing you did not show here was the delay between sending the last
> > data buffer of one command and sending the first data buffer of the
> > next command. That overhead could significantly affect the overall
> > throughput.

> The next read command is generally handled within 20us after the previous one.

> Switched over to ftrace to include the net2280 irq/dma handling, logging of the spin locks and thread handling in bulk_in_complete:

Why are you still using 8 buffers? I think 2 buffers (the default)
would give the same performance. It's worth checking.

> ----- FIRST REQUEST-----
> file-storage-1333 [002] .... 59.898705: do_read: READ: 122880 @ 0
> file-storage-1333 [002] .... 59.898706: do_read: file read 16384 @ 0
> file-storage-1333 [002] .... 59.898709: do_read: file read 16384 @ 0 -> 16384
> file-storage-1333 [002] .... 59.898709: start_transfer: bulk-in start
> file-storage-1333 [002] .... 59.898710: net2280_queue: received request
> file-storage-1333 [002] d... 59.898711: net2280_queue: starting request
> ----- FIRST DMA ------
> file-storage-1333 [002] d... 59.898712: start_dma: start
> file-storage-1333 [002] d... 59.898713: start_dma: start queue
> file-storage-1333 [002] d... 59.898717: start_dma: end
> file-storage-1333 [002] d... 59.898717: net2280_queue: finished
> file-storage-1333 [002] .... 59.898717: start_transfer: bulk-in queued ...
> ----- FIRST IRQ -----
> <idle>-0 [002] d.h. 59.898828: net2280_irq: start ...
> ----- SECOND IRQ -----
> <idle>-0 [002] d.h. 59.898945: net2280_irq: start ...
> ----- SECOND REQUEST -----
> file-storage-1333 [002] .... 59.899702: do_read: READ: 8192 @ 122880
> file-storage-1333 [002] .... 59.899702: do_read: file read 8192 @ 122880
> file-storage-1333 [002] .... 59.899703: do_read: file read 8192 @ 122880 -> 8192
> file-storage-1333 [002] .... 59.899704: start_transfer: bulk-in start
> file-storage-1333 [002] .... 59.899704: net2280_queue: received request
> file-storage-1333 [002] d... 59.899704: net2280_queue: starting request
> file-storage-1333 [002] d... 59.899706: start_dma: start
> file-storage-1333 [002] d... 59.899707: start_dma: start queue
> file-storage-1333 [002] d... 59.899710: start_dma: end
> file-storage-1333 [002] d... 59.899710: net2280_queue: finished
> file-storage-1333 [002] .... 59.899710: start_transfer: bulk-in queued

Why did you cut the log short? The figure I wanted to see here was the
time from the IRQ preceding the 8th bulk-in completion to the IRQ
preceding the 10th bulk-in completion, which hasn't occurred yet.

> First DMA to first IRQ = 116us.
> Second DMA to second IRQ = 107us.
> These seem fairly stable as the last request in a 500+MB transfer had exactly the same timings.
>
> Each IRQ is taking around 14us to handle, during which most of the time the ep->dev->lock is locked (only released in the done function whilst calling the callback).
> Setting up DMA is taking around 5us to handle.

It looks like there's an overhead of about 10 us every 115 or so, for
interrupt and DMA processing. That's substantial, and it's possible the
driver could be changed to remove some of this overhead, by starting
each DMA before the previous one finished.
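To illustrate the idea, here is a rough sketch of what "start the next
DMA before reporting the previous completion" could look like. This is
not the actual net2280 code: my_ep, my_request, and start_dma() stand
in for the driver's real endpoint/request structures and DMA-start
routine, and a real implementation would still have to drop
ep->dev->lock around the giveback, just as the driver does today.

#include <linux/list.h>
#include <linux/usb/gadget.h>

struct my_request {
	struct usb_request	req;
	struct list_head	queue;
};

struct my_ep {
	struct usb_ep		ep;
	struct list_head	queue;		/* pending my_requests */
};

/* Stand-in for the driver's real DMA-start routine (~5 us of setup). */
static void start_dma(struct my_ep *ep, struct my_request *req);

/* Called from the interrupt handler when a DMA transfer completes. */
static void dma_complete(struct my_ep *ep)
{
	struct my_request *done = list_first_entry(&ep->queue,
					struct my_request, queue);

	list_del_init(&done->queue);

	/*
	 * Kick off the next queued transfer first, so its ~5 us of DMA
	 * setup overlaps with the gadget's completion processing instead
	 * of adding to the gap between transfers.
	 */
	if (!list_empty(&ep->queue))
		start_dma(ep, list_first_entry(&ep->queue,
					struct my_request, queue));

	/* Only now report the finished request to the gadget driver. */
	done->req.status = 0;
	done->req.actual = done->req.length;
	usb_gadget_giveback_request(&ep->ep, &done->req);
}

Whether that is worthwhile depends on how much of the setup can safely
overlap with the transfer still in flight, but it shows where the
saving would come from.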
> Measured end-to-end performance: 135MB/s (145MB/s with tweaked max_sector).
> That's not bad given that the DMA is only managing 153MB/s.

How do you arrive at that number? 16384 B / 107 us?

I'm not sure why you refer to that as the DMA performance. In fact it
is the overall performance for the USB data transfer, which includes
driver delays, DMA delays, bus transfer delays, and controller delays.

Most of the remaining overhead (i.e., the difference between 135 MB/s
and 153 MB/s) comes from the MSC BOT protocol, and as you found, it can
be reduced by increasing max_sectors.

Even if the DMA setup delay were eliminated, it would increase the
transfer rate by no more than 10%, to 168 MB/s or less. That's still
not a very impressive fraction of the total bus capacity of 500 MB/s,
and the difference can be attributed almost entirely to things that are
out of the software's control.

Alan Stern
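For reference, a quick back-of-the-envelope check of the figures above.
The 16384-byte buffer size, the 107 us DMA-to-IRQ interval, the ~5 us
DMA setup time, and the 500 MB/s bus capacity are the numbers quoted in
this thread; the program below only does the arithmetic, it measures
nothing.

#include <stdio.h>

int main(void)
{
	const double buf_bytes   = 16384.0;  /* bytes per bulk-in buffer    */
	const double interval_us = 107.0;    /* DMA start to completion IRQ */
	const double setup_us    = 5.0;      /* per-buffer DMA setup time   */
	const double bus_bps     = 500e6;    /* SuperSpeed wire capacity    */

	double rate     = buf_bytes / (interval_us * 1e-6);
	double no_setup = buf_bytes / ((interval_us - setup_us) * 1e-6);

	printf("observed USB transfer rate      : %3.0f MB/s\n", rate / 1e6);
	printf("with the DMA setup time removed : %3.0f MB/s (+%.0f%%)\n",
	       no_setup / 1e6, 100.0 * (no_setup / rate - 1.0));
	printf("10%% upper bound cited above     : %3.0f MB/s\n",
	       1.10 * rate / 1e6);
	printf("fraction of the 500 MB/s bus    : %3.0f%%\n",
	       100.0 * rate / bus_bps);
	return 0;
}

This prints roughly 153 MB/s observed, about 161 MB/s (+5%) with the
DMA setup time removed, a 168 MB/s ceiling for the 10% estimate, and
about 31% of the bus capacity in use.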