Paul, can you tell your email client to wrap lines after 72 columns or
so? It would make replying a lot easier...

On Fri, 16 Oct 2015, Paul Jones wrote:

> >> The slow queuing of transfers seems to be due to waiting in a spin lock on ep->dev->lock (4+ us).
> >> I'm guessing this is caused by the fact that there sometimes seems to be significant time between when the spin lock is acquired in net2280_queue and when it is released.
> >> Any idea why that may be?
> >
> > There shouldn't be any contention on that spinlock -- at least, not
> > until the driver's interrupt handler needs to run. Everything else in
> > the mass-storage gadget is single-threaded. I don't know why it should
> > take so long to acquire the spinlock when nothing else is happening,
> > unless once again you're seeing interrupts from other devices.

> I am talking about the net2280 lock and not the mass_storage lock.

So was I.

> > One thing you did not show here was the delay between sending the last
> > data buffer of one command and sending the first data buffer of the
> > next command. That overhead could significantly affect the overall
> > throughput.

> The next read command is generally handled within 20us after the previous one.

> Switched over to ftrace to include the net2280 irq/dma handling, logging of the spin locks and thread handling in bulk_in_complete:

Why are you still using 8 buffers? I think 2 buffers (the default)
would give the same performance. It's worth checking.

> ----- FIRST REQUEST-----
> file-storage-1333 [002] .... 59.898705: do_read: READ: 122880 @ 0
> file-storage-1333 [002] .... 59.898706: do_read: file read 16384 @ 0
> file-storage-1333 [002] .... 59.898709: do_read: file read 16384 @ 0 -> 16384
> file-storage-1333 [002] .... 59.898709: start_transfer: bulk-in start
> file-storage-1333 [002] .... 59.898710: net2280_queue: received request
> file-storage-1333 [002] d... 59.898711: net2280_queue: starting request
> ----- FIRST DMA ------
> file-storage-1333 [002] d... 59.898712: start_dma: start
> file-storage-1333 [002] d... 59.898713: start_dma: start queue
> file-storage-1333 [002] d... 59.898717: start_dma: end
> file-storage-1333 [002] d... 59.898717: net2280_queue: finished
> file-storage-1333 [002] .... 59.898717: start_transfer: bulk-in queued ...
> ----- FIRST IRQ -----
> <idle>-0 [002] d.h. 59.898828: net2280_irq: start ...
> ----- SECOND IRQ -----
> <idle>-0 [002] d.h. 59.898945: net2280_irq: start ...
> ----- SECOND REQUEST -----
> file-storage-1333 [002] .... 59.899702: do_read: READ: 8192 @ 122880
> file-storage-1333 [002] .... 59.899702: do_read: file read 8192 @ 122880
> file-storage-1333 [002] .... 59.899703: do_read: file read 8192 @ 122880 -> 8192
> file-storage-1333 [002] .... 59.899704: start_transfer: bulk-in start
> file-storage-1333 [002] .... 59.899704: net2280_queue: received request
> file-storage-1333 [002] d... 59.899704: net2280_queue: starting request
> file-storage-1333 [002] d... 59.899706: start_dma: start
> file-storage-1333 [002] d... 59.899707: start_dma: start queue
> file-storage-1333 [002] d... 59.899710: start_dma: end
> file-storage-1333 [002] d... 59.899710: net2280_queue: finished
> file-storage-1333 [002] .... 59.899710: start_transfer: bulk-in queued

Why did you cut the log short? The figure I wanted to see here was the
time from the IRQ preceding the 8th bulk-in completion to the IRQ
preceding the 10th bulk-in completion, which hasn't occurred yet.

> First DMA to first IRQ = 116us.
> Second DMA to second IRQ = 107us.
> These seem fairly stable as the last request in a 500+MB transfer had exactly the same timings.
>
> Each IRQ is taking around 14us to handle, during which most of the time the ep->dev->lock is locked (only released in the done function whilst calling the callback).
> Setting up DMA is taking around 5us to handle.

It looks like there's an overhead of about 10 us every 115 or so, for
interrupt and DMA processing. That's substantial, and it's possible the
driver could be changed to remove some of this overhead, by starting
each DMA before the previous one finished.
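To illustrate the idea, here is a rough sketch of what "start the next
DMA before reporting the previous completion" could look like. This is
not the actual net2280 code: my_ep, my_request, and start_dma() stand
in for the driver's real endpoint/request structures and DMA-start
routine, and a real implementation would still have to drop
ep->dev->lock around the giveback, just as the driver does today.

#include <linux/list.h>
#include <linux/usb/gadget.h>

struct my_request {
	struct usb_request	req;
	struct list_head	queue;
};

struct my_ep {
	struct usb_ep		ep;
	struct list_head	queue;		/* pending my_requests */
};

/* Stand-in for the driver's real DMA-start routine (~5 us of setup). */
static void start_dma(struct my_ep *ep, struct my_request *req);

/* Called from the interrupt handler when a DMA transfer completes. */
static void dma_complete(struct my_ep *ep)
{
	struct my_request *done = list_first_entry(&ep->queue,
					struct my_request, queue);

	list_del_init(&done->queue);

	/*
	 * Kick off the next queued transfer first, so its ~5 us of DMA
	 * setup overlaps with the gadget's completion processing instead
	 * of adding to the gap between transfers.
	 */
	if (!list_empty(&ep->queue))
		start_dma(ep, list_first_entry(&ep->queue,
					struct my_request, queue));

	/* Only now report the finished request to the gadget driver. */
	done->req.status = 0;
	done->req.actual = done->req.length;
	usb_gadget_giveback_request(&ep->ep, &done->req);
}

Whether that is worthwhile depends on how much of the setup can safely
overlap with the transfer still in flight, but it shows where the
saving would come from.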
> Measured end-to-end performance: 135MB/s (145MB/s with tweaked max_sector).
> That's not bad given that the DMA is only managing 153MB/s.

How do you arrive at that number? 16384 B / 107 us?

I'm not sure why you refer to that as the DMA performance. In fact it
is the overall performance for the USB data transfer, which includes
driver delays, DMA delays, bus transfer delays, and controller delays.

Most of the remaining overhead (i.e., the difference between 135 MB/s
and 153 MB/s) comes from the MSC BOT protocol, and as you found, it can
be reduced by increasing max_sectors.

Even if the DMA setup delay were eliminated, it would increase the
transfer rate by no more than 10%, to 168 MB/s or less. That's still
not a very impressive fraction of the total bus capacity of 500 MB/s,
and the difference can be attributed almost entirely to things that are
out of the software's control.

Alan Stern
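For reference, a quick back-of-the-envelope check of the figures above.
The 16384-byte buffer size, the 107 us DMA-to-IRQ interval, the ~5 us
DMA setup time, and the 500 MB/s bus capacity are the numbers quoted in
this thread; the program below only does the arithmetic, it measures
nothing.

#include <stdio.h>

int main(void)
{
	const double buf_bytes   = 16384.0;  /* bytes per bulk-in buffer    */
	const double interval_us = 107.0;    /* DMA start to completion IRQ */
	const double setup_us    = 5.0;      /* per-buffer DMA setup time   */
	const double bus_bps     = 500e6;    /* SuperSpeed wire capacity    */

	double rate     = buf_bytes / (interval_us * 1e-6);
	double no_setup = buf_bytes / ((interval_us - setup_us) * 1e-6);

	printf("observed USB transfer rate      : %3.0f MB/s\n", rate / 1e6);
	printf("with the DMA setup time removed : %3.0f MB/s (+%.0f%%)\n",
	       no_setup / 1e6, 100.0 * (no_setup / rate - 1.0));
	printf("10%% upper bound cited above     : %3.0f MB/s\n",
	       1.10 * rate / 1e6);
	printf("fraction of the 500 MB/s bus    : %3.0f%%\n",
	       100.0 * rate / bus_bps);
	return 0;
}

This prints roughly 153 MB/s observed, about 161 MB/s (+5%) with the
DMA setup time removed, a 168 MB/s ceiling for the 10% estimate, and
about 31% of the bus capacity in use.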