On Tuesday 27 January 2009, Sarah Sharp wrote:
>
> > I like the current model, whereby URBs deal with only a single
> > contiguous DMA buffer.  (Possibly one that's made contiguous
> > through an IOMMU coalescing pages.)  Having a uniform model is
> > a big win ... even with the exception whereby ISO transfers
> > split that buffer into discrete chunks.  So I'd rather keep to
> > the model whereby scatterlists are mapped to URBs outside the
> > sight of HCDs.
>
> The problem is that I saw significant performance improvement with USB
> 3.0 prototypes when I pushed the scatter gather list down to the xHCI
> HCD.  The xHCI data structures are just set up in such a way that
> queuing a list of scatter gather entries is just natural.

That's a discussion we can have more productively when everyone can
see what those xHCI data structures are.  ;)

Are they really that different from EHCI or OHCI?  They support queues
too.  The generic model is "queue" ... not scatterlist, which isn't
used much outside the block layer.

> The performance increase might have been due to how the device was set up
> to do PCI DMA; it might have been due to something else.  I can't know
> until I run both sets of patches (bulk TX with and without scatter
> gather list push down) on multiple host controllers and multiple USB 3.0
> devices.
>
> Inaky was saying that he would love to see scatter gather lists pushed
> down to the HCDs for wireless USB.  The USB core forces the scatter
> gather list from a driver into one buffer,

No, that's the DMA mapping which *MIGHT* do that, on platforms with an
IOMMU.  Typically each scatterlist entry will be a page or two.  An
IOMMU can turn a dozen such entries into something that's virtually
contiguous in DMA-space.

There will still be N buffers in a scatterlist of length N ... but the
IOMMU might let it be treated more efficiently.  (As I recall, Intel
doesn't do much with IOMMUs, except maybe on server hardware.)

There are three levels of optimization in the current scatterlist code:

 - If an IOMMU is available, dma_map_sg() uses it to make the
   scatterlist shorter.  (Which means fewer DMA transfer descriptors,
   for hardware where that's relevant.)

 - Each remaining scatterlist entry is submitted asynchronously, so
   that the HCD receives a queue of transfers to stick in its DMA
   queue.  (On hardware that queues DMA transfers.)

 - Rather than requiring an IRQ after each scatterlist entry
   completes, HCDs are told they only need to interrupt on the
   last one.

So for example I've seen individual scatterlists of nearly a megabyte
get sent to EHCI, which works on them and then issues a single
completion IRQ.

> then the wHCI has to break
> that buffer apart again and insert more headers in between.

That would be a wHCI design issue, I'd think.  If it doesn't insert
headers automatically, then it's going to have lots of { header,
data-fragment } tuples ... which could be designed as fast, or not.
If there are DMA transfer descriptors, the worst case is needing
separate descriptors for header and then data fragment.

Network stacks traditionally avoid that by preallocating space in SKBs
for lower levels to add headers ... but USB takes arbitrary buffers;
so no SKBs, no header pre-allocation.

> If the
> upper layer could just submit a scatter gather list down to the HCD and
> not have the USB core combine it, that would save a lot of copies.

Wanting to do even *ONE* copy is a bad model, and will slow down your
I/O performance significantly.
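To make that three-level optimization concrete, here's a rough sketch
of the driver-side call; the helper name, endpoint number, and
scatterlist are purely illustrative, not taken from any particular
driver.  The point is that usb_sg_init()/usb_sg_wait() already takes a
scatterlist without copying it:  it does the dma_map_sg() mapping (so
an IOMMU can shorten the list), queues one URB per remaining entry,
and asks for a completion IRQ only on the last one.

	#include <linux/usb.h>
	#include <linux/scatterlist.h>

	/* Illustrative sketch:  hand usbcore an already-built
	 * scatterlist for a bulk OUT endpoint, with no copying.
	 */
	static int send_sglist(struct usb_device *udev, int epnum,
			struct scatterlist *sg, int nents, size_t length)
	{
		struct usb_sg_request io;
		int status;

		status = usb_sg_init(&io, udev,
				usb_sndbulkpipe(udev, epnum),
				0,	/* period: unused for bulk */
				sg, nents, length, GFP_KERNEL);
		if (status)
			return status;

		/* Blocks until the whole URB queue completes;
		 * the HCD sees one IRQ, on the last URB.
		 */
		usb_sg_wait(&io);
		return io.status;	/* zero on success */
	}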
The USB stack is set up to facilitate zerocopy I/O, at least so far as
the buffers provided to usbcore and HCDs by device drivers.  If those
drivers are smart, they won't copy data either ... that gets tricky
when the data comes straight from userspace, but it's doable.  Or,
worst/lazy case, a single copy.

- Dave