On Mon, Sep 22, 2008 at 12:29:18PM -0700, Jesse Barnes wrote:
> Matthew was kind enough to set up a BoF for those of us interested in PCI and
> MSI issues at this year's LPC. We went over several issues there: MSI, PCI
> hotplug, PCIe virtualization, VGA arbitration and PCI address space
> management.

Jesse, thanks for summarizing and posting this. Let me use this as an
opportunity to write up the MSI API proposal I promised.

> Issue:
>   MSI API improvements. Need ways to allocate MSIs at runtime, and perhaps
>   on a per-CPU level and get affinity information. And of course we need
>   a way to get more than one legacy MSI allocated.
> Owner:
>   Matthew is handling the first pass of this (the more than one legacy MSI
>   addition). I think we'll need some detailed requirements from the driver
>   guys to make further improvements (e.g. per-CPU or affinity stuff).

Being one of the "driver guys", let me add my thoughts. For the following
discussion I think we can treat MSI and MSI-X the same, so I'll just say
"MSI".

The issue is smp_affinity and how drivers want to bind MSIs to specific
CPUs, based on topology/architecture, for optimal performance. "Queue pair"
means a command/completion queue pair; "multiple queues" means more than
one such pair.

The problem is that multi-queue capable devices want to bind MSIs to
specific queues, and how those queues are bound to each MSI depends on how
the device uses the queues. I can think of three cases:

1) 1:1 mapping between queue pairs and MSIs.
2) 1:N mapping of one MSI to multiple queues - e.g. different classes of
   service.
3) N:1 mapping of multiple MSIs to one queue pair - e.g. different event
   types (error vs. good status).

"Classes of service" could be 1:N or N:1. The "event types" case would
typically be one command queue with multiple completion queues and one MSI
per completion queue.

Dave Miller (and others) have clearly stated they don't want to see CPU
affinity handled in the device drivers and want irqbalanced to handle
interrupt distribution. The problem with this is that irqbalanced needs to
know how each device driver binds its multiple MSIs to its queues. Some
devices would prefer several MSIs go to the same processor, while others
want each MSI bound to a different "node" (NUMA). Without any additional
API, this means the device driver authors have to update irqbalanced for
every device it supports. We thought pci_ids.h was a PITA... that would be
trivial compared to maintaining this.

Initially, at the BoF, I proposed "pci_enable_msix_for_nodes()" to spread
MSIs across multiple NUMA nodes by default. CPU cores which share a cache
were my definition of a "NUMA node" for the purpose of this discussion, but
each arch would have to define that. The device driver would also need an
API to map each "node" to a queue pair. In retrospect, I think this API
would only work well for smaller systems and simple 1:1 MSI/queue mappings,
and we'd still have to teach irqbalanced not to touch the MSIs which are
already "optimally" allocated.

A second solution I thought of later would be for the device driver to
export (via sysfs?) to irqbalanced which MSIs the driver instance owns and
how many "domains" those MSIs can serve. irqbalanced could then write the
MSI-to-domain mapping back through the same (sysfs?) interface and update
the smp_affinity mask for each of those MSIs. The driver could then quickly
look up the reverse map of CPUs to "domains": when a process attempts to
start an IO, the driver wants to know which queue pair the IO should be
placed on so the completion event will be handled in the same "domain".
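To make that second proposal a little more concrete, here is a rough sketch
of what the driver side of the reverse map might look like. This is purely
illustrative: foo_adapter, foo_queue_pair, cpu_to_domain[] and both functions
are made-up names, and it assumes the affinity masks irqbalanced writes back
are made visible to the driver somehow; only the generic kernel helpers
(for_each_cpu(), get_cpu()/put_cpu(), NR_CPUS) are existing interfaces.

  /*
   * Hypothetical sketch only: none of the foo_* names exist anywhere,
   * they just illustrate the CPU -> "domain" reverse map idea above.
   */
  #include <linux/types.h>
  #include <linux/cpumask.h>
  #include <linux/smp.h>

  #define FOO_MAX_DOMAINS	8

  struct foo_queue_pair;		/* one command/completion queue pair */

  struct foo_adapter {
  	unsigned int nr_domains;	/* how many MSI "domains" we exported */
  	u8 cpu_to_domain[NR_CPUS];	/* reverse map, rebuilt whenever
  					   irqbalanced updates smp_affinity */
  	struct foo_queue_pair *qp[FOO_MAX_DOMAINS];
  };

  /* Rebuild the reverse map for one domain from the affinity mask that
   * irqbalanced assigned to that domain's MSI. */
  static void foo_map_domain_cpus(struct foo_adapter *adap,
  				const struct cpumask *mask,
  				unsigned int domain)
  {
  	unsigned int cpu;

  	for_each_cpu(cpu, mask)
  		adap->cpu_to_domain[cpu] = domain;
  }

  /* IO submission path: pick the queue pair whose completion MSI is bound
   * to the CPU we are currently running on, so submission and completion
   * share a warm cache. */
  static struct foo_queue_pair *foo_select_qp(struct foo_adapter *adap)
  {
  	unsigned int cpu = get_cpu();
  	struct foo_queue_pair *qp = adap->qp[adap->cpu_to_domain[cpu]];

  	put_cpu();
  	return qp;
  }

foo_map_domain_cpus() would run once per MSI whenever its affinity changes,
and foo_select_qp() is what the submission path would call per IO.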
The result is that IOs could start and complete on the same (now warm) CPU
cache with minimal spinlock bouncing. I'm not clear on the details right
now, but I believe this would allow irqbalanced to manage IRQs in an optimal
way without having device-specific code in it.

Unfortunately, I'm not in a position to propose patches due to current
work/family commitments. It would be fun to work on. *sigh*

I suspect the same thing could be implemented without irqbalanced, since I
believe process management knows about the same NUMA attributes we care
about here... maybe it's time for PM to start dealing with interrupt
"scheduling" (kthreads, like the RT folks want?) as well? Ok, maybe I
should stop before my asbestos underwear aren't sufficient. :)

hth,
grant