On Mon, Sep 19, 2011 at 11:47:15AM -0400, Neil Horman wrote:
> So a while back, I wanted to provide a way for irqbalance (and other apps) to
> definitively map irqs to devices, which, for msi[x] irqs, is currently not
> really possible in user space.  My first attempt went not so well:
> https://lkml.org/lkml/2011/4/21/308
>
> It was plagued by the same issues as prior attempts, namely that it
> violated the one-file-one-value sysfs rule.  I wandered off but have recently
> come back to this.  I've got a new implementation here that exports a new
> subdirectory for every pci device, called msi_irqs.  This subdirectory
> contains a variable number of numbered subdirectories, in which each number
> represents an msi irq.  Each numbered subdirectory contains attributes for
> that irq, which currently is only the mode it is operating in (msi vs. msix).
> I think this fits within the constraints sysfs requires, and will allow
> irqbalance to properly map msi irqs to devices without having to rely on
> rickety, best-guess methods like interface name matching.

This approach feels like building bigger rockets instead of a space
elevator :-)  What we need is to allow device drivers to ask for per-CPU
interrupts, and implement them in terms of MSI-X.  I've made a couple of
stabs at implementing this, but haven't got anything working yet.

It would solve a number of problems:

1. NUMA cacheline fetch.  At the moment, desc->istate gets modified by
   handle_edge_irq.  handle_percpu_irq doesn't need to worry about any of
   that stuff, so it doesn't touch desc->istate.  I've heard this is a
   significant problem for the high-speed networking people.

2. /proc/interrupts is unmanageable on large machines.  There are hundreds
   of interrupts and dozens of CPUs.  This would go a long way towards
   reducing the number of rows in the table (it doesn't do anything about
   the columns).
   ie instead of this:

    79:        0        0        0        0        0        0        0        0  PCI-MSI-edge  eth1
    80:        0        0  9275611        0        0        0        0        0  PCI-MSI-edge  eth1-TxRx-0
    81:        0        0  9275611        0        0        0        0        0  PCI-MSI-edge  eth1-TxRx-1
    82:        0        0        0        0  9275611        0        0        0  PCI-MSI-edge  eth1-TxRx-2
    83:        0        0        0        0  9275611        0        0        0  PCI-MSI-edge  eth1-TxRx-3
    84:        0        0        0        0        0  9275611        0        0  PCI-MSI-edge  eth1-TxRx-4
    85:        0        0        0        0        0  9275611        0        0  PCI-MSI-edge  eth1-TxRx-5
    86:        0        0        0        0        0        0  9275611        0  PCI-MSI-edge  eth1-TxRx-6
    87:        0        0        0        0        0        0  9275611        0  PCI-MSI-edge  eth1-TxRx-7

   We'd get this:

    79:        0        0        0        0        0        0        0        0  PCI-MSI-edge  eth1
    80:  9275611  9275611  9275611  9275611  9275611  9275611  9275611  9275611  PCI-MSI-edge  eth1-TxRx

3. /proc/irq/x/smp_affinity actually makes sense again.  It can be a mask
   of the CPUs on which the interrupt is active, instead of being a
   degenerate case in which only the lowest set bit is actually honoured.

4. Easier to manage for the device driver.  All it needs to do is call
   request_percpu_irq(...) instead of trying to figure out how many
   threads/cores/numa nodes/... there are in the machine, and how many
   other multi-interrupt devices there are, and thus how many interrupts
   it should allocate.  That can be left to the interrupt core, which at
   least has a chance of getting it right.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html