"Dey, Megha" <megha.dey@xxxxxxxxx> writes: > On 8/11/2020 2:53 AM, Thomas Gleixner wrote: >>> And the annoying fact that you need XEN support which opens another can >>> of worms... > > hmm I am not sure why we need Xen support... are you referring to idxd > using xen? What about using IDXD when you are running on XEN? I might be missing something and IDXD/IMS is hypervisor only, but that still does not solve this problem on bare metal: >> x86 still does not associate the irq domain to devices at device >> discovery time, i.e. the device::msi_domain pointer is never >> populated. We can't do that right now due to the way how X86 PCI/MSI allocation works and being able to do so would make things consistent and way simpler even for your stuff. >> The right thing to do is to convert XEN MSI support over to proper irq >> domains. This allows to populate device::msi_domain which makes a lot of >> things simpler and also more consistent. > > do you think this cleanup is to be a precursor to my patches? I could > look into it but I am not familiar with the background of Xen > > and this stuff. Can you please provide further guidance on where to > look As I said: >> So to support this new fangled device MSI stuff we'd need yet more >> x86/xen specific arch_*msi_irqs() indirection and hackery, which is not >> going to happen. git grep arch_.*msi_irq arch/x86 This indirection prevents storing the irq_domain pointer in the device at probe/detection time. Native code already uses irq domains for PCI/MSI but we can't exploit the full potential because then pci_msi_setup_msi_irqs() would never end up in arch_setup_msi_irqs() which breaks XEN. I was reminded of that nastiness when I was looking at sensible ways to integrate this device MSI maze proper. >From a conceptual POV this stuff, which is not restricted to IDXD at all, looks like this: ]-------------------------------------------| PCI BUS -- | PCI device | ]-------------------| | | Physical function | | ]-------------------| | ]-------------------|----------| | | Control block for subdevices | | ]------------------------------| | | | <- "Subdevice BUS" | | | | | |-- Subddevice 0 | | |-- Subddevice 1 | | |-- ... | | |-- Subddevice N | ]-------------------------------------------| It does not matter whether this is IDXD with it's magic devices or a network card with a gazillion of queues. Conceptually we need to look at them as individual subdevices. And obviously the above picture gives you the topology. The physical function device belongs to PCI in all aspects including the MSI interrupt control. The control block is part of the PCI device as well and it even can have regular PCI/MSI interrupts for its own purposes. There might be devices where the Physical function device does not exist at all and the only true PCI functionality is the control block to manage subdevices. That does not matter and does not change the concept. Now the subdevices belong topology wise NOT to the PCI part. PCI is just the transport they utilize. And their irq domain is distinct from the PCI/MSI domain for reasons I explained before. So looking at it from a Linux perspective: pci-bus -> PCI device (managed by PCI/MSI domain) - PF device - CB device (hosts DEVMSI domain) | "Subdevice bus" | - subdevice | - subdevice | - subdevice Now you would assume that figuring out the irq domain which the DEVMSI domain serving the subdevices on the subdevice bus should take as parent is pretty trivial when looking at the topology, right? CB device's parent is PCI device and we know that PCI device MSI is handled by the PCI/MSI domain which is either system wide or per IR unit. So getting the relevant PCI/MSI irq domain is as simple as doing: pcimsi_domain = pcidevice->device->msi_domain; and then because we know that this is a hierarchy the parent domain of pcimsi_domain is the one which is the parent of our DEVMSI domain, i.e.: parent = pcmsi_domain->parent; Obvious, right? What's not so obvious is that pcidevice->device->msi_domain is not populated on x86 and trying to get the parent from there is a NULL pointer dereference which does not work well. So you surely can hack up some workaround for this, but that's just proliferating crap. We want this to be consistent and there is absolutely no reason why that network card with the MSI storage in the queue data should not work on any other architecture. We do the correct association already for IOMMU and whatever topological stuff is attached to (PCI) devices on probe/detection time so making it consistent for irq domains is just a logical consequence and matter of consistency. Back in the days when x86 was converted to hierarchical irq domains in order to support I/O APIC hotplug this workaround was accepted to make progress and it was meant as a transitional step. Of course after the goal was achieved nobody @Intel cared anymore and so far this did not cause big problems. But now it does and we really want to make this consistent first. And no we are not making an exception for IDXD either just because that's Intel only. Intel is not special and not exempt from cleaning stuff up before adding new features especially not when the stuff to cleanup is a leftover from Intel itself. IOW, we are not adding more crap on top of crap which should not exists anymore. It's not rocket science to fix this. All it needs is to let XEN create irq domains and populate them during init. On device detection/probe the proper domain needs to be determined which is trivial and then stored in device->msi_domain. That makes arch_.*_msi_irq() go away and a lot of code just simpler. Thanks, tglx