On Fri, 6 Sep 2024 18:18:32 +0100 Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote: > On Fri, 06 Sep 2024 12:11:36 +0200 > Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote: > > > On Thu, Sep 05 2024 at 12:23, Jonathan Cameron wrote: > > >> Look at how the PCI device manages interrupts with the per device MSI > > >> mechanism. Forget doing this with either one of the legacy mechanisms. > > >> > > >> 1) It creates a hierarchical interrupt domain and gets the required > > >> resources from the provided parent domain. The PCI side does not > > >> care whether this is x86 or arm64 and it neither cares whether the > > >> parent domain does remapping or not. The only way it cares is about > > >> the features supported by the different parent domains (think > > >> multi-MSI). > > >> > > >> 2) Driver side allocations go through the per device domain > > >> > > >> That AUXbus is not any different. When the CPMU driver binds it wants to > > >> allocate interrupts. So instead of having a convoluted construct > > >> reaching into the parent PCI device, you simply can do: > > >> > > >> 1) Let the cxl_pci driver create a MSI parent domain and set that in > > >> the subdevice::msi::irqdomain pointer. > > >> > > >> 2) Provide cxl_aux_bus_create_irqdomain() which allows the CPMU device > > >> driver to create a per device interrupt domain. > > >> > > >> 3) Let the CPMU driver create its per device interrupt domain with > > >> the provided parent domain > > >> > > >> 4) Let the CPMU driver allocate its MSI-X interrupts through the per > > >> device domain > > >> > > >> Now on the cxl_pci side the AUXbus parent interrupt domain allocation > > >> function does: > > >> > > >> if (!pci_msix_enabled(pdev)) > > >> return pci_msix_enable_and_alloc_irq_at(pdev, ....); > > >> else > > >> return pci_msix_alloc_irq_at(pdev, ....); > > > > Ignore the above. Brainfart. > > > > > I'm struggling to follow this suggestion > > > Would you expect the cxl_pci MSI parent domain to set it's parent as > > > msi_domain = irq_domain_create_hierarchy(dev_get_msi_domain(&pdev->dev), > > > IRQ_DOMAIN_FLAG_MSI_PARENT, > > > ... > > > which seems to cause a problem with deadlocks in __irq_domain_alloc_irqs() > > > or create a separate domain structure and provide callbacks that reach into > > > the parent domain as necessary? > > > > > > Or do I have this entirely wrong? I'm struggling to relate what existing > > > code like PCI does to what I need to do here. > > > > dev_get_msi_domain(&pdev->dev) is a nightmare due to the 4 different > > models we have: > > > > - Legacy has no domain > > > > - Non-hierarchical domain > > > > - Hierarchical domain v1 > > > > That's the global PCI/MSI domain > > > > - Hierarchical domain v2 > > > > That's the underlying MSI_PARENT domain, which is on x86 > > either the interrupt remap unit or the vector domain. On arm64 > > it's the ITS domain. > > > > See e.g. pci_msi_domain_supports() which handles the mess. > > > > Now this proposal will only work on hierarchical domain v2 because that > > can do the post enable allocations on MSI-X. Let's put a potential > > solution for MSI aside for now to avoid complexity explosion. > > > > So there are two ways to solve this: > > > > 1) Implement the CXL MSI parent domain as disjunct domain > > > > 2) Teach the V2 per device MSI-X domain to be a parent domain > > > > #1 Looks pretty straight forward, but does not seemlessly fit the > > hierarchical domain model and creates horrible indirections. > > > > #2 Is the proper abstraction, but requires to teach the MSI core code > > about stacked MSI parent domains, which should not be horribly hard > > or maybe just works out of the box. > > > > The PCI device driver invokes the not yet existing function > > pci_enable_msi_parent_domain(pci_dev). This function does: > > > > A) validate that this PCI device has a V2 parent domain > > > > B) validate that the device has enabled MSI-X > > > > C) validate that the PCI/MSI-X domain supports dynamic MSI-X > > allocation > > > > D) if #A to #C are true, it enables the PCI device parent domain > > > > I made #B a prerequisite for now, as that's an orthogonal problem, which > > does not need to be solved upfront. Maybe it does not need to be solved > > at all because the underlying PCI driver always allocates a management > > interrupt before dealing with the underlying "bus", which is IMHO a > > reasonable expectation. At least it's a reasonable restriction for > > getting started. > > > > That function does: > > > > msix_dom = pci_msi_get_msix_domain(pdev); > > msix_dom->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT; > > msix_dom->msi_parent_ops = &pci_msix_parent_ops; > > > > When creating the CXL devices the CXL code invokes > > pci_msi_init_subdevice(pdev, &cxldev->device), which just does: > > > > dev_set_msi_domain(dev, pci_msi_get_msix_subdevice_parent(pdev)); > > > > That allows the CXL devices to create their own per device MSI > > domain via a new function pci_msi_create_subdevice_irq_domain(). > > > > That function can just use a variant of pci_create_device_domain() with > > a different domain template and a different irqchip, where the irqchip > > just redirects to the underlying parent domain chip, aka PCI-MSI-X. > > > > I don't think the CXL device will require a special chip as all they > > should need to know is the linux interrupt number. If there are special > > requirements then we can bring the IMS code back or do something similar > > to the platform MSI code. > > > > Then you need pci_subdev_msix_alloc_at() and pci_subdev_msix_free() > > which are the only two functions which the CXL (or similar) need. > > > > The existing pci_msi*() API function might need a safety check so they > > can't be abused by subdevices, but that's a problem for a final > > solution. > > > > That's pretty much all what it takes. Hope that helps. > > Absolutely! Thanks for providing the detailed suggestion > this definitely smells a lot less nasty than previous approach. > > I have things sort of working now but it's ugly code with a few > cross layer hacks that need tidying up (around programming the > msi registers from wrong 'device'), so may be a little > while before I get it in a state to post. Hi Thomas, My first solution had some callbacks where we created a local descriptor so that I could swap in the pci_dev->dev and just use the various standard pci/msi/irqdomain.c functions. The 'minimal' solution seems to be a bit ugly. ` There are quite a few places that make the assumption that the preirq_data->parent->chip is the right chip to for example call irq_set_affinity on. So the simple way to make it all work is to just have a msi_domain_template->msi_domain_ops->prepare_desc that sets the desc->dev = to the parent device (here the pci_dev->dev) At that point everything more or less just works and all the rest of the callbacks can use generic forms. Alternative might be to make calls like the one in arch/x86/kernel/apic/msi.c msi_set_affinity search for the first ancestor device that has an irq_set_affinity(). That unfortunately may affect quite a few places. Anyhow, I'll probably send out the prepare_desc hack set with driver usage etc after I've written up a proper description of problems encountered etc so you can see what it all looks like and will be more palatable in general but in the meantime this is the guts of it of the variant where the subdev related desc has the dev set to the parent device. Note for the avoidance of doubt, I don't much like this solution but maybe it will grow on me ;) Jonathan