On Mon, Dec 06, 2021 at 04:47:58PM +0100, Thomas Gleixner wrote: > >> - The irqchip callbacks which can be implemented by these top > >> level domains are going to be restricted. > > > > OK - I think it is great that the driver will see a special ops struct > > that is 'ops for device's MSI addr/data pair storage'. It makes it > > really clear what it is > > It will need some more than that, e.g. mask/unmask and as we discussed > quite some time ago something like the irq_buslock/unlock pair, so you > can handle updates to the state from thread context via a command queue > (IIRC). Yes, I was thinking about all of that in here. Let me ask a slightly different question pci_msi_create_irq_domain() hooks into the platforms irq_chip as an alternative to hierarchical irq domains (?) eg: chip->irq_write_msi_msg = pci_msi_domain_write_msg; And we see: void pci_msi_domain_write_msg(struct irq_data *irq_data, struct msi_msg *msg) { struct msi_desc *desc = irq_data_get_msi_desc(irq_data); Now, if we have your idea to have some: struct msi_storage_ops { write_msg mask; unmask; lock; unlock; }; Which is how a driver plugs in its storage operations. In almost all cases 'ops' will come along with a 'state', so lets create one: struct msi_storage { // Look, I avoided the word table! const struct msi_storage_ops *ops }; ie: struct msi_storage_ops { void (*write_msg)(struct msi_storage *msi, struct msi_desc *desc, struct msi_msg *msg); Now, what if we made a 'generic_msi_create_irq_domain()' that hooks the irq_chip with something like this: void generic_msi_domain_write_msg(struct irq_data *irq_data, struct msi_msg *msg) { struct msi_desc *desc = irq_data_get_msi_desc(irq_data); struct msi_storage *msi = desc->msi; msi->ops->write_msg(msi, desc, msg); } And then have what pci_msi_domain_write_msg() did now accomplished by having it set desc->storage to a pci_msi_storage or pci_msix_storage with those msi_storage_ops pointing at pci_msi_domain_write_msg/etc Then we can transform: void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg) { struct pci_dev *dev = msi_desc_to_pci_dev(entry); into struct pci_msi_storage { struct msi_storage storage; struct pci_dev *dev; unsigned int msi_cap_off; }; void pci_write_msi64_msg(struct msi_storage *storage, struct msi_desc *desc, struct msi_msg *msg) { struct pci_msi_storage *msi = container_of(storage, struct pci_msi_storage, storage); unsigned int pos = storage->msi_cap_off; struct pci_dev *dev = msi->dev; pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl); msgctl &= ~PCI_MSI_FLAGS_QSIZE; msgctl |= desc->msi_attrib.multiple << 4; pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl); pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO, msg->address_lo); pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI, msg->address_hi); pci_write_config_word(dev, pos + PCI_MSI_DATA_64, msg->data); /* Ensure that the writes are visible in the device */ pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl); desc->msg = *msg; // in core code instead ? } And others for the different cases. Look no ifs! OK? Now, we have some duplication between the struct msi_storage_ops and the struct irq_chip. Let's see what that is about: arch/x86/kernel/apic/msi.c: .irq_write_msi_msg = dmar_msi_write_msg, arch/x86/kernel/hpet.c: .irq_write_msi_msg = hpet_msi_write_msg, Surprised! These are actually IMS. The HPET and DMAR devices both have device-specific message storage! So these could use msi_storage_ops. And WTF is IOMMU DMAR driver code doing in apic/msi.c ??? arch/powerpc/platforms/pseries/msi.c: .irq_write_msi_msg = pseries_msi_write_msg, AFAICT this is really like virtualization? The hypervisor is controlling the real MSI table and what the OS sees is faked out somewhat. This is more of quirk in the PCI MSI implementation (do not touch the storage) and a block on non-PCI uses of MSI similar to what x86 needs? drivers/irqchip/irq-gic-v2m.c: .irq_write_msi_msg = pci_msi_domain_write_msg, drivers/irqchip/irq-gic-v3-its-pci-msi.c: .irq_write_msi_msg = pci_msi_domain_write_msg, drivers/irqchip/irq-gic-v3-mbi.c: .irq_write_msi_msg = pci_msi_domain_write_msg, ARM seems to be replacing the 'mask at source' with 'mask at destination' - I wonder why? Should this really be hierarchical where we mask *both* the MSI originating device (storage_ops->mask) and at the CPU IRQ controller? (gicv2m_mask_msi_irq ?) if it can? drivers/base/platform-msi.c: chip->irq_write_msi_msg = platform_msi_write_msg; Oh! this is doing what I kind of just suggested, just non-generically and hacked into platform bus drivers the same as PCI does: static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg) { struct msi_desc *desc = irq_data_get_msi_desc(data); struct platform_msi_priv_data *priv_data; priv_data = desc->platform.msi_priv_data; priv_data->write_msg(desc, msg); } platform_msi entirely gets deleted. Instead all platform drivers using it will use IMS - set up a msi_storage_ops with storage_ops->write_msg == platform_msi_priv_data::write_msg and allocate a msi_storage someplace. So, at least at this first order, we could have world where the irq_chip does not overlap struct msi_storage_ops - ie we move the MSI related things from irq_chip to msi_storage_ops. Then, all the core code places calling into chip->[msi] would instead find the msi_storage_op from the irq_data. ie like (or better) irq_data->desc->storage->ops Now we have a real clear split of responsibility. The irq_domain hierarchy and irq_chip is all about classic wired interrupts and interrupt controller path after the device launches the message. For MSI it's job is to determine the addr/data. msi_storage_ops is all about interrupt origin points that can hold an addr/data pair. It's job is to store the addr/data into the device specific registers, do mask-at-source, locking etc. PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip Then you get what you asked at the start, the platform's irq_domain is now fully generic and can be 1:N to different MSI storage providers. For API compat every pci struct device will have to instantiate a msi_storage someplace, but that seems easy enough. Seems like a nice uniform solution? Jason