Jonathan!

On Fri, Aug 23 2024 at 12:05, Jonathan Cameron wrote:
> On Thu, 6 Jun 2024 15:21:02 +0200 Lukas Wunner <lukas@xxxxxxxxx> wrote:
>> On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote:
>> > Or are you thinking we can make something like the following work
>> > (even when we can't do dynamic msix)?
>> >
>> > Core bring up facilities and interrupt etc. That includes bus master
>> > so msi/msix can be issued (which makes me nervous - putting aside other
>> > concerns as IIRC there are drivers where you have to be careful to
>> > tweak registers to ensure you don't get a storm of traffic when you
>> > hit bus master enable.

Correct. You have to be really careful about this.

>> > Specific driver may bind later - everything keeps running until
>> > the specific driver calls pci_alloc_irq_vectors(), then we have to
>> > disable all interrupts for core managed services, change the number
>> > of vectors and then move them to wherever they end up before
>> > starting them again. We have to do the same in the unwind path.
>>
>> My recollection is that Thomas Gleixner has brought up dynamic MSI-X
>> allocation. So if both the PCI core services as well as the driver
>> use MSI-X, there's no problem.

Correct. If the core sets up N interrupts for core usage, then devices
can later on allocate MSI-X vectors on top, if the underlying interrupt
domain hierarchy supports it. But see below.

>> For MSI, one approach might be to reserve a certain number of vectors
>> in the core for later use by the driver.

MSI is a problem in several aspects.

   1) You really have to know how many vectors you need at the end, as
      you need to disable MSI in order to change the number of vectors
      in the config space. So if you do that after interrupts are
      already in use, this can cause lost interrupts and stale devices.

      When MSI is disabled, the interrupt might be routed to the legacy
      INTx, and aside from a possible spurious interrupt message there
      is no trace of it.
   2) Allocating N vectors upfront, using only a small portion for the
      core and keeping the rest handy for drivers, is problematic when
      the MSI implementation does not provide masking of the individual
      MSI interrupts. Driver probing might then result in an interrupt
      storm for obvious reasons.

      Aside from that, if the core allocates N interrupts, then all the
      resources required (interrupt descriptors ... CPU vectors) are
      allocated right away. There is no reasonable way to do that post
      factum like we can with MSI-X. Any attempt to hack that into
      submission is futile and I have zero interest in even looking at
      it.

The shortcomings of MSI have been known for two decades, and it's
sufficiently documented that it's a horror show for software to deal
with. Though the 'save 5 gates and deal with it in software' camp still
has not got the memo.

TBH, I don't care at all. We can talk about MSI-X, but MSI is not going
to gain support for any of this. That's the only leverage we have to
educate people who refuse to accept reality.

> So, my first attempt at doing things in the core ran into what I think
> is probably a blocker. It's not entirely technical...
>
> +CC Thomas who can probably very quickly confirm if my reasoning is
> correct.
>
> If we move these services into the PCI core itself (as opposed to
> keeping a PCI portdrv layer), then we need to bring up MSI for a
> device before we bind a driver to that struct pci_dev / struct device.
>
> If we follow through
>
>   pci_alloc_irq_vectors_affinity()
>     -> __pci_enable_msix_range() / __pci_enable_msi_range()
>       -> pci_setup_msi_context()
>         -> msi_setup_device_data()
>           -> devres_add()
>
> // there is actually devres usage before this in msi_sysfs_create_group()
> // but this is a shorter path to illustrate the point.
>
> We will have registered a devres callback on the struct pci_dev
> that we later want to be able to bind a driver to.
> That driver won't bind - because lifetimes of devres will be garbage
> (see really_probe() for the specific check and the resulting
> "Resources present before probing")

That's because you are violating all layering rules. See below.

> So we in theory 'could' provide a non-devres path through
> the MSI code, but I'm going to guess that Thomas will not
> be particularly keen on that to support this single use case.

Correct.

> Thomas, can you confirm if this is something you'd consider
> or outright reject? Or is there an existing route to have
> MSI/X working pre driver bind and still be able to bind
> the driver later.

The initial enable has to set up at least one vector. That vector does
not have to be requested by anything, but that's it. After that you can
allocate with pci_msix_alloc_irq_at().

So looking at the ASCII art of the cover letter:

   _____________
  |    Port     |  creates    ___________________
  |   PCI Dev   |---------+->|  ________         |
  |_____________|         |  | | CPMU A |        |
  |portdrv binds|         |  | |________|        |
  |_____________|         |  | perf/cxlpmu binds |
                          |  |___________________|
                          |
                          |   ___________________
                          +->|  ________         |
                             | | CPMU B |        |
                             | |________|        |
                             | perf/cxlpmu binds |
                             |___________________|

  AND

   _____________
  |   Type 3    |  creates    ___________________
  |   PCI Dev   |----------->|  ________         |
  |_____________|            | | CPMU C |        |
  |cxl_pci binds|            | |________|        |
  |_____________|            | perf/cxpmu binds  |
                             |___________________|

If I understand it correctly, then both the portdrv and the cxl_pci
drivers create a "bus". The CPMU devices are attached to these buses.
So we are talking about distinctly different devices, with the twist
that these devices somehow need to utilize the MSI/X (forget MSI)
facility of the device which creates the bus.
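For illustration, a minimal sketch of that "enable one vector at core
setup, allocate dynamically at driver bind" flow. The PCI/MSI calls
(pci_alloc_irq_vectors(), pci_msix_can_alloc_dyn(),
pci_msix_alloc_irq_at()) are the existing kernel APIs; the two wrapper
functions are made up for this example and error unwinding is trimmed:

```
#include <linux/pci.h>
#include <linux/msi.h>

/* Core-side setup: the initial enable must set up at least one
 * MSI-X vector, which nothing has to request yet.
 */
static int core_setup_irqs(struct pci_dev *pdev)
{
	int ret;

	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);
	return ret < 0 ? ret : 0;
}

/* Driver-side: allocate a further vector dynamically, on top of the
 * initial enable. Only works if the underlying interrupt domain
 * hierarchy supports dynamic MSI-X allocation.
 */
static int driver_add_vector(struct pci_dev *pdev)
{
	struct msi_map map;

	if (!pci_msix_can_alloc_dyn(pdev))
		return -EOPNOTSUPP;

	map = pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, NULL);
	if (map.index < 0)
		return map.index;

	/* map.virq can now be handed to request_irq() as usual. */
	return map.virq;
}
```

The corresponding teardown would be pci_msix_free_irq() on the
driver side before the core releases the remaining vectors.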