Jonathan!

On Fri, Aug 23 2024 at 12:05, Jonathan Cameron wrote:
> On Thu, 6 Jun 2024 15:21:02 +0200 Lukas Wunner <lukas@xxxxxxxxx> wrote:
>> On Thu, Jun 06, 2024 at 01:57:56PM +0100, Jonathan Cameron wrote:
>> > Or are you thinking we can make something like the following work
>> > (even when we can't do dynamic msix)?
>> >
>> > Core bring up facilities and interrupt etc. That includes bus master
>> > so msi/msix can be issued (which makes me nervous - putting aside other
>> > concerns as IIRC there are drivers where you have to be careful to
>> > tweak registers to ensure you don't get a storm of traffic when you
>> > hit bus master enable.

Correct. You have to be really careful about this.

>> > Specific driver may bind later - everything keeps running until
>> > the specific driver calls pci_alloc_irq_vectors(), then we have to
>> > disable all interrupts for core managed services, change the number
>> > of vectors and then move them to wherever they end up before
>> > starting them again. We have to do the same in the unwind path.
>>
>> My recollection is that Thomas Gleixner has brought up dynamic MSI-X
>> allocation. So if both the PCI core services as well as the driver
>> use MSI-X, there's no problem.

Correct. If the core sets up N interrupts for core usage, then devices
can later on allocate MSI-X vectors on top, if the underlying interrupt
domain hierarchy supports it. But see below.

>> For MSI, one approach might be to reserve a certain number of vectors
>> in the core for later use by the driver.

MSI is a problem in several aspects.

   1) You really have to know how many vectors you need at the end, as
      you need to disable MSI in order to change the number of vectors
      in the config space. So if you do that after interrupts are
      already in use, this can cause lost interrupts and stale devices.

      When MSI is disabled, the interrupt might be routed to the legacy
      INTx, and aside from a possible spurious interrupt message there
      is no trace of it.
   2) Allocating N vectors upfront, using only a small portion for the
      core and keeping the rest handy for drivers, is problematic when
      the MSI implementation does not provide masking of the individual
      MSI interrupts. Driver probing might then result in an interrupt
      storm for obvious reasons.

      Aside from that, if the core allocates N interrupts, then all the
      resources required (interrupt descriptors ... CPU vectors) are
      allocated right away. There is no reasonable way to do that post
      factum like we can with MSI-X. Any attempt to hack that into
      submission is futile and I have zero interest in even looking at
      it.

The shortcomings of MSI have been known for two decades, and it's
sufficiently documented that it's a horror show for software to deal
with. Though the 'save 5 gates and deal with it in software' camp still
has not got the memo.

TBH, I don't care at all. We can talk about MSI-X, but MSI is not going
to gain support for any of this. That's the only leverage we have to
educate people who refuse to accept reality.

> So, my first attempt at doing things in the core ran into what I think
> is probably a blocker. It's not entirely technical...
>
> +CC Thomas who can probably very quickly confirm if my reasoning is
> correct.
>
> If we move these services into the PCI core itself (as opposed to
> keeping a PCI portdrv layer), then we need to bring up MSI for a
> device before we bind a driver to that struct pci_dev / struct device.
>
> If we follow through
>
>   pci_alloc_irq_vectors_affinity()
>     -> __pci_enable_msix_range() / __pci_enable_msi_range()
>       -> pci_setup_msi_context()
>         -> msi_setup_device_data()
>           -> devres_add()
>
> // there is actually devres usage before this in msi_sysfs_create_group()
> // but this is a shorter path to illustrate the point.
>
> We will have registered a devres callback on the struct pci_dev
> that we later want to be able to bind a driver to.
> That driver won't bind - because lifetimes of devres will be garbage
> (see really_probe() for the specific check and the resulting
> "Resources present before probing")

That's because you are violating all layering rules. See below.

> So we in theory 'could' provide a non-devres path through
> the MSI code, but I'm going to guess that Thomas will not
> be particularly keen on that to support this single use case.

Correct.

> Thomas, can you confirm if this is something you'd consider
> or outright reject? Or is there an existing route to have
> MSI/X working pre driver bind and still be able to bind
> the driver later.

The initial enable has to set up at least one vector. That vector does
not have to be requested by anything, but that's it. After that you can
allocate with pci_msix_alloc_irq_at().

So looking at the ASCII art of the cover letter:

   _____________
  |    Port     |  creates    ___________________
  |   PCI Dev   |---------+->|  ________         |
  |_____________|         |  | | CPMU A |        |
  |portdrv binds|         |  | |________|        |
  |_____________|         |  | perf/cxlpmu binds |
                          |  |___________________|
                          |
                          |   ___________________
                          +->|  ________         |
                             | | CPMU B |        |
                             | |________|        |
                             | perf/cxlpmu binds |
                             |___________________|

  AND

   _____________
  |   Type 3    |  creates    ___________________
  |   PCI Dev   |----------->|  ________         |
  |_____________|            | | CPMU C |        |
  |cxl_pci binds|            | |________|        |
  |_____________|            | perf/cxpmu binds  |
                             |___________________|

If I understand it correctly, then both the portdrv and the cxl_pci
drivers create a "bus". The CPMU devices are attached to these buses.
So we are talking about distinctly different devices, with the twist
that these devices somehow need to utilize the MSI/X (forget MSI)
facility of the device which creates the bus.
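For illustration, a minimal sketch of that "enable one vector at core
setup, allocate dynamically at driver bind" flow. The PCI/MSI calls
(pci_alloc_irq_vectors(), pci_msix_can_alloc_dyn(),
pci_msix_alloc_irq_at()) are the existing kernel APIs; the two wrapper
functions are made up for this example and error unwinding is trimmed:

```
#include <linux/pci.h>
#include <linux/msi.h>

/* Core-side setup: the initial enable must set up at least one
 * MSI-X vector, which nothing has to request yet.
 */
static int core_setup_irqs(struct pci_dev *pdev)
{
	int ret;

	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);
	return ret < 0 ? ret : 0;
}

/* Driver-side: allocate a further vector dynamically, on top of the
 * initial enable. Only works if the underlying interrupt domain
 * hierarchy supports dynamic MSI-X allocation.
 */
static int driver_add_vector(struct pci_dev *pdev)
{
	struct msi_map map;

	if (!pci_msix_can_alloc_dyn(pdev))
		return -EOPNOTSUPP;

	map = pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, NULL);
	if (map.index < 0)
		return map.index;

	/* map.virq can now be handed to request_irq() as usual. */
	return map.virq;
}
```

The corresponding teardown would be pci_msix_free_irq() on the
driver side before the core releases the remaining vectors.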