Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Sun, 05 Dec 2021 15:16:40 +0100

On Sat, Dec 04 2021 at 15:20, Thomas Gleixner wrote:
> On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> So I need to break that up in a way which caters for both cases, but
> does neither create a special case for PCI nor for the rest of the
> universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
> means all of it is common case. With that solved the storage question
> becomes a nobrainer.
>
> When I looked at it again yesterday while writing mail, I went back to
> my notes and found the loose ends where I left off. Let me go back and
> start over from there.

I found out why I stopped looking at it. I came from a similar point of
view what you were suggesting:

>> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
>> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
>> the driver asks for PCI MSI access we create a new hierarchical
>> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
>> you outlined for IMS?  (again, not saying to do this, but let's ask if
>> that makes more sense than the current configuration)

Which I shot down with:

> That's not really a good idea because dev->irqdomain is a generic
> mechanism and not restricted to the PCI use case. Special casing it for
> PCI is just wrong. Special casing it for all use cases just to please
> PCI is equally wrong. There is a world outside of PCI and x86. 

That argument is actually only partially correct.

After studying my notes and some more code (sigh), it looks feasible
under certain assumptions, constraints and consequences.

Assumptions:

  1) The irqdomain pointer of PCI devices which is set up during device
     discovery is not used by anything else than infrastructure which
     knows how to handle it.

     Of course there is no guarantee, but I'm not that horrified about
     it anymore after chasing the exposure with yet more coccinelle
     scripts.

Constraints:

  1) This is strictly opt-in and depends on hierarchical irqdomains.

     If an architecture/subarchitecture wants to support it then it
     needs to rework their PCI/MSI backend to hierarchical irqdomains or
     make their PCI/MSI irqdomain ready for the task.

     From my inspection of 30+ PCI/MSI irqdomains, most of them should
     be trivial to convert. The hard ones are powerpc, XEN and VMD,
     where XEN is definitely the most convoluted one.

     That means that devices which depend on IMS won't work on anything
     which is not up to date.

  2) Guest support is strictly opt-in

     The underlying architecture/subarchitecture specific irqdomain has
     to detect at setup time (eventually early boot), whether the
     underlying hypervisor supports it.

     The only reasonable way to support that is the availability of
     interrupt remapping via vIOMMU, as we discussed before.

  3) IOMMU/Interrupt remapping dependency

     While IMS just works without interrupt remapping on bare metal the
     fact that there is no reliable way to figure out whether the kernel
     runs on bare metal or not, makes it pretty much mandatory, at least
     on x86.

     That's not a hardcoded constraint. It's a decision made during the
     setup of the underlying architecture/subarchitecture specific
     irqdomain.

  4) The resulting irqdomain hierarchy would ideally look like this:

     VECTOR -> [IOMMU, ROUTING, ...] -> PCI/[MSI/MSI-X/IMS] domains

     That does not work in all cases due to architecture and host
     controller constraints, so we might end up with:

           VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains

     The nice thing about the irqdomain hierarchy concept is that this
     does not create any runtime special cases as the base hierarchy is
     established at boot or device detection time. It's just another
     layer of indirection.

  5) The design rules for the device specific IMS irqdomains have to be
     documented and enforced to the extent possible.

     Rules which I have in my notes as of today:

       - The device specific IMS irq chip / irqdomain has to be strictly
         separated from the rest of the driver code and can only
         interact via the irq chip data which is either per interrupt or
         per device.

         I have some ideas how to enforce these things to go into
         drivers/irqchip/ so they are exposed to scrutiny and not
         burried in some "my device is special" driver code and applied
         by subsystem maintainers before anyone can even look at it. 

       - The irqchip callbacks which can be implemented by these top
         level domains are going to be restricted.

       - For the irqchip callbacks which are allowed/required the rules
         vs. following down the hierarchy need to be defined and
         enforced.

       - To achieve that the registration interface will not be based on
         struct irq_chip. This will be a new representation and the core
         will convert that into a proper irq chip which fits into the
         hierarchy. This provides one central place where the hierarchy
         requirements can be handled as they depend on the underlying
         MSI domain (IOMMU, SHIM, etc.). Otherwise any change on that
         would require to chase the IMS irqchips all over the place.

Consequences:

  1) A more radical split between legacy and hierarchical irqdomain
     code in drivers/pci/msi/ into:

       - api
       - legacy
       - irqdomain
       - shared

     That means that we are going to end up with duplicated code for
     some of the mechanisms up to the point where the stuck-in-the-mud
     parts either get converted or deleted.

  2) The device centric storage concept will stay as it does not make
     any sense to push it towards drivers and what's worse it would be a
     major pain vs. the not yet up to the task irqdomains and the legacy
     architecture backends to change that. I really have no interrest to
     make the legacy code 

     It also makes sense because the interrupts are strictly tied to the
     device. They cannot originate from some disconnected layer of thin
     air.

     Sorry Jason, no tables for you. :)

How to get there:

  1) I'm going to post part 1-3 of the series once more with the fallout
     and review comments addressed.

  2) If nobody objects, I'll merge that into tip irq/msi and work on top
     of that.

     The consolidation makes sense on it's own and is required anyway. I
     might need to revisit some of the already touched places, but that
     should be a halfways limited number. I rather do that step for step
     on top than going back to start and mixing the new concepts in from
     the very beginning.

     But I drop part 4 in it's current form because that's going to be
     part of the new infrastructure.

  3) I'll work on that bottom up towards a driver exposable API as that
     is going to be a result of the final requirements of the underlying
     infrastructure.

     The final driver visible interfaces can be bikeshedded on top to
     make them palatable for driver writers.

  4) Convert x86 PCI/MSI[x] to the new scheme

  5) Implement an IMS user.

     The obvious candidate which should be halfways accessible is the
     ath11 PCI driver which falls into that category.

     It uses horrendous hackery to make it "work" by abusing MSI. It's a
     wonder that it works at all, by some definition of "works".

     I'm pretty sure how to make it fall apart without touching a single
     line of code.

     With a small code change I can make it fail hard without blowing up
     any other MSI/MSI-X user except the switchtec NTB.

     That's a prime example for the way how driver writers behave.

     Instead of talking to the people who are responsible for the
     interrupt subsystem, they go off and do their own thing. It does
     not explode on their test machine, but it's not even remotely close
     to the requirements for PCI drivers to work independent of the
     underlying platform.

     Of course the responsible maintainer does not even notice and waves
     it through without questioning it.

Thoughts?

Thanks,

        tglx