Hi Thomas,
On 8/8/2020 12:47 PM, Thomas Gleixner wrote:
Megha,
"Dey, Megha" <megha.dey@xxxxxxxxx> writes:
On 8/7/2020 9:47 AM, Thomas Gleixner wrote:
I'm all for sharing code and making the life of driver writers simple
because that makes my life simple as well, but not by creating a layer
at the wrong level and then hacking it into submission until it finally
collapses.
Designing the infrastructure following the clear layering rules of
hierarchical domains so it works for IMS and also replaces the platform
MSI hack is the only sane way to go forward, not the other way round.
From what I've gathered, I need to:
1. Get rid of the mantra that "IMS" is an extension of platform-msi.
2. Make this new infra devoid of any platform-msi references
See below.
ok..
3. Come up with a ground up approach which adheres to the layering
constraints of the IRQ subsystem
Yes. It's something which can be used by all devices which have:
1) A device specific irq chip implementation including a msi write function
2) Device specific resource management (slots in the IMS case)
The infrastructure you need is basically a wrapper around the core MSI
domain (similar to PCI, platform-MSI etc.) which provides the specific
functionality to handle the above.
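For illustration only (not actual driver code; the slot layout, register
offsets and helper names below are made up, and the usual <linux/irq.h>,
<linux/msi.h>, <linux/io.h> includes are assumed), such a per device IMS
irq chip could look roughly like this:

/* Illustrative sketch: one entry in the device's IMS storage array */
struct idxd_ims_slot {
        void __iomem *base;     /* MMIO base of this IMS entry (assumed layout) */
};

static void idxd_ims_write_msg(struct irq_data *data, struct msi_msg *msg)
{
        struct idxd_ims_slot *slot = irq_data_get_irq_chip_data(data);

        /* Program the routed address/data into the device's slot */
        iowrite32(msg->address_lo, slot->base + 0x0);
        iowrite32(msg->address_hi, slot->base + 0x4);
        iowrite32(msg->data, slot->base + 0x8);
}

static void idxd_ims_mask(struct irq_data *data)
{
        struct idxd_ims_slot *slot = irq_data_get_irq_chip_data(data);

        /* Hypothetical per-slot mask bit at offset 0xc */
        iowrite32(1, slot->base + 0xc);
}

static void idxd_ims_unmask(struct irq_data *data)
{
        struct idxd_ims_slot *slot = irq_data_get_irq_chip_data(data);

        iowrite32(0, slot->base + 0xc);
}

static struct irq_chip idxd_ims_irq_chip = {
        .name                   = "idxd-ims",
        .irq_mask               = idxd_ims_mask,
        .irq_unmask             = idxd_ims_unmask,
        .irq_write_msi_msg      = idxd_ims_write_msg,
        .irq_ack                = irq_chip_ack_parent,
};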
Ok, I will create a per device irq chip which will directly have the
device specific callbacks instead of another layer of redirection.
This way I will get rid of the 'platform_msi_ops' data structure.
I am not sure what you mean by device specific resource management;
are you referring to dev_msi_alloc/free_irqs?
4. Have common code (drivers/irqchip maybe??) where we put in all the
generic IMS-specific bits for the IRQ chip and domain,
which can be used by all device drivers belonging to this "IMS" class.
Yes, you can provide a common implementation for devices which share the
same irq chip and domain (slot management functionality)
Yeah, I think most of the msi_domain_ops (msi_prepare, set_desc etc.) are
common and can be moved into a common file.
5. Have the device driver do the rest:
- create the chip/domain (one chip/domain per device?)
- provide device specific callbacks for masking, unmasking, write message
Correct, but you don't need any magic new data structures for that, the
existing msi_domain_info/msi_domain_ops and related structures are
either sufficient or can be extended when necessary.
So for the IDXD case you need:
1) An irq chip with mask/unmask callbacks and a write msg function
2) A slot allocation or association function and its 'free'
counterpart (irq_domain_ops)
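A rough sketch of where the slot handling could live (the helper names
idxd_ims_alloc_slot()/idxd_ims_free_slot()/idxd_ims_slot_addr() are
assumptions, not existing API; this variant hooks into the msi_domain_ops
->msi_init()/->msi_free() callbacks rather than raw irq_domain_ops, but
the point is the same: the slot is allocated and freed as part of the
domain allocation path):

static int idxd_msi_init(struct irq_domain *domain,
                         struct msi_domain_info *info,
                         unsigned int virq, irq_hw_number_t hwirq,
                         msi_alloc_info_t *arg)
{
        struct idxd_device *idxd = info->data;  /* device stashed in info->data */
        int slot;

        /* Reserve one entry in the device's IMS storage array */
        slot = idxd_ims_alloc_slot(idxd);
        if (slot < 0)
                return slot;

        /* Hand the slot to the irq chip callbacks as chip_data */
        irq_domain_set_info(domain, virq, hwirq, info->chip,
                            idxd_ims_slot_addr(idxd, slot),
                            info->handler, info->handler_data,
                            info->handler_name);
        return 0;
}

static void idxd_msi_free(struct irq_domain *domain,
                          struct msi_domain_info *info, unsigned int virq)
{
        struct irq_data *data = irq_domain_get_irq_data(domain, virq);

        /* Give the IMS slot back to the device */
        idxd_ims_free_slot(irq_data_get_irq_chip_data(data));
}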
This is one part I didn't understand.
Currently my dev_msi_alloc_irqs is simply a wrapper over
platform_msi_domain_alloc_irqs which again mostly calls
msi_domain_alloc_irqs.
When you say add a .alloc, .free, does this mean we should add a device
specific alloc/free and not use the default
msi_domain_alloc/msi_domain_free?
I don't see anything device specific to be done for IDXD at least.
Can you please let me know?
The function and struct pointers go into the appropriate
msi_domain_info/msi_domain_ops structures along with the correct flags,
and then the infrastructure creates your domain, fills in
the shared functions and sets the whole thing up.
That's all what a device driver needs to provide, i.e. stick the device
specific functionality into the right data structures and let the common
infrastructure deal with it. The rest just works and the device specific
functions are invoked from the right places when required.
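Continuing the hypothetical sketch from above (idxd_msi_init/idxd_msi_free
and idxd_ims_irq_chip are the assumed device specific pieces), the driver
side then boils down to filling in something like:

static struct msi_domain_ops idxd_msi_domain_ops = {
        .msi_init       = idxd_msi_init,        /* IMS slot allocation */
        .msi_free       = idxd_msi_free,        /* IMS slot release */
};

static struct msi_domain_info idxd_msi_domain_info = {
        .flags          = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
        .ops            = &idxd_msi_domain_ops,
        .chip           = &idxd_ims_irq_chip,
        .handler        = handle_edge_irq,
        .handler_name   = "edge",
        /* .data is pointed at the idxd device before the domain is created */
};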
Yeah, makes sense.
So from the hierarchical domain standpoint, we will have:
- For DSA device: vector->intel-IR->IDXD
- For Jason's device: root domain-> domain A-> Jason's device's IRQ domain
- For any other Intel IMS device in the future which:
  - does not require interrupt remapping: vector->new device IRQ domain
  - requires interrupt remapping: vector->intel-IR->new device IRQ domain
    (i.e. create a new domain even though IDXD is already present?)
What's special about IDXD? It's just one specific implementation of IMS
and any other device implementing IMS is completely independent and as
documented in the specification the IMS slot management and therefore
the mask/unmask functionality can and will be completely different. IDXD
has a storage array with slots, Jason's hardware puts the IMS slot into
the queue storage.
It does not matter whether a device comes from Intel or any other vendor,
nor does it matter whether the device works with direct vector
delivery or interrupt remapping.
IDXD is not any different from any other IMS capable device when you
look at it from the interrupt hierarchy. It's either:
vector -> IR -> device
or
vector -> device
The only point where this is differentiated is when the irq domain is
created. Anything else just falls into place.
Yeah, so I will create the IRQ domain in the IDXD driver with INTEL-IR
as the parent, instead of creating a common per-IR-unit domain.
To answer Jason's question: No, the parent is never the PCI/MSI irq
domain because that sits at the same level as that device
domain. Remember the scheme:
vector --- DMAR-MSI
       |
       |-- ....
       |
       |-- IR-0 --- IO/APIC-0
       |        |
       |        |-- IO/APIC-1
       |        |
       |        |-- PCI/MSI-0
       |        |
       |        |-- HPET/MSI-0
       |        |
       |        |-- DEV-A/MSI-0
       |        |-- DEV-A/MSI-1
       |        |-- DEV-B/MSI-2
       |
       |-- IR-1 --- PCI/MSI-1
                |
                |-- DEV-C/MSI-3
The PCI/MSI domain(s) are dealing solely with PCI standard compliant
MSI/MSI-X. IMS or similar (platform-MSI being one variant) sit at the
same level as the PCI/MSI domains.
Why? It's how the hardware operates.
The PCI/MSI "irq chip" is configured by the PCI/MSI domain level and it
sends its message to the interrupt parent in the hierarchy, i.e. either
to the Interrupt Remap unit or to the configured vector of the target
CPU.
IMS does not send it to some magic PCI layer first, at least not at the
conceptual level. The fact that the message is transported by PCIe does
not change that at all. PCIe in that case is solely the transport; the
"irq chip" at the PCI/MSI level of the device is not involved at
all. If it were, that would be a different story.
So now you might ask why we have a single PCI/MSI domain per IR unit and
why I want separate IMS domains.
The answer is in the hardware again. PCI/MSI is uniform across devices
so the irq chip and all of the domain functionality can be shared. But
then we have two PCI/MSI domains in the above example because again the
hardware has one connected to IR unit 0 and the other to IR unit 1.
IR-0 and IR-1 manage different resources (remap tables), so PCI/MSI-0
depends on IR-0 and PCI/MSI-1 on IR-1, which is reflected in the
parent/child relationship of the domains.
There is another reason why we can spawn a single PCI/MSI domain per
root complex / IR unit. The PCI/MSI domains are not doing any resource
management at all. The resulting message is created from the allocated
vector (direct CPU delivery) or from the allocated Interrupt remapping
slot information. The domain just deals with the logic required to
handle PCI/MSI(X) and the necessary resources are provided by the parent
interrupt layers.
IMS is different. It needs device specific resource management to
allocate an IMS slot which is clearly part of the "irq chip" management
layer, aka. the irq domain. If the IMS slot management happened in a
global or per-IR-unit table and, as a consequence, the management, layout
and mask/unmask operations were uniform, then an IMS domain per system or
per IR unit would be the right choice, but that's not how the hardware is
specified and implemented.
Now coming back to platform MSI. The way it looks is:
CPU --- (IR) ---- PLATFORM-MSI --- PLATFORM-DEVICE-MSI-0
                                |-- PLATFORM-DEVICE-MSI-1
                                |...
PLATFORM-MSI is a common resource management which also provides a
shared interrupt chip which operates at the PLATFORM-MSI level with one
exception:
The irq_msi_write_msg() callback has an indirection so the actual
devices can provide their device specific msi_write_msg() function.
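That indirection is what a driver consumes today via
platform_msi_domain_alloc_irqs() (rough illustration only; the
my_device_* names are made up):

static void my_device_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
{
        /* Device specific storage of address/data goes here */
}

static int my_device_request_irqs(struct device *dev, unsigned int nvec)
{
        /* The shared PLATFORM-MSI domain does the resource management and
         * only calls back into the driver to actually write the message.
         */
        return platform_msi_domain_alloc_irqs(dev, nvec,
                                              my_device_write_msi_msg);
}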
That's a borderline abuse of the hierarchy, but it makes sense to some
extent as the actual PLATFORM-MSI domain is a truly shared resource and
the only device specific functionality required is the message
write. But that message write is not something which has its own
resource management; it's just a non-uniform storage accessor. IOW, the
underlying PLATFORM-MSI domain does all resource management including
message creation, and the quirk merely allows the message to be written
in the device specific way. Not that I love it, but ...
That is the main difference between platform MSI and IMS. IMS is
completely non-uniform and the devices do not share any common resource
or chip functionality. Each device has its own message store management,
slot allocation/assignment and device specific interrupt chip
functionality which goes way beyond the nasty write msg quirk.
Thanks for giving such a detailed explanation! It really helps :)
What I still don't fully understand is: if all the IMS devices
need the same domain ops and chip callbacks, we will be creating
various instances of the same IRQ chip and domain, right? Is that ok?
Why would it not be ok? Are you really worried about a few hundred bytes
of memory required for this?
Sharing an instance only makes sense if the instance handles a shared or
uniform resource space, which is clearly not the case with IMS.
We create several PCI/MSI domains and several IO/APIC domains on larger
systems. They all share the code, but they are dealing with separate
resources so they have separate storage.
ok, got it ..
Currently the creation of the IRQ domain happens at the IR level so that
we can reuse the same domain, but if it is advisable to have a per device
interrupt domain, I will shift this to the device driver.
Again. Look at the layering. What you created now is a pseudo shared
domain which needs
1) An indirection layer for providing device specific functions
2) An extra allocation layer in the device specific driver to assign
IMS slots completely outside of the domain allocation mechanism.
Hmmm, again I am not sure which extra allocation layer you are
referring to.
In other words you try to make things which are neither uniform nor
share a resource space look the same way. That's the "all I have is a
hammer so everything is a nail" approach. That never worked out well.
With a per device domain/chip approach you get one consistent domain
per device which provides
1) The device specific resource management (i.e. slot allocation
becomes part of the irq domain operations)
2) The device specific irq chip functions at the correct point in the
layering without the horrid indirections
3) Consolidated data storage at the device level where the actual
data is managed.
This is of course sharing as much code as possible with the MSI core
implementation.
As a side effect any extension of this be it on the domain or the irq
chip side is just a matter of adding the functionality to that
particular incarnation and not by having yet another indirection
logic at the wrong place.
The price you pay is a bit of memory but you get a clean layering and
separation of functionality as a reward. The amount of code in the
actual IMS device driver is not going to be much more than with the
approach you have now.
The infrastructure itself is not more than a thin wrapper around the
existing msi domain infrastructure and might even share code with
platform-msi.
From your explanation:
In the device driver:
static const struct irq_domain_ops idxd_irq_domain_ops = {
        .alloc          = idxd_domain_alloc, // not sure what this should do
        .free           = idxd_domain_free,
};

struct irq_chip idxd_irq_chip = {
        .name                   = "idxd",
        .irq_mask               = idxd_irq_mask,
        .irq_unmask             = idxd_irq_unmask,
        .irq_write_msi_msg      = idxd_irq_write_msg,
        .irq_ack                = irq_chip_ack_parent,
        .irq_retrigger          = irq_chip_retrigger_hierarchy,
        .flags                  = IRQCHIP_SKIP_SET_WAKE,
};

struct msi_domain_info idxd_domain_info = {
        .flags          = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
        .ops            = &dev_msi_domain_ops,  // can be common
        .chip           = &idxd_irq_chip,       // per device
        .handler        = handle_edge_irq,
        .handler_name   = "edge",
};

dev->msi_domain = dev_msi_create_irq_domain(iommu->ir_domain,
                                            &idxd_domain_info,
                                            &idxd_irq_domain_ops);
msi_domain_alloc_irqs(dev->msi_domain, dev, nvec);
Common code:
struct irq_domain *dev_msi_create_irq_domain(struct irq_domain *parent,
                                             struct msi_domain_info *dev_msi_domain_info,
                                             const struct irq_domain_ops *dev_msi_irq_domain_ops)
{
        struct irq_domain *domain;
        .......
        domain = irq_domain_create_hierarchy(parent, IRQ_DOMAIN_FLAG_MSI, 0,
                                             NULL, dev_msi_irq_domain_ops,
                                             dev_msi_domain_info);
        .......
        return domain;
}
static struct msi_domain_ops dev_msi_domain_ops = {
        .set_desc       = dev_msi_set_desc,
        .msi_prepare    = dev_msi_prepare,
        .get_hwirq      = dev_msi_get_hwirq,
}; // can re-use platform-msi data structures
Except for the alloc/free irq_domain_ops, does this look fine to you?
-Megha
Thanks,
tglx