On Wed, Mar 05 2025 at 19:21, Tsai Sung-Fu wrote:
> On Tue, Mar 4, 2025 at 5:46 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> >> > +	ret = desc_parent->irq_data.chip->irq_set_affinity(&desc_parent->irq_data,
>> >> > +							   mask_result, force);
>> >>
>> >> Again. Completely unserialized.
>> >
>> > The reason why we remove the desc_parent->lock protection is because
>> > the chained IRQ structure didn't export the parent IRQ to user space, so we
>> > think this call path should be the only caller. And since we hold &pp->lock
>> > at the beginning, that should help to protect against concurrent
>> > modification here?
>>
>> "Should be the only caller" is not really a technical argument. If you
>> make assumptions like this, then you have to come up with a proper
>> explanation why this is correct under all circumstances and with all
>> possible variants of parent interrupt chips.
>>
>> Aside of that, fiddling with the internals of interrupt descriptors is
>> not really justified here. What's wrong with using the existing
>> irq_set_affinity() mechanism?
>
> Using irq_set_affinity() would give us a circular deadlock warning
> at the probe stage. The irq_startup() flow holds the global lock from
> irq_setup_affinity(), so a deadlock scenario like the one below might
> happen: irq_set_affinity() would try to acquire &irq_desc_lock_class
> while dw_pci_msi_set_affinity() already holds &pp->lock. To resolve
> that we would like to reuse the idea of replacing the global lock in
> irq_set_affinity() with a PERCPU structure from this patch ->
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64b6d1d7a84538de34c22a6fc92a7dcc2b196b64
> Just would like to know if this looks feasible to you?

It won't help. See below.
> Chain exists of:
>   &irq_desc_lock_class --> mask_lock --> &pp->lock
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&pp->lock);
>                                lock(mask_lock);
>                                lock(&pp->lock);
>   lock(&irq_desc_lock_class);

Indeed. Back to irq_set_affinity(). I did not think about the more
problematic issue with calling irq_set_affinity(): the per-CPU mask is
then used recursively, which is a bad idea.

But looking deeper, your implementation breaks some other things, e.g.
the affinity coupling of interrupt threads on the MSI side. They have
to be coupled to the effective affinity of the underlying hard
interrupt. That's a solvable problem, but not in the driver.

Aside of the fact that this hardware design is a trainwreck, which
defeats the intended flexibility of MSI, solving this at the chip
driver level ends up in an incomplete and hacky solution. Also, as this
is not the only IP block which was cobbled together mindlessly that
way, we really don't want to proliferate this horrorshow all over the
place.

This really wants generic infrastructure which handles all aspects of
this correctly once and for all of these hardware trainwrecks. That
needs some thought and effort, but it's definitely preferred over
half baked driver specific hackery, which fiddles with the guts of the
interrupt descriptors as it sees fit.

From a layering POV, an interrupt chip driver should not even care
about the fact that an interrupt descriptor exists. We've spent a lot
of effort to decouple these things, and just because C allows it does
not mean it's correct or a good idea to ignore the layering.

The approach you have taken has a massive downside: it forces all (32)
MSI interrupts, which are demultiplexed from the underlying parent
interrupt, to have the same affinity, and the user space interface in
/proc/irq/*/affinity becomes more than confusing. This is not what
user space expects from an interface which controls the affinity of an
individual interrupt.
No user and no tool expects that it has to deal with conflicting
settings for MSI interrupts which are independent of each other. Tools
like irqbalanced will be pretty unhappy about the behaviour and you
would have to teach them about the severe limitations of this.

I'm absolutely not convinced that this solves more problems than it
creates. In fact, you can achieve the very same outcome with a clean
and straightforward ten line change: instead of the hidden chained
handlers, request regular interrupts for demultiplexing and give them
proper names: 'DWC_MSI_DEMUX-0-31', 'DWC_MSI_DEMUX-32-63', ...

That provides exactly the same mechanism w/o all the hackery and
creates the same (solvable) problem vs. interrupt thread affinity, but
provides an understandable user interface. No?

If you really want to provide a useful and flexible mechanism to steer
the affinity of the individual MSI interrupts, which are handled by
each control block, then you have to think about a completely
different approach.

The underlying interrupt per control block will always be affine to a
particular CPU. But it's not written in stone that the demultiplexing
interrupt actually has to handle the pending interrupts in the context
of the demultiplexer, at least not for MSI interrupts, which are edge
type and from the PCI device side considered 'fire and forget'.

So right now the demultiplex handler does:

    bits = read_dbi(...);
    for_each_bit(bit, bits)
        generic_handle_domain_irq(domain, base + bit);

generic_handle_domain_irq() gets the interrupt mapping for the
hardware interrupt number and invokes the relevant flow handler
directly from the same context, which means that the affinity of all
those interrupts is bound to the affinity of the underlying
demultiplex interrupt.
Ignoring the devil in the details, this can also be handled by
delegating the handling to a different CPU:

    bits = read_dbi(...);
    for_each_bit(bit, bits)
        generic_handle_demux_domain_irq(domain, base + bit);

where generic_handle_demux_domain_irq() does:

    desc = irq_resolve_mapping(domain, hwirq);
    if (desc->target_cpu == smp_processor_id()) {
        handle_irq_desc(desc);
    } else {
        // Issue DWC ack() here? - See below
        irq_work_queue_on(&desc->irq_work, desc->target_cpu);
    }

The generic irq_work handler does:

    desc = container_of(work, struct irq_desc, irq_work);
    handle_irq_desc(desc);

That's obviously asynchronous, but in the case of edge type MSI
interrupts, that's not a problem at all.

The real problems are:

  1) Handling the pending bit of the interrupt in the DBI, i.e. what
     dw_pci_bottom_ack() does

  2) Life time of the interrupt descriptor

  3) Slightly increased latency of the rerouted interrupts

  4) Affinity of the demultiplex interrupt

#1 Needs some thought. From the top of my head I think it would
   require that dw_pci_bottom_ack() is issued in the context of the
   demultiplex handler as it would otherwise be retriggered, but that
   just needs some research in the datasheet and experimental
   validation.

   The actual interrupt flow handler for the MSI interrupt is then no
   longer handle_edge_irq() as it should not issue the ack() again.
   From a correctness POV vs. interrupt handling of an edge type
   interrupt, that's perfectly fine.

#2 Trivial enough to solve by flushing and invalidating the irq_work
   before shutdown/removal.

#3 That might be an issue for some extreme scenarios, but in the
   general case I claim that it does not matter at all. Those
   scenarios where it really matters can affine the interrupt to the
   affinity of the demultiplexing interrupt, which leads to #4.

#4 That's a trivial problem to solve. There is absolutely no technical
   reason to make the demultiplex interrupts hidden. Demultiplexing
   just works fine from the context of regular interrupts.
There are some other minor issues like CPU hotplug, but I can't see
anything truly complex in this design right now. Famous last words :)

At the driver level this becomes really simple. Affinity changes are
just selecting a target CPU and storing it for a trivial comparison to
avoid a CPU mask check in the demultiplex path. No fiddling with
sibling or parent interrupts, no locking issues. Thread affinity and
other things just work as expected.

I'm really tempted to play with that. Such infrastructure would allow
us to support PCI/multi-MSI on x86 without interrupt remapping, which
has been on the wishlist of a lot of people for many years.

Thanks,

        tglx