On Tue, Mar 4, 2025 at 5:46 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Tue, Mar 04 2025 at 13:48, Tsai Sung-Fu wrote:
> > On Mon, Mar 3, 2025 at 5:10 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >> > +	hwirq = ctrl * MAX_MSI_IRQS_PER_CTRL;
> >> > +	end = hwirq + MAX_MSI_IRQS_PER_CTRL;
> >> > +	for_each_set_bit_from(hwirq, pp->msi_irq_in_use, end) {
> >> > +		if (hwirq == hwirq_to_check)
> >> > +			continue;
> >> > +		virq = irq_find_mapping(pp->irq_domain, hwirq);
> >> > +		if (!virq)
> >> > +			continue;
> >> > +		mask = irq_get_affinity_mask(virq);
> >>
> >> What protects @mask against a concurrent modification?
> >
> > We hold the &pp->lock in the dw_pci_msi_set_affinity before calling
> > this function
> > so that should help to protect against concurrent modification ?
>
> So that's the same domain, right?
>

Yes, they are all from pp->irq_domain, which the driver allocates by
calling irq_domain_create_linear(). I will add the lockdep_assert_held()
check.

> >> > +	/*
> >> > +	 * Update all the irq_data's effective mask
> >> > +	 * bind to this MSI controller, so the correct
> >> > +	 * affinity would reflect on
> >> > +	 * /proc/irq/XXX/effective_affinity
> >> > +	 */
> >> > +	hwirq = ctrl * MAX_MSI_IRQS_PER_CTRL;
> >> > +	end = hwirq + MAX_MSI_IRQS_PER_CTRL;
> >> > +	for_each_set_bit_from(hwirq, pp->msi_irq_in_use, end) {
> >> > +		virq_downstream = irq_find_mapping(pp->irq_domain, hwirq);
> >> > +		if (!virq_downstream)
> >> > +			continue;
> >> > +		desc_downstream = irq_to_desc(virq_downstream);
> >> > +		irq_data_update_effective_affinity(&desc_downstream->irq_data,
> >> > +						   effective_mask);
> >>
> >> Same here.
> >
> > We hold the &pp->lock in the dw_pci_msi_set_affinity before calling
> > here, so that could help I think ?
>
> A comment would be helpful for the casual reader.

Sure, I will add a comment and the lockdep_assert_held() check there as
well.

>
> >>
> >> > +	}
> >> > +}
> >> > +
> >> > +static int dw_pci_msi_set_affinity(struct irq_data *d,
> >> > +				   const struct cpumask *mask, bool force)
> >> > +{
> >> > +	struct dw_pcie_rp *pp = irq_data_get_irq_chip_data(d);
> >> > +	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> >> > +	int ret;
> >> > +	int virq_parent;
> >> > +	unsigned long hwirq = d->hwirq;
> >> > +	unsigned long flags, ctrl;
> >> > +	struct irq_desc *desc_parent;
> >> > +	const struct cpumask *effective_mask;
> >> > +	cpumask_var_t mask_result;
> >> > +
> >> > +	ctrl = hwirq / MAX_MSI_IRQS_PER_CTRL;
> >> > +	if (!alloc_cpumask_var(&mask_result, GFP_ATOMIC))
> >> > +		return -ENOMEM;
> >>
> >> This does not work on a RT enabled kernel. Allocations with a raw spin
> >> lock held are not possible.
> >>
> >
> > Even with the GFP_ATOMIC flag ? I thought that means it would be safe
> > to do allocation in the atomic context ?
> > Didn't work on the RT kernel before, so please enlighten me if I am
> > wrong and if you don't mind.
>
> https://www.kernel.org/doc/html/latest/locking/locktypes.html#preempt-rt-caveats
> https://www.kernel.org/doc/html/latest/locking/locktypes.html#raw-spinlock-t-on-rt
>

I will remove the allocation and move the cpumask to the device
structure. Thanks for the doc links.
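Something along these lines is what I have in mind for the next version
(an untested sketch: msi_scratch_mask is just a placeholder name for a
new cpumask_var_t member in struct dw_pcie_rp, and the intersection
check only stands in for the rest of the helper):

	/*
	 * e.g. in dw_pcie_msi_host_init(): preallocate the scratch mask so
	 * the affinity path never allocates under the raw pp->lock.
	 */
	if (!zalloc_cpumask_var(&pp->msi_scratch_mask, GFP_KERNEL))
		return -ENOMEM;

static int dw_pci_check_mask_compatibility(struct dw_pcie_rp *pp,
					   unsigned long ctrl,
					   unsigned long hwirq_to_check,
					   struct cpumask *mask_to_check)
{
	unsigned long end, hwirq = ctrl * MAX_MSI_IRQS_PER_CTRL;
	const struct cpumask *mask;
	unsigned int virq;

	/*
	 * All vectors hang off pp->irq_domain and every affinity update
	 * funnels through dw_pci_msi_set_affinity(), which takes pp->lock
	 * before calling us, so the sibling masks cannot change here.
	 */
	lockdep_assert_held(&pp->lock);

	end = hwirq + MAX_MSI_IRQS_PER_CTRL;
	for_each_set_bit_from(hwirq, pp->msi_irq_in_use, end) {
		if (hwirq == hwirq_to_check)
			continue;
		virq = irq_find_mapping(pp->irq_domain, hwirq);
		if (!virq)
			continue;
		mask = irq_get_affinity_mask(virq);
		/* Stand-in for the actual check: the requested mask must
		 * intersect every sibling's mask on this controller. */
		if (!cpumask_and(mask_to_check, mask, mask_to_check))
			return -EINVAL;
	}

	return 0;
}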
> >> > +	/*
> >> > +	 * Loop through all possible MSI vector to check if the
> >> > +	 * requested one is compatible with all of them
> >> > +	 */
> >> > +	raw_spin_lock_irqsave(&pp->lock, flags);
> >> > +	cpumask_copy(mask_result, mask);
> >> > +	ret = dw_pci_check_mask_compatibility(pp, ctrl, hwirq, mask_result);
> >> > +	if (ret) {
> >> > +		dev_dbg(pci->dev, "Incompatible mask, request %*pbl, irq num %u\n",
> >> > +			cpumask_pr_args(mask), d->irq);
> >> > +		goto unlock;
> >> > +	}
> >> > +
> >> > +	dev_dbg(pci->dev, "Final mask, request %*pbl, irq num %u\n",
> >> > +		cpumask_pr_args(mask_result), d->irq);
> >> > +
> >> > +	virq_parent = pp->msi_irq[ctrl];
> >> > +	desc_parent = irq_to_desc(virq_parent);
> >> > +	ret = desc_parent->irq_data.chip->irq_set_affinity(&desc_parent->irq_data,
> >> > +							   mask_result, force);
> >>
> >> Again. Completely unserialized.
> >>
> >
> > The reason why we remove the desc_parent->lock protection is because
> > the chained IRQ structure didn't export parent IRQ to the user space, so we
> > think this call path should be the only caller. And since we have &pp->lock hold
> > at the beginning, so that should help to protect against concurrent
> > modification here ?
>
> "Should be the only caller" is not really a technical argument. If you
> make assumptions like this, then you have to come up with a proper
> explanation why this is correct under all circumstances and with all
> possible variants of parent interrupt chips.
>
> Aside of that fiddling with the internals of interrupt descriptors is
> not really justified here. What's wrong with using the existing
> irq_set_affinity() mechanism?
>
> Thanks,
>
> tglx
>

Using irq_set_affinity() gives us a circular locking dependency warning
at probe time. The irq_startup() path holds the global mask_lock taken
in irq_setup_affinity(), so a deadlock scenario like the one below can
happen: irq_set_affinity() tries to acquire &irq_desc_lock_class while
dw_pci_msi_set_affinity() already holds &pp->lock.

To resolve that, we would like to reuse the idea of replacing that
global lock with a per-CPU structure, following this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64b6d1d7a84538de34c22a6fc92a7dcc2b196b64
Just wanted to check whether this looks feasible to you?

Chain exists of:
  &irq_desc_lock_class --> mask_lock --> &pp->lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&pp->lock);
                               lock(mask_lock);
                               lock(&pp->lock);
  lock(&irq_desc_lock_class);

  *** DEADLOCK ***

Thanks
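P.S. To make the inversion concrete, this is roughly the shape the
callback takes if it simply calls irq_set_affinity() on the parent
vector (a simplified sketch; the mask-compatibility and
effective-affinity handling from the patch are elided):

static int dw_pci_msi_set_affinity(struct irq_data *d,
				   const struct cpumask *mask, bool force)
{
	struct dw_pcie_rp *pp = irq_data_get_irq_chip_data(d);
	unsigned long ctrl = d->hwirq / MAX_MSI_IRQS_PER_CTRL;
	unsigned long flags;
	int ret;

	raw_spin_lock_irqsave(&pp->lock, flags);	/* takes &pp->lock */

	/*
	 * irq_set_affinity() acquires the parent's desc->lock
	 * (&irq_desc_lock_class) while &pp->lock is held. That is the
	 * reverse of the irq_startup() ordering
	 * desc->lock -> mask_lock -> &pp->lock established through
	 * irq_setup_affinity(), hence the splat above.
	 */
	ret = irq_set_affinity(pp->msi_irq[ctrl], mask);

	raw_spin_unlock_irqrestore(&pp->lock, flags);

	return ret;
}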