Re: [PATCH for-linus] PCI/bwctrl: Fix NULL pointer deref on unbind and bind

On Tue, Jan 07, 2025 at 04:29:18PM +0200, Ilpo Järvinen wrote:
> On Tue, 7 Jan 2025, Lukas Wunner wrote:
> > It seems pcie_bwctrl_setspeed_rwsem is only needed because
> > pcie_retrain_link() calls pcie_reset_lbms_count(), which
> > would recursively acquire pcie_bwctrl_lbms_rwsem.
> >
> > There are only two callers of pcie_retrain_link(), so I'm
> > wondering if the invocation of pcie_reset_lbms_count()
> > can be moved to them, thus avoiding the recursive lock
> > acquisition and allowing pcie_bwctrl_setspeed_rwsem to be removed.
> >
> > An alternative would be to have a __pcie_retrain_link() helper
> > which doesn't call pcie_reset_lbms_count().
> 
> I considered the __pcie_retrain_link() variant, but it felt like
> locking details internal to bwctrl would be leaking elsewhere into
> the code, so I had some dislike towards that solution. I'm not
> strictly against it, though.

That's a fair argument.

It seems the reason you're acquiring pcie_bwctrl_lbms_rwsem in
pcie_reset_lbms_count() is because you need to dereference
port->link_bwctrl so that you can access port->link_bwctrl->lbms_count.

If you get rid of lbms_count and instead use a flag in pci_dev->priv_flags,
then it seems you won't need to acquire the lock and this problem
solves itself.
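
Something like this should do (a rough sketch only; the flag name, bit
number and helper names below are made up for illustration, modeled on
pci_dev_assign_added() / pci_dev_is_added() in drivers/pci/pci.h):

#define PCI_LINK_LBMS_SEEN	3	/* made-up bit in pci_dev->priv_flags */

static inline void pcie_lbms_seen(struct pci_dev *dev)
{
	set_bit(PCI_LINK_LBMS_SEEN, &dev->priv_flags);
}

static inline void pcie_reset_lbms(struct pci_dev *dev)
{
	clear_bit(PCI_LINK_LBMS_SEEN, &dev->priv_flags);
}

static inline bool pcie_lbms_is_seen(const struct pci_dev *dev)
{
	return test_bit(PCI_LINK_LBMS_SEEN, &dev->priv_flags);
}

set_bit(), clear_bit() and test_bit() are atomic, so neither the
interrupt handler nor the callers of pcie_reset_lbms_count() would need
pcie_bwctrl_lbms_rwsem for this, and the quirk could simply test the
bit.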


> > I'm also wondering if the IRQ handler really needs to run in
> > hardirq context.  Is there a reason it can't run in thread
> > context?  Note that CONFIG_PREEMPT_RT=y (as well as the
> > "threadirqs" command line option) cause the handler to be run
> > in thread context, so it must work properly in that situation
> > as well.
> 
> If thread context works now, why was the fix in commit 3e82a7f9031f
> ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked")
> needed (the commit is from the bwnotif era)? What has changed since
> that fix?

Nothing has changed, I had forgotten about that commit.

Basically everything in pcie_bwnotif_irq() after clearing the interrupt
could be moved into an IRQ thread, but that's just the access to the
atomic variable and the pcie_update_link_speed() call.  It's not worth
it because the overhead of waking the IRQ thread is bigger than just
executing those two things in the hardirq handler.
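
Just to make that concrete, the split would look roughly like this (a
sketch only, not the actual bwctrl code; the context argument and the
register handling are assumed from the usual port service driver
conventions), which mostly illustrates how little would move to the
thread:

static irqreturn_t pcie_bwnotif_irq(int irq, void *context)
{
	struct pcie_device *srv = context;
	struct pci_dev *port = srv->port;
	u16 lnksta;

	pcie_capability_read_word(port, PCI_EXP_LNKSTA, &lnksta);
	if (!(lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS)))
		return IRQ_NONE;

	/*
	 * Ack the level-triggered interrupt in hardirq context,
	 * cf. commit 3e82a7f9031f.  LBMS and LABS are RW1C.
	 */
	pcie_capability_write_word(port, PCI_EXP_LNKSTA,
				   lnksta & (PCI_EXP_LNKSTA_LBMS |
					     PCI_EXP_LNKSTA_LABS));
	return IRQ_WAKE_THREAD;
}

static irqreturn_t pcie_bwnotif_irq_thread(int irq, void *context)
{
	/*
	 * Only the leftover work would run here: the LBMS bookkeeping
	 * and the pcie_update_link_speed() call.
	 */
	return IRQ_HANDLED;
}

Registration would then use request_threaded_irq() instead of a
hardirq-only handler.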

So please ignore my comment.


> > Another oddity that caught my eye is the counting of the
> > interrupts.  It seems the only place where lbms_count is read
> > is the pcie_failed_link_retrain() quirk, and it only cares
> > about the count being non-zero.  So this could be a bit in
> > pci_dev->priv_flags that's accessed with set_bit() / test_bit()
> > similar to pci_dev_assign_added() / pci_dev_is_added().
> > 
> > Are you planning on using the count for something else in the
> > future?  If not, using a flag would be simpler and more economical
> > memory-wise.
> 
> Somebody requested having the count exposed, for troubleshooting HW
> problems (IIRC). It was asked of me privately when I posted one of
> the early versions of the bwctrl series (so quite a long time ago).
> I just haven't created the change to expose it under sysfs yet.

There's a patch pending to add trace events support to native PCIe hotplug:

https://lore.kernel.org/all/20241123113108.29722-1-xueshuai@xxxxxxxxxxxxxxxxx/

If that somebody thinks they need to know how often LBMS triggered,
we could just add similar trace events for bandwidth notifications.
That gives us not only the count but also the precise time when the
bandwidth change happened, so it's arguably more useful for debugging.

Trace points are patched in and out of the code path at runtime,
so they have basically zero cost when not enabled (which would be the
default).
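
For illustration, such a trace event could look roughly like this (a
sketch only; the event name and fields are made up and the usual
TRACE_SYSTEM / <trace/define_trace.h> header boilerplate is omitted):

TRACE_EVENT(pcie_bwnotif,
	TP_PROTO(struct pci_dev *port, u16 link_status),
	TP_ARGS(port, link_status),
	TP_STRUCT__entry(
		__field(u16, port_id)
		__field(u16, link_status)
	),
	TP_fast_assign(
		__entry->port_id = pci_dev_id(port);
		__entry->link_status = link_status;
	),
	TP_printk("port=%04x lnksta=%#06x",
		  __entry->port_id, __entry->link_status)
);

Each record carries the usual trace timestamp, so both the count and
the timing of LBMS assertions fall out of it for free.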


> > I'm also worried about the lbms_count overflowing.
>
> Should I perhaps simply do pci_warn() if it happens?

I'd prefer getting rid of the counter altogether. :)


> > Because there's hardware which signals an interrupt before actually
> > setting one of the two bits in the Link Status Register, I'm
> > wondering if it would make sense to poll the register a couple
> > of times in the irq handler.  Obviously this is only an option
> > if the handler is running in thread context.  What was the maximum
> > time you saw during testing that it took to set the LBMS bit belatedly?
> 
> Is there some misunderstanding between us here? I don't think I've
> noticed delayed LBMS assertion. What I saw was the new Link Speed not
> yet updated while Link Training was already 0. In that case, the Link
> Status register was read inside the handler, so I'd assume LBMS was
> already set and actually triggered the interrupt, i.e. it was not set
> belatedly.

Evert's laptop has the BWMgmt+ and ABWMgmt+ bits set on Root Port 00:02.1
in this lspci dump:

https://bugzilla.kernel.org/attachment.cgi?id=307419&action=edit

00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
                LnkSta: Speed 8GT/s, Width x4
                        TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+

How can it be that BWMgmt+ is set?  I would have expected the bandwidth
controller to clear that interrupt.  I can only think of two
explanations: either BWMgmt+ was set but no interrupt was signaled, or
the interrupt was signaled and handled before BWMgmt+ was set.

I'm guessing the latter is the case because /proc/irq/33/spurious
indicates 1 unhandled interrupt.

Back in March 2023 when you showed me your results with various Intel
chipsets, I thought you mentioned that you witnessed this too-early
interrupt situation a couple of times.  But I may be misremembering.

Thanks,

Lukas



