On Wed, 8 Jan 2025, Lukas Wunner wrote:

> On Tue, Jan 07, 2025 at 04:29:18PM +0200, Ilpo Järvinen wrote:
> > On Tue, 7 Jan 2025, Lukas Wunner wrote:
> > > It seems pcie_bwctrl_setspeed_rwsem is only needed because
> > > pcie_retrain_link() calls pcie_reset_lbms_count(), which
> > > would recursively acquire pcie_bwctrl_lbms_rwsem.
> > >
> > > There are only two callers of pcie_retrain_link(), so I'm
> > > wondering if the invocation of pcie_reset_lbms_count()
> > > can be moved to them, thus avoiding the recursive lock
> > > acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem.
> > >
> > > An alternative would be to have a __pcie_retrain_link() helper
> > > which doesn't call pcie_reset_lbms_count().
> >
> > I considered the __pcie_retrain_link() variant, but it felt like locking
> > details that are internal to bwctrl would be leaking into other parts of
> > the code, so I had some level of dislike towards that solution. I'm not
> > strictly against it, though.
>
> That's a fair argument.
>
> It seems the reason you're acquiring pcie_bwctrl_lbms_rwsem in
> pcie_reset_lbms_count() is because you need to dereference
> port->link_bwctrl so that you can access port->link_bwctrl->lbms_count.
>
> If you get rid of lbms_count and instead use a flag in pci_dev->priv_flags,
> then it seems you won't need to acquire the lock and this problem
> solves itself.

Agreed on both points.

> > > I'm also wondering if the IRQ handler really needs to run in
> > > hardirq context. Is there a reason it can't run in thread
> > > context? Note that CONFIG_PREEMPT_RT=y (as well as the
> > > "threadirqs" command line option) cause the handler to be run
> > > in thread context, so it must work properly in that situation
> > > as well.
> >
> > If thread context would work now, why was the fix in commit
> > 3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are
> > acked") needed (the commit is from the bwnotif era)? What has changed
> > since that fix?
>
> Nothing has changed, I had forgotten about that commit.
>
> Basically you could move everything in pcie_bwnotif_irq() after clearing
> the interrupt into an IRQ thread, but that would just be the access to the
> atomic variable and the pcie_update_link_speed() call. That's not worth it
> because the overhead to wake the IRQ thread is bigger than just executing
> those things in the hardirq handler.
>
> So please ignore my comment.
>
> > > Another oddity that caught my eye is the counting of the
> > > interrupts. It seems the only place where lbms_count is read
> > > is the pcie_failed_link_retrain() quirk, and it only cares
> > > about the count being non-zero. So this could be a bit in
> > > pci_dev->priv_flags that's accessed with set_bit() / test_bit()
> > > similar to pci_dev_assign_added() / pci_dev_is_added().
> > >
> > > Are you planning on using the count for something else in the
> > > future? If not, using a flag would be simpler and more economical
> > > memory-wise.
> >
> > Somebody requested having the count exposed for troubleshooting HW
> > problems (IIRC); it was asked of me privately when I posted one of the
> > early versions of the bwctrl series (so quite a long time ago). I've
> > just not created that change yet to put it under sysfs.
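
As an aside, since the direction seems to be to drop the counter: here is a
rough, untested sketch of what the flag-based replacement could look like,
modelled on pci_dev_assign_added() / pci_dev_is_added(). The bit number and
the pcie_*_lbms_seen() names are made up purely for illustration;
set_bit()/test_bit()/clear_bit() are atomic, so no rwsem would be needed
just to touch the flag.

#include <linux/bitops.h>
#include <linux/pci.h>

/* Hypothetical bit in pci_dev->priv_flags; the number is arbitrary here. */
#define PCI_DEV_LBMS_SEEN	3

static inline void pcie_set_lbms_seen(struct pci_dev *dev)
{
	set_bit(PCI_DEV_LBMS_SEEN, &dev->priv_flags);
}

static inline bool pcie_lbms_seen(struct pci_dev *dev)
{
	return test_bit(PCI_DEV_LBMS_SEEN, &dev->priv_flags);
}

static inline void pcie_reset_lbms_seen(struct pci_dev *dev)
{
	clear_bit(PCI_DEV_LBMS_SEEN, &dev->priv_flags);
}

The pcie_failed_link_retrain() quirk would then only need pcie_lbms_seen(),
and pcie_retrain_link() could presumably call pcie_reset_lbms_seen()
directly, without dereferencing port->link_bwctrl and hence without taking
pcie_bwctrl_lbms_rwsem.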

> There's a patch pending to add trace events support to native PCIe hotplug:
>
> https://lore.kernel.org/all/20241123113108.29722-1-xueshuai@xxxxxxxxxxxxxxxxx/
>
> If that somebody thinks they need to know how often LBMS triggered,
> we could just add similar trace events for bandwidth notifications.
> That gives us not only the count but also the precise time when the
> bandwidth change happened, so it's arguably more useful for debugging.
>
> Trace points are patched in and out of the code path at runtime,
> so they have basically zero cost when not enabled (which would be the
> default).
>
> > > I'm also worried about the lbms_count overflowing.
> >
> > Should I perhaps simply do pci_warn() if it happens?
>
> I'd prefer getting rid of the counter altogether. :)

Okay, that's a good suggestion. Trace events seem like a much better
approach to the problem.

> > > Because there's hardware which signals an interrupt before actually
> > > setting one of the two bits in the Link Status Register, I'm
> > > wondering if it would make sense to poll the register a couple
> > > of times in the irq handler. Obviously this is only an option
> > > if the handler is running in thread context. What was the maximum
> > > time you saw during testing that it took to set the LBMS bit belatedly?
> >
> > Is there some misunderstanding between us here? I don't think I've
> > noticed delayed LBMS assertion. What I saw was the new Link Speed not yet
> > updated when Link Training was already 0. In that case, the Link Status
> > register was read inside the handler, so I'd assume LBMS was set in order
> > to actually trigger the interrupt and thus not set belatedly.
>
> Evert's laptop has the BWMgmt+ ABWMgmt+ bits set on Root Port 00:02.1
> in this lspci dump:
>
> https://bugzilla.kernel.org/attachment.cgi?id=307419&action=edit
>
> 00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge (prog-if 00 [Normal decode])
> 		LnkSta:	Speed 8GT/s, Width x4
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
>
> How can it be that BWMgmt+ is set? I would have expected the bandwidth
> controller to clear that interrupt. I can only think of two explanations:
> Either BWMgmt+ was set but no interrupt was signaled. Or the interrupt
> was signaled and handled before BWMgmt+ was set.

Either one of those, or a third alternative: the bit is set more than once
and no interrupt occurs on the second assertion (could be e.g. some race
between the write-1-to-clear and LBMS being reasserted).

But I've not seen this behavior before that report, so I cannot answer your
question about how long it took for LBMS to get asserted.

This misbehavior is a bit problematic because those same bits are used to
identify whether the interrupt belongs to bwctrl or not. So if they're not
set by the time the handler runs, I have no way to tell that I'd need to
poll those bits in the first place (and it's also known the interrupt is
shared with other stuff :-().

> I'm guessing the latter is the case because /proc/irq/33/spurious
> indicates 1 unhandled interrupt.
>
> Back in March 2023 when you showed me your results with various Intel
> chipsets, I thought you mentioned that you witnessed this too-early
> interrupt situation a couple of times. But I may be misremembering.

I've seen interrupts occur before the new Link Speed has been updated, but
those still had LT=1 and there was _another_ interrupt later. Handling that
later interrupt read the new Link Speed with LT=0, so effectively there was
just one extra interrupt.
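
To tie the two problems together — the shared-IRQ identification issue above
and the "speed not ready yet" interrupts — here is an illustrative sketch,
explicitly not the in-tree pcie_bwnotif_irq(), of a handler that claims the
interrupt based only on the LBMS/LABS bits in the Link Status register and
leaves the speed read-out alone while LT is still set:

#include <linux/interrupt.h>
#include <linux/pci.h>

/*
 * Illustrative only -- not the actual pcie_bwnotif_irq().  Shows why the
 * handler depends on LBMS/LABS being set: without them it has to return
 * IRQ_NONE (the IRQ is shared), and it cannot know it should poll.
 */
static irqreturn_t bwnotif_irq_sketch(int irq, void *context)
{
	struct pci_dev *port = context;
	u16 lnksta;

	pcie_capability_read_word(port, PCI_EXP_LNKSTA, &lnksta);

	/* Neither BW Mgmt nor Autonomous BW status set: not our interrupt. */
	if (!(lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS)))
		return IRQ_NONE;

	/* Write 1 to clear the status bits, acking the level-triggered IRQ. */
	pcie_capability_write_word(port, PCI_EXP_LNKSTA,
				   lnksta & (PCI_EXP_LNKSTA_LBMS |
					     PCI_EXP_LNKSTA_LABS));

	/*
	 * If Link Training is still in progress, Current Link Speed may be
	 * stale; in the cases described above a second interrupt (or a later
	 * pcie_update_link_speed() call) picks up the final speed.
	 */
	if (lnksta & PCI_EXP_LNKSTA_LT)
		return IRQ_HANDLED;

	/* The real handler would update the cached link speed here. */
	return IRQ_HANDLED;
}

Whether deferring the speed read like that is actually safe is exactly what
is unclear in the premature LT=0 case below.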

As for why there was an extra interrupt at all: I assume LBMS is simply
asserted twice, and the extra interrupt seemed harmless (no idea why the HW
ends up doing it, though).

The other case, which is a real problem, had LT=0 but no new link speed in
the Current Link Speed field. So it is "premature" in a sense, but seemingly
the training had completed. In that case there was no second interrupt, so
the handler could not acquire the new link speed.

bwctrl does read the new link speed outside the handler by calling
pcie_update_link_speed(), which is mainly there to handle misbehaviors where
BW notifications are not coming at all (if the link speed changes outside of
bwctrl setting the speed, those changes would otherwise go undetected). It
will still be racy, though, if LT is deasserted before the Current Link
Speed is set (I've not seen it fail to get the new speed, but I haven't
tried to stress test it either).

-- 
 i.