On Sun, Jan 05, 2025 at 06:54:24PM +0200, Ilpo Järvinen wrote:
> Indeed, it certainly didn't occur to me while arranging the code the
> way it is that there are other sources for the same irq. However,
> there is a reason those lines were within the same critical section
> (I also realized it's not documented anywhere):
>
> As bwctrl has two operating modes, one with BW notifications and the
> other without them, there are races when switching between those
> modes during probe wrt. the call to the lbms counting accessor, and I
> reused those rw semaphores to prevent those races (the race fixes
> were noted only in a history bullet of the bwctrl series).

Could you add code comment(s) to document this?

I've respun the patch, but of course yesterday was a holiday in
Finland, so I'm hoping you get a chance to review the v2 patch today.

It seems pcie_bwctrl_setspeed_rwsem is only needed because
pcie_retrain_link() calls pcie_reset_lbms_count(), which would
recursively acquire pcie_bwctrl_lbms_rwsem.

There are only two callers of pcie_retrain_link(), so I'm wondering if
the invocation of pcie_reset_lbms_count() can be moved to them, thus
avoiding the recursive lock acquisition and allowing us to get rid of
pcie_bwctrl_setspeed_rwsem.  An alternative would be to have a
__pcie_retrain_link() helper which doesn't call pcie_reset_lbms_count()
(rough sketch at the end of this mail).

Right now there are no fewer than three locks used by bwctrl (the two
global rwsems plus the per-port mutex).  That doesn't look elegant and
makes it difficult to reason about the code, so simplifying the
locking would be desirable I think.

I'm also wondering if the IRQ handler really needs to run in hardirq
context.  Is there a reason it can't run in thread context?  Note that
CONFIG_PREEMPT_RT=y (as well as the "threadirqs" command line option)
causes the handler to run in thread context, so it must work properly
in that situation as well.

Another oddity that caught my eye is the counting of the interrupts.
It seems the only place where lbms_count is read is the
pcie_failed_link_retrain() quirk, and it only cares about the count
being non-zero.  So this could be a bit in pci_dev->priv_flags that's
accessed with set_bit() / test_bit(), similar to pci_dev_assign_added()
/ pci_dev_is_added() (also sketched at the end).  Are you planning on
using the count for something else in the future?  If not, using a
flag would be simpler and more economical memory-wise.  I'm also
worried about lbms_count overflowing.

Because there's hardware which signals an interrupt before actually
setting one of the two bits in the Link Status Register, I'm wondering
if it would make sense to poll the register a couple of times in the
irq handler (sketch at the end).  Obviously this is only an option if
the handler is running in thread context.  What was the maximum delay
you saw during testing before the LBMS bit was belatedly set?

If you don't poll for the LBMS bit, then you definitely should clear
it on unbind in case it contains a stale 1.  Or probably clear it in
any case.

Thanks,

Lukas
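
P.S.  To make the above more concrete, here are some rough, untested
sketches.  Anything whose name isn't already in your series (helper
names, flag names, retry counts, delays) is made up purely for
illustration.

First, the __pcie_retrain_link() idea:

        /*
         * Sketch only:  __pcie_retrain_link() would contain the current
         * body of pcie_retrain_link() minus the pcie_reset_lbms_count()
         * call.  bwctrl's speed-change path calls __pcie_retrain_link()
         * directly and thus never recursively acquires
         * pcie_bwctrl_lbms_rwsem, so pcie_bwctrl_setspeed_rwsem can go
         * away.  All other callers keep the existing behaviour through
         * the wrapper:
         */
        int pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
        {
                int rc = __pcie_retrain_link(pdev, use_lt);

                pcie_reset_lbms_count(pdev);

                return rc;
        }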
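
The priv_flags accessors, modelled on pci_dev_assign_added() /
pci_dev_is_added() in drivers/pci/pci.h (flag name and bit number
invented, the bit would need to be whatever is next free):

        #define PCI_LINK_LBMS_SEEN      3       /* made-up name and bit */

        static inline void pci_dev_assign_lbms_seen(struct pci_dev *dev)
        {
                set_bit(PCI_LINK_LBMS_SEEN, &dev->priv_flags);
        }

        static inline bool pci_dev_lbms_seen(const struct pci_dev *dev)
        {
                return test_bit(PCI_LINK_LBMS_SEEN, &dev->priv_flags);
        }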
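
For polling in a threaded handler, something along these lines
(handler name, retry count and delay are arbitrary; the tail of the
handler would be whatever bwctrl does today):

        static irqreturn_t pcie_bwnotif_irq_thread(int irq, void *context)
        {
                struct pci_dev *port = context;
                u16 lnksta;
                int i;

                /* Give slow hardware a few chances to set LBMS/LABS belatedly. */
                for (i = 0; i < 3; i++) {
                        pcie_capability_read_word(port, PCI_EXP_LNKSTA, &lnksta);
                        if (lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS))
                                break;
                        usleep_range(100, 200); /* only ok in thread context */
                }

                if (!(lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS)))
                        return IRQ_NONE;        /* irq is shared, not ours */

                /* Clear the bits we're about to act on (they're RW1C). */
                pcie_capability_write_word(port, PCI_EXP_LNKSTA,
                                           lnksta & (PCI_EXP_LNKSTA_LBMS |
                                                     PCI_EXP_LNKSTA_LABS));

                /* ... update the cached link speed / set the flag as before ... */

                return IRQ_HANDLED;
        }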
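
And clearing a stale LBMS on unbind (or unconditionally) should just
be a write-one-to-clear:

        /* e.g. at the end of the bwctrl remove callback */
        pcie_capability_write_word(port, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);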