Re: [PATCH for-linus] PCI/bwctrl: Fix NULL pointer deref on unbind and bind

On Tue, 7 Jan 2025, Lukas Wunner wrote:

> On Sun, Jan 05, 2025 at 06:54:24PM +0200, Ilpo Järvinen wrote:
> > Indeed, it certainly didn't occur to me while arranging the code the
> > way it is that there are other sources for the same irq. However,
> > there is a reason those lines were within the same critical section
> > (I also realized it's not documented anywhere):
> > 
> > As bwctrl has two operating modes, one with BW notifications and the
> > other without them, there are races when switching between those modes
> > during probe wrt. the call to the lbms counting accessor, and I reused
> > those rw semaphores to prevent those races (the race fixes were noted
> > only in a history bullet of the bwctrl series).
> 
> Could you add code comment(s) to document this?

Sure, I'll do that once I've been able to clear the holiday-induced
logjam on the pdx86 maintainership front. :-) (For now, I've added a
bullet to my todo list so I don't forget it.)

> I've respun the patch, but of course yesterday was a holiday in Finland.
> So I'm hoping you get a chance to review the v2 patch today.

Done.

> It seems pcie_bwctrl_setspeed_rwsem is only needed because
> pcie_retrain_link() calls pcie_reset_lbms_count(), which
> would recursively acquire pcie_bwctrl_lbms_rwsem.
>
> There are only two callers of pcie_retrain_link(), so I'm
> wondering if the invocation of pcie_reset_lbms_count()
> can be moved to them, thus avoiding the recursive lock
> acquisition and allowing to get rid of pcie_bwctrl_setspeed_rwsem.
>
> An alternative would be to have a __pcie_retrain_link() helper
> which doesn't call pcie_reset_lbms_count().
>
> Right now there are no less than three locks used by bwctrl
> (the two global rwsem plus the per-port mutex).  That doesn't
> look elegant and makes it difficult to reason about the code,
> so simplifying the locking would be desirable I think.

I considered the __pcie_retrain_link() variant, but it felt like locking
details internal to bwctrl would leak into other parts of the code, so I
had some level of dislike towards that solution. I'm not strictly
against it, though.

It would seem the most straightforward approach, as it wouldn't force
moving the LBMS reset to the callers, which feels even less
elegant/obvious. I previously chose to keep that complexity internal to
bwctrl, but if you think adding __pcie_retrain_link() would be more
elegant, we can certainly move in that direction.
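
Concretely, something like this, where __pcie_retrain_link() would be
the current body of pcie_retrain_link() minus the LBMS reset (an
untested sketch; I haven't thought through whether the reset should be
skipped on failure):

	int pcie_retrain_link(struct pci_dev *link, bool use_lt)
	{
		int rc = __pcie_retrain_link(link, use_lt);

		/*
		 * Reset the LBMS count out here so that bwctrl's set
		 * speed path can call __pcie_retrain_link() while
		 * already holding pcie_bwctrl_lbms_rwsem.
		 */
		pcie_reset_lbms_count(link);

		return rc;
	}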

> I'm also wondering if the IRQ handler really needs to run in
> hardirq context.  Is there a reason it can't run in thread
> context?  Note that CONFIG_PREEMPT_RT=y (as well as the
> "threadirqs" command line option) cause the handler to be run
> in thread context, so it must work properly in that situation
> as well.

If thread context would work now, why was the fix in commit 3e82a7f9031f
("PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked") needed
(the commit is from the bwnotif era)? What has changed since that fix?

I'm open to your suggestion, but the existence of that fix is keeping me
back. I just don't understand why it would work now when it didn't back
then.
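
To recap the shape of that fix (from memory, so the details may be off):
the hardirq handler exists to ack the shared, potentially
level-triggered interrupt before waking the thread:

	static irqreturn_t pcie_bwnotif_irq(int irq, void *context)
	{
		struct pcie_device *srv = context;
		struct pci_dev *port = srv->port;
		u16 lnksta;

		pcie_capability_read_word(port, PCI_EXP_LNKSTA, &lnksta);
		lnksta &= PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS;
		if (!lnksta)
			return IRQ_NONE;	/* Shared IRQ, not ours */

		/* The status bits are W1C, so this deasserts the level */
		pcie_capability_write_word(port, PCI_EXP_LNKSTA, lnksta);

		return IRQ_WAKE_THREAD;
	}

Without that ack in hardirq context, a level-triggered line would keep
asserting until the thread got to run, which is, as I understand it,
what that commit fixed.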

> Another oddity that caught my eye is the counting of the
> interrupts.  It seems the only place where lbms_count is read
> is the pcie_failed_link_retrain() quirk, and it only cares
> about the count being non-zero.  So this could be a bit in
> pci_dev->priv_flags that's accessed with set_bit() / test_bit()
> similar to pci_dev_assign_added() / pci_dev_is_added().
> 
> Are you planning on using the count for something else in the
> future?  If not, using a flag would be simpler and more economical
> memory-wise.

Somebody requested having the count exposed, for troubleshooting HW
problems (IIRC). It was asked of me privately when I posted one of the
early versions of the bwctrl series (so quite a long time ago). I've
just not yet created the change to expose it under sysfs.
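
In case it helps, I had roughly something like this in mind (the
attribute name is made up, and pcie_lbms_count() here stands for
whatever accessor ends up being exported for reading the count):

	static ssize_t lbms_count_show(struct device *dev,
				       struct device_attribute *attr,
				       char *buf)
	{
		struct pci_dev *port = to_pci_dev(dev);

		return sysfs_emit(buf, "%u\n", pcie_lbms_count(port));
	}
	static DEVICE_ATTR_RO(lbms_count);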

> I'm also worried about the lbms_count overflowing.

Should I perhaps simply do pci_warn() if it happens?
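
That is, saturate rather than wrap, along these lines (the helper name
is made up and I'm handwaving over where the counter actually lives):

	static void pcie_lbms_count_inc(struct pci_dev *port, atomic_t *cnt)
	{
		/* Saturate at INT_MAX instead of wrapping back to zero */
		if (!atomic_add_unless(cnt, 1, INT_MAX))
			pci_warn(port, "LBMS count saturated\n");
	}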

> Because there's hardware which signals an interrupt before actually
> setting one of the two bits in the Link Status Register, I'm
> wondering if it would make sense to poll the register a couple
> of times in the irq handler.  Obviously this is only an option
> if the handler is running in thread context.  What was the maximum
> time you saw during testing that it took to set the LBMS bit belatedly?

Is there some misunderstanding here between us? I don't think I've
noticed delayed LBMS assertion. What I saw was the new Link Speed not
yet updated when Link Training was already 0. In that case, the Link
Status register was read inside the handler, so I'd assume LBMS was set
to actually trigger the interrupt and thus was not set belatedly.

I only recall testing with reading the value again inside the set speed
functions, and the Link Speed was always correct by then. I might have
also tried polling it inside the handler, but I'm sorry, I don't recall
anymore whether I did and what the end result was.
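
The re-read in the set speed functions looked roughly like this, i.e.
waiting for the Current Link Speed field itself to change, because LT
could already read 0 while the speed field was still stale (the helper
name and the bounds are made up):

	static u16 pcie_bwctrl_reread_speed(struct pci_dev *port,
					    u16 old_speed)
	{
		u16 lnksta;
		int i;

		for (i = 0; i < 10; i++) {
			pcie_capability_read_word(port, PCI_EXP_LNKSTA,
						  &lnksta);
			if ((lnksta & PCI_EXP_LNKSTA_CLS) != old_speed)
				break;
			usleep_range(100, 200);
		}

		return lnksta & PCI_EXP_LNKSTA_CLS;
	}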

> If you don't poll for the LBMS bit, then you definitely should clear
> it on unbind in case it contains a stale 1.  Or probably clear it in
> any case.
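
For reference, the clearing you suggest would be along these lines in
the unbind path (untested):

	/* Discard a stale LBMS left behind while bwctrl was unbound */
	pcie_capability_write_word(port, PCI_EXP_LNKSTA,
				   PCI_EXP_LNKSTA_LBMS);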



-- 
 i.
