On Sun, Jan 05, 2025 at 06:54:24PM +0200, Ilpo Järvinen wrote:
> Indeed, it certainly didn't occur to me while arranging the code the
> way it is that there are other sources for the same irq. However,
> there is a reason those lines were within the same critical section
> (I also realized it's not documented anywhere):
>
> As bwctrl has two operating modes, one with BW notifications and the
> other without them, there are races when switching between those
> modes during probe wrt. the call to the lbms counting accessor, and I
> reused those rw semaphores to prevent those races (the race fixes
> were noted only in a history bullet of the bwctrl series).

Could you add code comment(s) to document this?

I've respun the patch, but of course yesterday was a holiday in
Finland, so I'm hoping you get a chance to review the v2 patch today.

It seems pcie_bwctrl_setspeed_rwsem is only needed because
pcie_retrain_link() calls pcie_reset_lbms_count(), which would
recursively acquire pcie_bwctrl_lbms_rwsem.

There are only two callers of pcie_retrain_link(), so I'm wondering if
the invocation of pcie_reset_lbms_count() can be moved to them, thus
avoiding the recursive lock acquisition and allowing us to get rid of
pcie_bwctrl_setspeed_rwsem.  An alternative would be to have a
__pcie_retrain_link() helper which doesn't call pcie_reset_lbms_count()
(rough sketch at the end of this mail).

Right now there are no fewer than three locks used by bwctrl (the two
global rwsems plus the per-port mutex).  That doesn't look elegant and
makes it difficult to reason about the code, so simplifying the
locking would be desirable I think.

I'm also wondering if the IRQ handler really needs to run in hardirq
context.  Is there a reason it can't run in thread context?  Note that
CONFIG_PREEMPT_RT=y (as well as the "threadirqs" command line option)
causes the handler to run in thread context, so it must work properly
in that situation as well.

Another oddity that caught my eye is the counting of the interrupts.
It seems the only place where lbms_count is read is the
pcie_failed_link_retrain() quirk, and it only cares about the count
being non-zero.  So this could be a bit in pci_dev->priv_flags that's
accessed with set_bit() / test_bit(), similar to pci_dev_assign_added()
/ pci_dev_is_added() (also sketched at the end).  Are you planning on
using the count for something else in the future?  If not, using a
flag would be simpler and more economical memory-wise.  I'm also
worried about lbms_count overflowing.

Because there's hardware which signals an interrupt before actually
setting one of the two bits in the Link Status Register, I'm wondering
if it would make sense to poll the register a couple of times in the
irq handler (sketch at the end).  Obviously this is only an option if
the handler is running in thread context.  What was the maximum delay
you saw during testing before the LBMS bit was belatedly set?

If you don't poll for the LBMS bit, then you definitely should clear
it on unbind in case it contains a stale 1.  Or probably clear it in
any case.

Thanks,

Lukas
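
P.S.  To make the above more concrete, here are some rough, untested
sketches.  Anything whose name isn't already in your series (helper
names, flag names, retry counts, delays) is made up purely for
illustration.

First, the __pcie_retrain_link() idea:

        /*
         * Sketch only:  __pcie_retrain_link() would contain the current
         * body of pcie_retrain_link() minus the pcie_reset_lbms_count()
         * call.  bwctrl's speed-change path calls __pcie_retrain_link()
         * directly and thus never recursively acquires
         * pcie_bwctrl_lbms_rwsem, so pcie_bwctrl_setspeed_rwsem can go
         * away.  All other callers keep the existing behaviour through
         * the wrapper:
         */
        int pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
        {
                int rc = __pcie_retrain_link(pdev, use_lt);

                pcie_reset_lbms_count(pdev);

                return rc;
        }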
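
The priv_flags accessors, modelled on pci_dev_assign_added() /
pci_dev_is_added() in drivers/pci/pci.h (flag name and bit number
invented, the bit would need to be whatever is next free):

        #define PCI_LINK_LBMS_SEEN      3       /* made-up name and bit */

        static inline void pci_dev_assign_lbms_seen(struct pci_dev *dev)
        {
                set_bit(PCI_LINK_LBMS_SEEN, &dev->priv_flags);
        }

        static inline bool pci_dev_lbms_seen(const struct pci_dev *dev)
        {
                return test_bit(PCI_LINK_LBMS_SEEN, &dev->priv_flags);
        }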
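
For polling in a threaded handler, something along these lines
(handler name, retry count and delay are arbitrary; the tail of the
handler would be whatever bwctrl does today):

        static irqreturn_t pcie_bwnotif_irq_thread(int irq, void *context)
        {
                struct pci_dev *port = context;
                u16 lnksta;
                int i;

                /* Give slow hardware a few chances to set LBMS/LABS belatedly. */
                for (i = 0; i < 3; i++) {
                        pcie_capability_read_word(port, PCI_EXP_LNKSTA, &lnksta);
                        if (lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS))
                                break;
                        usleep_range(100, 200); /* only ok in thread context */
                }

                if (!(lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS)))
                        return IRQ_NONE;        /* irq is shared, not ours */

                /* Clear the bits we're about to act on (they're RW1C). */
                pcie_capability_write_word(port, PCI_EXP_LNKSTA,
                                           lnksta & (PCI_EXP_LNKSTA_LBMS |
                                                     PCI_EXP_LNKSTA_LABS));

                /* ... update the cached link speed / set the flag as before ... */

                return IRQ_HANDLED;
        }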
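
And clearing a stale LBMS on unbind (or unconditionally) should just
be a write-one-to-clear:

        /* e.g. at the end of the bwctrl remove callback */
        pcie_capability_write_word(port, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);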