On Wed, Jun 30, 2021 at 9:30 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Wed, Jun 30, 2021 at 09:59:58PM +0200, Javier Martinez Canillas wrote:
> > On 6/30/21 8:59 PM, Bjorn Helgaas wrote:
> > > [+cc Michal, Jingoo, Thierry, Jonathan]
> >
> > [snip]
> >
> > >
> > > I think the above commit log is perfectly accurate, but all the
> > > details might suggest that this is something specific to rockchip or
> > > CONFIG_DEBUG_SHIRQ, which it isn't, and they might obscure the
> > > fundamental problem, which is actually very simple: we registered IRQ
> > > handlers before we were ready for them to be called.
> > >
> > > I propose the following commit log in the hope that it would help
> > > other driver authors to make similar fixes:
> > >
> > >   PCI: rockchip: Register IRQ handlers after device and data are ready
> > >
> > >   An IRQ handler may be called at any time after it is registered, so
> > >   anything it relies on must be ready before registration.
> > >
> > >   rockchip_pcie_subsys_irq_handler() and rockchip_pcie_client_irq_handler()
> > >   read registers in the PCIe controller, but we registered them before
> > >   turning on clocks to the controller.  If either is called before the clocks
> > >   are turned on, the register reads fail and the machine hangs.
> > >
> > >   Similarly, rockchip_pcie_legacy_int_handler() uses rockchip->irq_domain,
> > >   but we installed it before initializing irq_domain.
> > >
> > >   Register IRQ handlers after their data structures are initialized and
> > >   clocks are enabled.
> > >
> > > If this is inaccurate or omits something important, let me know.  I
> > > can make any updates locally.
> > >
> >
> > I think your description is accurate and agree that the commit message may
> > be misleading. As you said, this is a general problem and the fact that an
> > IRQ is shared and CONFIG_DEBUG_SHIRQ fires a spurious interrupt just make
> > the assumptions in the driver to fall apart.
> >
> > But maybe you can also add a paragraph that mentions the CONFIG_DEBUG_SHIRQ
> > option and shared interrupts? That way, other driver authors could know that
> > by enabling this an underlying problem might be exposed for them to fix.
>
> Good idea, thanks!  I added this; is it something like what you had in
> mind?
>
>   Found by enabling CONFIG_DEBUG_SHIRQ, which calls the IRQ handler when it
>   is being unregistered.  An error during the probe path might cause this
>   unregistration and IRQ handler execution before the device or data
>   structure init has finished.

Would it make sense to enable CONFIG_DEBUG_SHIRQ in defconfig to
better pick up these problems?

Peter
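
For readers following along, the ordering problem described in the quoted commit log can be sketched with a toy user-space model. This is only an illustration under stated assumptions, not the rockchip driver code: `struct toy_dev`, `toy_request_irq()`, `toy_free_irq()`, and both probe functions are hypothetical names, and `toy_free_irq()` merely imitates the CONFIG_DEBUG_SHIRQ behaviour mentioned above (the handler is invoked once while it is being unregistered).

```c
#include <stdbool.h>

/* Hypothetical stand-in for a device; not the real rockchip structures. */
struct toy_dev {
	bool clocks_on;      /* "register reads" are only safe when true */
	bool irq_registered; /* handler currently installed */
	int unsafe_calls;    /* handler runs that touched unclocked hardware */
};

/* Toy handler: pretends to read a controller register. */
void toy_irq_handler(struct toy_dev *dev)
{
	if (!dev->clocks_on)
		dev->unsafe_calls++; /* on real hardware this could hang */
}

/* Toy registration: from here on the handler may be called at any time. */
void toy_request_irq(struct toy_dev *dev)
{
	dev->irq_registered = true;
}

/* Imitates the debug option: call the handler once during unregistration. */
void toy_free_irq(struct toy_dev *dev)
{
	if (dev->irq_registered) {
		toy_irq_handler(dev);
		dev->irq_registered = false;
	}
}

/* Problematic order: handler is registered before clocks are enabled. */
int probe_irq_first(struct toy_dev *dev, bool fail_midway)
{
	toy_request_irq(dev);
	if (fail_midway) {
		toy_free_irq(dev); /* error path: handler runs with clocks off */
		return -1;
	}
	dev->clocks_on = true;
	return 0;
}

/* Order proposed in the commit log: clocks and data first, then the IRQ. */
int probe_clocks_first(struct toy_dev *dev, bool fail_late)
{
	dev->clocks_on = true;
	toy_request_irq(dev);
	if (fail_late) {
		toy_free_irq(dev); /* handler runs, but clocks are still on */
		dev->clocks_on = false;
		return -1;
	}
	return 0;
}
```

In this model, a failed `probe_irq_first()` on a zero-initialized `struct toy_dev` leaves `unsafe_calls` at 1, while a failed `probe_clocks_first()` leaves it at 0, because the unwind path frees the IRQ before the clocks are turned off.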