We were not sure about the return just a few lines up so we did not add the 2 lines. I will try what you suggested to better understand why we are getting the extra interrupt. I am not as familiar with submitting a "proper patch" and ask that you do it if you would be so kind. -----Original Message----- From: Lukas Wunner <lukas@xxxxxxxxx> Sent: Monday, March 16, 2020 1:20 PM To: Hoyer, David <David.Hoyer@xxxxxxxxxx> Cc: linux-pci@xxxxxxxxxxxxxxx; Keith Busch <kbusch@xxxxxxxxxx> Subject: Re: Kernel hangs when powering up/down drive using sysfs NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe. On Sat, Mar 14, 2020 at 02:19:44PM +0000, Hoyer, David wrote: > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -637,6 +637,8 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) > events = atomic_xchg(&ctrl->pending_events, 0); > if (!events) { > pci_config_pm_runtime_put(pdev); > + ctrl->ist_running = false; > + wake_up(&ctrl->requester); > return IRQ_NONE; > } Thanks David for the report and sorry for the breakage. The above LGTM, please submit it as a proper patch and feel free to add my Reviewed-by. Please add the same two lines before the "return ret" a little further up in the function. If it's too cumbersome for you to submit a proper patch I can do it for you. > We've instrumented the code and we do see that pciehp_ist() runs > twice, once exiting with IRQ_HANDLED and then again with IRQ_NONE. > We believe that is due to the timing differences. Adding debug in > here changes the timings enough that the hang goes away, so we are > having troubles proving this 100% at the moment. But just based on > code inspection, if pciehp_ist() exits with the IRQ_NONE case, then > nothing will ever set ist_running=false until a subsequent hotplug > event happens that causes the IRQ_HANDLED case to run. (We were able > to prove that will cause things to "unhang" and progress at that point > - if you're hung and you remove a drive, the slot status change will > then unstick things.) The question is, why is pciehp_ist() run once more. Most likely because another event is signaled from the slot. Try adding a printk() at the top of pciehp_ist() which emits ctrl->pending_events to understand what's going on. Thanks, Lukas