RE: Kernel hangs when powering up/down drive using sysfs

I ran the suggested experiment. The first interrupt reports a non-zero pending_events value (0x8 on power up, 0x10000 on power down, matching the logs below). The second interrupt always reports zero. So it sounds like the second interrupt is one indicating that the work is complete.

Power up:
Mar 16 21:10:15 eos-a kernel: pending events x8
Mar 16 21:10:15 eos-a kernel: pciehp 0000:66:09.0:pcie204: Slot(9): Card present
Mar 16 21:10:15 eos-a kernel: pci 0000:6f:00.0: Max Payload Size set to 256 (was 128, max 256)
Mar 16 21:10:15 eos-a kernel: iommu: Adding device 0000:6f:00.0 to group 70
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: BAR 13: no space for [io  size 0x1000]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: BAR 13: failed to assign [io  size 0x1000]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: BAR 13: no space for [io  size 0x1000]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: BAR 13: failed to assign [io  size 0x1000]
Mar 16 21:10:15 eos-a kernel: pci 0000:6f:00.0: BAR 0: assigned [mem 0xe0b00000-0xe0b03fff 64bit]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0: PCI bridge to [bus 6f]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0:   bridge window [mem 0xe0b00000-0xe0bfffff]
Mar 16 21:10:15 eos-a kernel: pcieport 0000:66:09.0:   bridge window [mem 0x3ac00c000000-0x3ac00dffffff 64bit pref]
Mar 16 21:10:15 eos-a kernel: pending events x0
Mar 16 21:10:16 eos-a kernel: vfio-pci 0000:6f:00.0: enabling device (0100 -> 0102)
Mar 16 21:10:16 eos-a kernel: pcieport 0000:64:00.0: can't derive routing for PCI INT A
Mar 16 21:10:16 eos-a kernel: vfio-pci 0000:6f:00.0: PCI INT A: not connected
Mar 16 21:10:16 eos-a kernel: vfio_ecap_init: 0000:6f:00.0 hiding ecap 0x19@0x178

Power down:
Mar 16 21:10:47 eos-a kernel: pending events x10000
Mar 16 21:10:47 eos-a kernel: iommu: Removing device 0000:6f:00.0 from group 70
Mar 16 21:10:48 eos-a kernel: pending events x0
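
For reading the "pending events" values in the logs above, here is a small userspace decoder sketch. The bit values are assumptions recalled from include/uapi/linux/pci_regs.h (PCI_EXP_SLTSTA_PDC) and drivers/pci/hotplug/pciehp.h (DISABLE_SLOT, RERUN_ISR) and should be checked against the tree in use. describe_events() is a hypothetical helper, not a kernel function.

```c
#include <stdio.h>
#include <string.h>

/* Assumed values; verify against your kernel headers. */
#define PCI_EXP_SLTSTA_PDC 0x0008u      /* Presence Detect Changed */
#define DISABLE_SLOT       (1u << 16)   /* pciehp-internal request bit */
#define RERUN_ISR          (1u << 17)   /* pciehp-internal request bit */

/* Hypothetical helper: render a pending_events value as bit names. */
static const char *describe_events(unsigned int events, char *buf, size_t len)
{
	static const struct { unsigned int bit; const char *name; } known[] = {
		{ PCI_EXP_SLTSTA_PDC, "PDC" },
		{ DISABLE_SLOT,       "DISABLE_SLOT" },
		{ RERUN_ISR,          "RERUN_ISR" },
	};
	size_t i, used = 0;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	if (!events) {
		snprintf(buf, len, "none");
		return buf;
	}
	for (i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
		int n;

		if (!(events & known[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", known[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return buf;	/* output truncated, stop here */
		used += (size_t)n;
		events &= ~known[i].bit;
	}
	/* Any bits not in the table are shown as raw hex. */
	if (events)
		snprintf(buf + used, len - used, "%s0x%x",
			 used ? "|" : "", events);
	return buf;
}
```

Under those assumed definitions, x8 decodes to PDC and x10000 to DISABLE_SLOT, which would fit a sysfs power up and power down request respectively.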

-----Original Message-----
From: linux-pci-owner@xxxxxxxxxxxxxxx <linux-pci-owner@xxxxxxxxxxxxxxx> On Behalf Of Hoyer, David
Sent: Monday, March 16, 2020 1:26 PM
To: Lukas Wunner <lukas@xxxxxxxxx>
Cc: linux-pci@xxxxxxxxxxxxxxx; Keith Busch <kbusch@xxxxxxxxxx>
Subject: RE: Kernel hangs when powering up/down drive using sysfs

NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.




We were not sure about the "return ret" a few lines further up, so we did not add the two lines there.
I will try what you suggested to better understand why we are getting the extra interrupt.

I am not as familiar with submitting a "proper patch" and ask that you do it if you would be so kind.

-----Original Message-----
From: Lukas Wunner <lukas@xxxxxxxxx>
Sent: Monday, March 16, 2020 1:20 PM
To: Hoyer, David <David.Hoyer@xxxxxxxxxx>
Cc: linux-pci@xxxxxxxxxxxxxxx; Keith Busch <kbusch@xxxxxxxxxx>
Subject: Re: Kernel hangs when powering up/down drive using sysfs

On Sat, Mar 14, 2020 at 02:19:44PM +0000, Hoyer, David wrote:
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -637,6 +637,8 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
>         events = atomic_xchg(&ctrl->pending_events, 0);
>         if (!events) {
>                 pci_config_pm_runtime_put(pdev);
> +               ctrl->ist_running = false;
> +               wake_up(&ctrl->requester);
>                 return IRQ_NONE;
>         }

Thanks David for the report and sorry for the breakage.

The above LGTM, please submit it as a proper patch and feel free to add my Reviewed-by.  Please add the same two lines before the "return ret" a little further up in the function.

If it's too cumbersome for you to submit a proper patch I can do it for you.


> We've instrumented the code and we do see that pciehp_ist() runs 
> twice, once exiting with IRQ_HANDLED and then again with IRQ_NONE.
> We believe that is due to the timing differences.  Adding debug in 
> here changes the timings enough that the hang goes away, so we are 
> having troubles proving this 100% at the moment.  But just based on 
> code inspection, if pciehp_ist() exits with the IRQ_NONE case, then 
> nothing will ever set ist_running=false until a subsequent hotplug 
> event happens that causes the IRQ_HANDLED case to run.  (We were able 
> to prove that will cause things to "unhang" and progress at that point
> - if you're hung and you remove a drive, the slot status change will 
> then unstick things.)
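
The sequence described above can be illustrated with a minimal single-threaded userspace model. It is only a sketch under stated assumptions, not the driver code: it assumes ist_running is raised when the thread function starts, and that the sysfs requester waits until pending_events is zero and ist_running is false. All names prefixed with model_ are hypothetical. It captures the unlucky ordering in which the requester evaluates its condition only after the second, event-less run has begun.

```c
#include <stdbool.h>

enum model_irqreturn { MODEL_IRQ_NONE, MODEL_IRQ_HANDLED };

struct model_ctrl {
	unsigned int pending_events;
	bool ist_running;
};

/* Simplified stand-in for the interrupt thread. */
static enum model_irqreturn model_ist(struct model_ctrl *ctrl, bool with_fix)
{
	unsigned int events;

	ctrl->ist_running = true;	/* assumption: set on entry */

	/* Stand-in for atomic_xchg(&ctrl->pending_events, 0). */
	events = ctrl->pending_events;
	ctrl->pending_events = 0;

	if (!events) {
		/* The two proposed lines: clear the flag (and wake the waiter). */
		if (with_fix)
			ctrl->ist_running = false;
		return MODEL_IRQ_NONE;
	}

	/* ... events would be handled here ... */

	ctrl->ist_running = false;
	return MODEL_IRQ_HANDLED;
}

/* Assumed wait condition of the sysfs requester. */
static bool model_requester_may_proceed(const struct model_ctrl *ctrl)
{
	return !ctrl->pending_events && !ctrl->ist_running;
}

/* One request followed by a spurious second run with nothing pending. */
static bool model_request_completes(unsigned int event, bool with_fix)
{
	struct model_ctrl ctrl = { .pending_events = event, .ist_running = false };

	model_ist(&ctrl, with_fix);	/* first run: handles the event */
	model_ist(&ctrl, with_fix);	/* second run: nothing pending */
	return model_requester_may_proceed(&ctrl);
}
```

In this model, model_request_completes(0x8, false) returns false because the flag stays set after the early return, while model_request_completes(0x8, true) returns true.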

The question is why pciehp_ist() is run once more.  Most likely because another event is signaled from the slot.  Try adding a printk() at the top of pciehp_ist() which emits ctrl->pending_events to understand what's going on.

Thanks,

Lukas



