RE: Having problems resetting a PCI device

> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@xxxxxxxxxx]
> Sent: Thursday, March 30, 2017 5:42 PM
> To: Zytaruk, Kelly
> Cc: linux-pci@xxxxxxxxxxxxxxx; Alex Williamson
> Subject: Re: Having problems resetting a PCI device
> 
> On Wed, Mar 29, 2017 at 09:41:48PM +0000, Zytaruk, Kelly wrote:
> >
> >
> > > -----Original Message-----
> > > From: Bjorn Helgaas [mailto:helgaas@xxxxxxxxxx]
> > > Sent: Wednesday, March 29, 2017 4:55 PM
> > > To: Zytaruk, Kelly
> > > Cc: linux-pci@xxxxxxxxxxxxxxx; Alex Williamson
> > > Subject: Re: Having problems resetting a PCI device
> > >
> > > Hi Kelly,
> > >
> > > On Wed, Mar 29, 2017 at 08:03:33PM +0000, Zytaruk, Kelly wrote:
> > > > I have a PCI device that is sitting behind a bridge.
> > > >
> > > > Under certain reproducible circumstances the PCI device will
> > > > become inactive. Reading the PCI config space returns all 0xFFFFFFFF.
> > > >
> > > > The bridge appears to still be functional. Reading the status from
> > > > the bridge I see a Fatal Error due to a Surprise Down event.
> > >
> > > Just to be specific, is this the "Surprise Down Error" in the AER
> > > uncorrectable error status register?  "lspci -vv" probably decodes
> > > all that for you.
> > >
> > > > I am trying to figure out how to bring the device back online.
> > > >
> > > > I tried toggling the secondary bus reset bit of the Bridge Control
> > > > Register but it doesn't appear to make any difference. I still see
> > > > 0xFFFFFFFF in the device config space.
> > >
> > > Are you calling pci_reset_function() or doing this by hand?
> > > pci_reset_function() tries several different strategies, one of
> > > which is toggling the secondary bus reset bit.
> >
> > I just read the documentation for the call and this could be a problem:
> > "The PCI device must be responsive to PCI config space in order to use
> > this function."
> >
> > In my case reading PCI config space returns all 0xFFFFFFFF.
> 
> I think Surprise Down means the link is down, so you won't be able to reach the
> device at all until it gets reset.
> 
> But the secondary bus reset is done by the switch port immediately upstream
> from the device, so that should still work.  If the device still doesn't work after
> doing a secondary bus reset, maybe there's a device defect related to reset.
> 
> That port (a Root Port or Switch Downstream Port) is probably where the
> Surprise Down error was logged.  If you have CONFIG_PCIEAER turned on, I
> think the kernel should log some stuff in dmesg, hopefully including the error
> type and something that identifies the link.  Do you see any of that?

I am not seeing anything in the dmesg log.

> 
> If you don't have CONFIG_PCIEAER turned on, you should be able to use lspci to
> look at what's logged in the AER capability.  Unfortunately, lspci doesn't know
> how to decode everything, but you can use "lspci -xxxx" to look at it and decode
> things manually.
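
For decoding the "lspci -xxxx" dump by hand, a minimal Python sketch along these lines may help. It assumes the 4 KB config space of the bridge has already been read into a bytes object (for example from the port's sysfs "config" file, which normally needs root to read past the first 64 bytes). The function names are hypothetical; the AER capability ID (0x0001), the status register offset (0x04) and the bit positions follow the PCIe spec as mirrored in the kernel's pci_regs.h.

```python
import struct

PCI_EXT_CAP_START = 0x100      # extended capabilities begin here
PCI_EXT_CAP_ID_AER = 0x0001    # Advanced Error Reporting capability ID
PCI_ERR_UNCOR_STATUS = 0x04    # offset of Uncorrectable Error Status in AER

# Bit position -> name, for the AER Uncorrectable Error Status register.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def find_ext_cap(cfg, cap_id):
    """Walk the extended capability list; return the offset of cap_id or None."""
    offset = PCI_EXT_CAP_START
    seen = set()  # guards against a looping (corrupt) capability list
    while offset >= PCI_EXT_CAP_START and offset + 4 <= len(cfg) and offset not in seen:
        seen.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0, 0xFFFFFFFF):
            return None  # no capability here, or the device is not responding
        if header & 0xFFFF == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # bits 31:20 hold the next offset
    return None


def decode_aer_uncor(cfg):
    """Return the names of the set uncorrectable error bits, or None if no AER."""
    aer = find_ext_cap(cfg, PCI_EXT_CAP_ID_AER)
    if aer is None or aer + PCI_ERR_UNCOR_STATUS + 4 > len(cfg):
        return None
    (status,) = struct.unpack_from("<I", cfg, aer + PCI_ERR_UNCOR_STATUS)
    return [name for bit, name in sorted(UNCOR_BITS.items()) if status & (1 << bit)]
```

A result containing "Surprise Down Error" for the port directly upstream of the device would match the status described earlier in the thread.
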
> 
> > > > I provided a pci_error_handler but the error_detected() function
> > > > is not getting called.
> > >
> > > Do you have CONFIG_PCIEAER turned on?  I would naively expect AER to
> > > log something and call your error_detected() function if this error
> > > occurs (but I haven't looked at the code for a long time).
> 
> 
> > > > Given that these two methods are not helping me, what other
> > > > choices do I have to either reset the PCI device or hot-plug the
> > > > device from a kernel driver, or some other method of bringing the
> > > > device back to life?
> 
> You should be able to "echo 1 > /sys/bus/pci/devices/.../remove" to hot-unplug
> the device, then "echo 1 > /sys/bus/pci/rescan" to rediscover it.

I tried "echo 1 > remove" after the hang and it hung the hypervisor.  The Xen log showed a fault followed by a reboot about 5 seconds later.

I don't recall the exact message, but the last entry on the stack had something to do with restoring MSI interrupts just before the reboot.
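
For reference, the remove/rescan sequence suggested above can be scripted roughly as in the sketch below. It only writes "1" to the two sysfs nodes named in the suggestion; the function name and the sysfs_root parameter are hypothetical, and on the setup described here the remove step is what triggered the hypervisor fault, so this is an illustration of the sequence rather than a known-good recovery path.

```python
from pathlib import Path


def remove_and_rescan(bdf, sysfs_root="/sys"):
    """Hot-unplug the device at `bdf` via sysfs, then rescan the PCI bus.

    `bdf` is the device address as it appears under /sys/bus/pci/devices.
    `sysfs_root` exists only so the function can be pointed at a fake tree.
    Returns the two paths that were written, in order.
    """
    root = Path(sysfs_root)
    remove_node = root / "bus" / "pci" / "devices" / bdf / "remove"
    rescan_node = root / "bus" / "pci" / "rescan"

    if not remove_node.exists():
        raise FileNotFoundError(f"no such PCI device node: {remove_node}")

    # Equivalent to: echo 1 > /sys/bus/pci/devices/<bdf>/remove
    remove_node.write_text("1")
    # Equivalent to: echo 1 > /sys/bus/pci/rescan
    rescan_node.write_text("1")
    return [str(remove_node), str(rescan_node)]
```

Running it against the real /sys requires root.
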

> 
> Bjorn