On 1/8/21 10:38 AM, Hinko Kocevar wrote:
On 1/7/21 10:42 PM, Keith Busch wrote:
On Tue, Jan 05, 2021 at 11:07:23PM +0000, Kelley, Sean V wrote:
On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@xxxxxxxxxx> wrote:
On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
On 1/5/21 3:21 PM, Hinko Kocevar wrote:
On 1/5/21 12:02 AM, Keith Busch wrote:
Changes from v1:
Added received Acks
Split the kernel print identifying the port type being reset.
Added a patch for the portdrv to ensure the slot_reset happens without
relying on a downstream device driver.
Keith Busch (5):
PCI/ERR: Clear status of the reporting device
PCI/AER: Actually get the root port
PCI/ERR: Retain status from error notification
PCI/AER: Specify the type of port that was reset
PCI/portdrv: Report reset for frozen channel
I removed patch 5/5 from this series, and after testing again, my setup
recovers from the injected error, same as observed with the v1 series.
Thanks for the notice. Unfortunately that seems even more confusing to
me right now. That patch shouldn't do anything to the devices or the
driver's state; it just ensures a recovery path that was supposed to
happen anyway. The stack trace says restoring the config space completed
partially before getting stuck at the virtual channel capability, at
which point it appears to be in an infinite loop. I'll try to look into
it. The emulated devices I test with don't have the VC cap but I might
have real devices that do.
I'm not seeing the error either with v2 when testing with aer-inject
using RCECs and an associated RCiEP.
Thank you, yes, I'm not seeing a problem on my end either. The
sighting is still concerning though, so I'll keep looking. I may have to
ask Hinko to try a debug patch to help narrow down where things have
gone wrong, if that's okay.
Sure. I'm willing to help out and debug this on my side as well. Let me
know what you need me to do!
Testing this series a bit more (without patch 5/5) resulted in the same CPU
lockup:
watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:128]
as I initially reported with patch 5/5 included.
It seems less frequent, though. For example, after a reboot the lockup is
not observed and the recovery process is successful, whereas with 5/5
also applied every recovery resulted in a CPU lockup.
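
To put a number on "how frequent" across repeated injections, something
like the following could count soft-lockup reports in a saved kernel log.
This is only a sketch: the regular expression is derived from the single
watchdog line quoted above and may not match other kernel versions, and
the function name find_soft_lockups is hypothetical.

```python
import re

# Pattern derived from the one watchdog line quoted in this thread.
# Assumption: other kernels may word or format this message differently.
SOFT_LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one (cpu, seconds, task, pid) tuple per soft-lockup report."""
    hits = []
    for line in log_text.splitlines():
        m = SOFT_LOCKUP_RE.search(line)
        if m:
            hits.append(
                (int(m["cpu"]), int(m["secs"]), m["task"], int(m["pid"]))
            )
    return hits
```

Feeding it the log captured after each injection run and comparing the
number of reports with and without 5/5 applied would make the difference
in frequency easier to state.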