On 08/01/2018 10:24 AM, Thomas Tai wrote:
On 08/01/2018 01:53 AM, gokul cg wrote:
Hi,
I see there is a basic design flaw. As the AER and PCI drivers are
independent modules, locally storing a pointer to any data structure
from the PCI linked list in the AER driver will create problems, as
there is no synchronization between the two.
https://elixir.bootlin.com/linux/v3.10.99/source/drivers/pci/pcie/aer/aerdrv_core.c#L701
Here 'struct aer_err_info *e_info' has a pointer to the pci dev, which
can be removed from the pci tree at any time.
I think this is the basic issue.
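To make the concern concrete, here is a minimal userspace sketch of the usual remedy for a stored pointer outliving its object: the holder takes a reference when it stores the pointer and drops it when done. All names here (fake_dev, fake_err_info, fake_bus_remove and so on) are hypothetical stand-ins, not kernel structures or APIs. In the kernel the analogous helpers are pci_dev_get() and pci_dev_put(). Whether the v3.10 AER code takes such a reference on every path is not established in this thread. The sketch is also not thread-safe, whereas the kernel uses atomic reference counts.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct fake_dev {
    int refcount;   /* number of holders keeping the object alive */
    int removed;    /* set once the device has left the bus list */
    int vendor;
};

/* Hypothetical stand-in for struct aer_err_info. */
struct fake_err_info {
    struct fake_dev *dev;
};

/* Allocate a device; the initial reference belongs to the bus list. */
static struct fake_dev *fake_dev_alloc(int vendor)
{
    struct fake_dev *d = calloc(1, sizeof(*d));

    if (!d)
        return NULL;
    d->refcount = 1;
    d->vendor = vendor;
    return d;
}

/* Take an extra reference (the role pci_dev_get() plays in the kernel). */
static struct fake_dev *fake_dev_get(struct fake_dev *d)
{
    if (d)
        d->refcount++;
    return d;
}

/* Drop a reference; returns 1 if this call freed the object. */
static int fake_dev_put(struct fake_dev *d)
{
    assert(d && d->refcount > 0);
    if (--d->refcount == 0) {
        free(d);
        return 1;
    }
    return 0;
}

/* Hot-unplug: the bus marks the device removed and drops its reference. */
static int fake_bus_remove(struct fake_dev *d)
{
    d->removed = 1;
    return fake_dev_put(d);
}

/* The error handler pins the device for as long as it stores the pointer. */
static void fake_err_info_add(struct fake_err_info *e, struct fake_dev *d)
{
    e->dev = fake_dev_get(d);
}

/* Done with the error record: unpin the device and clear the pointer. */
static int fake_err_info_release(struct fake_err_info *e)
{
    int freed = fake_dev_put(e->dev);

    e->dev = NULL;
    return freed;
}
```

With this pattern, a removal that races with error handling only drops the bus's reference. The memory stays valid until the error record releases its own reference, so the handler sees a removed but still allocated device rather than freed memory.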
Hi Gokul,
I am afraid I am having a hard time recreating your issue. The following
is the normal sequence. Did you see any hotplug message before the AER
message?
pcieport 0000:00:02.2: AER: Corrected error received: id=1130
pciehp 0000:11:06.0:pcie204: Slot(102): Link Down
pciehp 0000:11:06.0:pcie204: Slot(102): Link Down event ignored; already powering off
pcieport 0000:11:06.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=1130(Receiver ID)
pcieport 0000:11:06.0: device [111d:80b5] error status/mask=00000001/0000e000
pcieport 0000:11:06.0: [ 0] Receiver Error
As for the pci_dev being corrupted, maybe you can add
"slub_debug=FZP" to your kernel boot arguments, rerun your test, and
see if it finds anything. I am curious who corrupted the pci_dev in
the first place. I am not totally convinced that the problem is in the
AER code.
Cheers,
Thomas