I should give a bit more information about the system I'm running this on,
as it is not a regular PC/server/laptop. It is a modular MicroTCA system
with a single CPU and MCH. The MCH and CPU are separate cards, as are the
other processing cards (AMCs), which link up to the CPU through the MCH
PEX8748 switch. I can power each card individually or perform a complete
system power cycle. The normal power-up sequence is: MCH, AMCs, CPU. The CPU
is powered 30 sec after all other cards so that their PCIe links are up and
ready for Linux.
All buses below the CPU-side port 02:01.0 are on the MCH PEX8748 switch:
[dev@bd-cpu18 ~]$ sudo /usr/local/bin/pcicrawler -t
00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
│ ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
│ │ └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
│ │ ├─04:01.0 downstream_port, slot 4, power: Off
│ │ ├─04:03.0 downstream_port, slot 3, power: Off
│ │ ├─04:08.0 downstream_port, slot 5, power: Off
│ │ ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
│ │ │ └─08:00.0 endpoint, Xilinx Corporation (10ee), device 8034
│ │ └─04:12.0 downstream_port, slot 1, power: Off
│ ├─02:02.0 downstream_port, slot 2
│ ├─02:08.0 downstream_port, slot 8
│ ├─02:09.0 downstream_port, slot 9, power: Off
│ └─02:0a.0 downstream_port, slot 10
├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
└─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
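
To compare the tree between a good and a bad boot, something like the
following Python sketch could be used. It assumes the `pcicrawler -t` line
format exactly as pasted above; the function names are made up for
illustration and are not part of pcicrawler.

```python
import re

# Matches one device line of `pcicrawler -t` output as pasted above, e.g.
# "│ │ ├─04:0a.0 downstream_port, slot 6, device present, power: Off, ..."
# (assumption: address, a space, the port type, then comma-separated fields)
LINE_RE = re.compile(r"([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (\w+)(.*)")


def parse_ports(tree_text):
    """Return one dict per downstream port found in the tree text."""
    ports = []
    for line in tree_text.splitlines():
        m = LINE_RE.search(line)
        if not m or m.group(2) != "downstream_port":
            continue
        # Everything after the port type is a comma-separated attribute list.
        attrs = [a.strip() for a in m.group(3).split(",") if a.strip()]
        ports.append({
            "addr": m.group(1),
            "present": "device present" in attrs,
            "power_off": "power: Off" in attrs,
        })
    return ports


def present_but_power_off(tree_text):
    """Addresses of ports that report a device while power reads Off."""
    return [p["addr"] for p in parse_ports(tree_text)
            if p["present"] and p["power_off"]]
```

Feeding it the output saved from two boots and diffing the returned lists
would show whether the port state differs between the recovering and the
locking-up case.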
The lockups appear most frequently after a cold boot of the system. If I
restart only the CPU card and leave the MCH (where the PEX8748 switch
resides) powered, the lockups do *not* happen. I'm injecting the same error
into the root port, and the system card configuration/location/count is
always the same.
Nevertheless, on rare occasions no lockup is observed when booting the same
kernel image after a complete system power cycle.
So far I have observed that the lockups always seem to happen when recovery
is dealing with the 02:01.0 device/bus.
If the system recovers from the first injected error, I can repeat the
injection and the system always recovers. If the first recovery fails, I
have to either reboot the CPU or power cycle the complete system.
To me it looks like this behavior is somehow related to the system/setup I
have, and for some reason is triggered by VC restoration (VC is not in use
by my system at all, AFAIK).
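
To see which devices advertise a Virtual Channel capability at all, a rough
sketch like the one below could be run over saved `lspci -vvv` output (taken
as root so the extended capabilities are visible). It assumes lspci's usual
layout of an unindented address line per device followed by indented
"Capabilities: [...] ..." lines; the function name is hypothetical. It
matches any capability line containing "Virtual Channel", so it does not
distinguish between VC variants.

```python
import re

# Unindented device header line, with or without a PCI domain prefix,
# e.g. "02:01.0 PCI bridge: ..." or "0000:02:01.0 PCI bridge: ..."
DEV_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def devices_with_vc(lspci_text):
    """Return addresses of devices listing a Virtual Channel capability."""
    found = []
    current = None  # address of the device whose block is being read
    for line in lspci_text.splitlines():
        m = DEV_RE.match(line)
        if m:
            current = m.group(1)
        elif current and "Capabilities:" in line and "Virtual Channel" in line:
            if current not in found:
                found.append(current)
    return found
```

For example, after `sudo lspci -vvv > lspci.txt`, calling
`devices_with_vc(open("lspci.txt").read())` lists the devices to look at.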