On Wed, Mar 5, 2025 at 2:31 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Tue, Mar 04, 2025 at 10:19:07PM +0530, Naveen Kumar P wrote:
> > On Tue, Mar 4, 2025 at 1:35 PM Naveen Kumar P
> > <naveenkumar.parna@xxxxxxxxx> wrote:
> > ...
> >
> > For this test run, I removed all three parameters (pcie_aspm=off,
> > pci=nomsi, and pcie_ports=on) and booted with the following kernel
> > command line arguments:
> >
> > cat /proc/cmdline
> > BOOT_IMAGE=/vmlinuz-6.13.0+ root=/dev/mapper/vg00-rootvol ro quiet
> > "dyndbg=file drivers/pci/* +p; file drivers/acpi/bus.c +p; file
> > drivers/acpi/osl.c +p"
> >
> > This time, the issue occurred earlier, at 22998 seconds. Below is the
> > relevant dmesg log during the ACPI_NOTIFY_BUS_CHECK event. The
> > complete log is attached (dmesg_march4th_log.txt).
> >
> > [22998.536705] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
> > [22998.536753] ACPI: \_SB_.PCI0.RP01: ACPI: OSL: Scheduling hotplug
> > event 0 for deferred handling
> > [22998.536934] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bridge acquired in
> > hotplug_event()
> > [22998.536972] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > [22998.537002] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Checking bridge in
> > hotplug_event()
> > [22998.537024] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> > data=0x55551556
> > [22998.537066] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> > data=0x55551556
>
> Fine again.
>
> > [22998.537094] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Enabling slot in
> > acpiphp_check_bridge()
> > [22998.537155] ACPI: Device [PXSX] status [0000000f]
> > [22998.537206] ACPI: Device [D015] status [0000000f]
> > [22998.537276] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Releasing bridge
> > in hotplug_event()
> >
> > sudo lspci -xxx -s 01:00.0 | grep 10:
> > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> Obviously a problem. Can you start including the whole
> "lspci -x -s 01:00.0" output? Obviously the Vendor ID reads above
> worked fine. I *assume* it's still fine here, and only the BARs are
> zeroed out?

I've captured the complete lspci output from the last run; it is as
follows:

$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555
00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

I've also observed some inconsistency in the behavior. In previous runs,
the first invocation of lspci showed all FF's, and only the second
invocation showed the BARs zeroed out, as shown below.
Previous runs - first invocation of lspci output:
--------------------------------------------------
$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555 (rev ff)
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Previous runs - second invocation of lspci output:
--------------------------------------------------
$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555
00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

However, this time the first invocation did not show all FF's; it went
straight to the state with the BARs zeroed out.

> I assume you saw no new dmesg logs about config accesses to the device
> before the lspci. If you instrumented the user config accessors
> (pci_user_read_config_*(), also in access.c), you should see those
> accesses.

I will try this and update you with the results soon. A rough sketch of
the instrumentation I have in mind is at the end of this mail; please let
me know if that is not what you meant.

> You could sprinkle some calls to early_dump_pci_device() through the
> acpiphp path. Turn off the kernel config access tracing when you do
> this so it doesn't clutter things up.
>
> What is this device? Is it a shipping product? Do you have good

The PCIe device in question is a Xilinx FPGA endpoint, flashed with RTL
code that exposes several host interfaces to the system over the PCIe
link.

> confidence that the hardware is working correctly? I guess you said
> it works correctly on a different machine with an older kernel. I
> would swap the cards between machines in case one card is broken.
>
> You could try bisecting between the working kernel and the broken one.
> It's kind of painful since it takes so long to reproduce the problem.
>
> Bjorn
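
For reference, here is the rough, untested sketch of the instrumentation
I plan to add for the user config accessors. The idea is to add a trace
inside the PCI_USER_READ_CONFIG() macro body in drivers/pci/access.c,
just before the return (each added line would need the usual trailing
backslash there, and the exact macro body may differ on my 6.13 tree).
The bus number check is only there to limit the noise to the device
behind RP01:

	/* Trace user-space (sysfs/lspci) config reads of bus 01 */
	if (dev->bus->number == 0x01)
		pr_info("PCI USER READ: %s pos=%#x len=%zu data=%#x ret=%d\n",
			pci_name(dev), pos, sizeof(type), data, ret);

With that in place, every lspci read of 01:00.0 should show up in dmesg
next to the kernel-internal "PCI READ" traces I already have, so we can
tell whether lspci is seeing stale data or the device really returns
zeros.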
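
For the early_dump_pci_device() suggestion: my understanding is that it
dumps config space through the direct 0xCF8/0xCFC mechanism, bypassing
bus->ops entirely, so it should tell us whether the device itself or the
normal accessor path is returning the zeroed BARs. Something along these
lines, hard-coded for 01:00.0, is what I intend to sprinkle through the
acpiphp path, assuming acpiphp is built in (the symbol does not appear to
be exported to modules) and that the direct port access is acceptable on
this x86 box:

	#include <asm/pci-direct.h>	/* early_dump_pci_device() */

	/* e.g. at the start and end of acpiphp_check_bridge() in
	 * drivers/pci/hotplug/acpiphp_glue.c: dump 01:00.0 via direct
	 * CF8/CFC config reads, bypassing the normal accessors. */
	early_dump_pci_device(0x01, 0x00, 0x00);

I will turn off the dyndbg config access tracing for that run, as you
suggested, so the dumps do not get buried.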