On Wed, Mar 5, 2025 at 2:31 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Tue, Mar 04, 2025 at 10:19:07PM +0530, Naveen Kumar P wrote:
> > On Tue, Mar 4, 2025 at 1:35 PM Naveen Kumar P
> > <naveenkumar.parna@xxxxxxxxx> wrote:
> > ...
> >
> > For this test run, I removed all three parameters (pcie_aspm=off,
> > pci=nomsi, and pcie_ports=on) and booted with the following kernel
> > command line arguments:
> >
> > cat /proc/cmdline
> > BOOT_IMAGE=/vmlinuz-6.13.0+ root=/dev/mapper/vg00-rootvol ro quiet
> > "dyndbg=file drivers/pci/* +p; file drivers/acpi/bus.c +p; file
> > drivers/acpi/osl.c +p"
> >
> > This time, the issue occurred earlier, at 22998 seconds. Below is the
> > relevant dmesg log during the ACPI_NOTIFY_BUS_CHECK event. The
> > complete log is attached (dmesg_march4th_log.txt).
> >
> > [22998.536705] ACPI: \_SB_.PCI0.RP01: ACPI: ACPI_NOTIFY_BUS_CHECK event
> > [22998.536753] ACPI: \_SB_.PCI0.RP01: ACPI: OSL: Scheduling hotplug
> > event 0 for deferred handling
> > [22998.536934] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bridge acquired in
> > hotplug_event()
> > [22998.536972] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Bus check in hotplug_event()
> > [22998.537002] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Checking bridge in
> > hotplug_event()
> > [22998.537024] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> > data=0x55551556
> > [22998.537066] PCI READ: res=0, bus=01 dev=00 func=0 pos=0x00 len=4
> > data=0x55551556
>
> Fine again.
>
> > [22998.537094] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Enabling slot in
> > acpiphp_check_bridge()
> > [22998.537155] ACPI: Device [PXSX] status [0000000f]
> > [22998.537206] ACPI: Device [D015] status [0000000f]
> > [22998.537276] ACPI: \_SB_.PCI0.RP01: acpiphp_glue: Releasing bridge
> > in hotplug_event()
> >
> > sudo lspci -xxx -s 01:00.0 | grep 10:
> > 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> Obviously a problem. Can you start including the whole
> "lspci -x -s 01:00.0" output? Obviously the Vendor ID reads above
> worked fine. I *assume* it's still fine here, and only the BARs are
> zeroed out?

I've captured the complete lspci output from the last run; it is as
follows:

$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555
00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

I've also observed some inconsistency in the behavior. In previous runs,
the first invocation of lspci showed all FF's, and only the second
invocation showed the BARs zeroed out, as shown below.
Previous runs - first invocation of lspci output:
--------------------------------------------------
$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555 (rev ff)
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Previous runs - second invocation of lspci output:
--------------------------------------------------
$ sudo lspci -xxx -s 01:00.0
01:00.0 RAM memory: PLDA Device 5555
00: 56 15 55 55 00 00 10 00 00 00 00 05 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 01 48 03 00 08 00 00 00 05 60 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 02 00 c2 8f 00 00 10 28 00 00 21 f4 03 00
70: 00 00 21 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
90: 20 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

However, this time the first invocation did not show all FF's; it went
straight to the state with the BARs zeroed out.

> I assume you saw no new dmesg logs about config accesses to the device
> before the lspci. If you instrumented the user config accessors
> (pci_user_read_config_*(), also in access.c), you should see those
> accesses.

I will try this and update you with the results soon. A rough sketch of
the instrumentation I have in mind is at the end of this mail; please let
me know if that is not what you meant.

> You could sprinkle some calls to early_dump_pci_device() through the
> acpiphp path. Turn off the kernel config access tracing when you do
> this so it doesn't clutter things up.
>
> What is this device? Is it a shipping product? Do you have good

The PCIe device in question is a Xilinx FPGA endpoint, flashed with RTL
code that exposes several host interfaces to the system over the PCIe
link.

> confidence that the hardware is working correctly? I guess you said
> it works correctly on a different machine with an older kernel. I
> would swap the cards between machines in case one card is broken.
>
> You could try bisecting between the working kernel and the broken one.
> It's kind of painful since it takes so long to reproduce the problem.
>
> Bjorn
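
For reference, here is the rough, untested sketch of the instrumentation
I plan to add for the user config accessors. The idea is to add a trace
inside the PCI_USER_READ_CONFIG() macro body in drivers/pci/access.c,
just before the return (each added line would need the usual trailing
backslash there, and the exact macro body may differ on my 6.13 tree).
The bus number check is only there to limit the noise to the device
behind RP01:

	/* Trace user-space (sysfs/lspci) config reads of bus 01 */
	if (dev->bus->number == 0x01)
		pr_info("PCI USER READ: %s pos=%#x len=%zu data=%#x ret=%d\n",
			pci_name(dev), pos, sizeof(type), data, ret);

With that in place, every lspci read of 01:00.0 should show up in dmesg
next to the kernel-internal "PCI READ" traces I already have, so we can
tell whether lspci is seeing stale data or the device really returns
zeros.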
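
For the early_dump_pci_device() suggestion: my understanding is that it
dumps config space through the direct 0xCF8/0xCFC mechanism, bypassing
bus->ops entirely, so it should tell us whether the device itself or the
normal accessor path is returning the zeroed BARs. Something along these
lines, hard-coded for 01:00.0, is what I intend to sprinkle through the
acpiphp path, assuming acpiphp is built in (the symbol does not appear to
be exported to modules) and that the direct port access is acceptable on
this x86 box:

	#include <asm/pci-direct.h>	/* early_dump_pci_device() */

	/* e.g. at the start and end of acpiphp_check_bridge() in
	 * drivers/pci/hotplug/acpiphp_glue.c: dump 01:00.0 via direct
	 * CF8/CFC config reads, bypassing the normal accessors. */
	early_dump_pci_device(0x01, 0x00, 0x00);

I will turn off the dyndbg config access tracing for that run, as you
suggested, so the dumps do not get buried.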