Re: PCI: hotplug_event: PCIe PLDA Device BAR Reset

On Tue, Feb 25, 2025 at 1:24 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Tue, Feb 25, 2025 at 12:29:00AM +0530, Naveen Kumar P wrote:
> > On Mon, Feb 24, 2025 at 11:03 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Mon, Feb 24, 2025 at 05:45:35PM +0530, Naveen Kumar P wrote:
> > > > On Wed, Feb 19, 2025 at 10:36 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > > On Wed, Feb 19, 2025 at 05:52:47PM +0530, Naveen Kumar P wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I am writing to seek assistance with an issue we are experiencing with
> > > > > > a PCIe device (PLDA Device 5555) connected through PCI Express Root
> > > > > > Port 1 to the host bridge.
> > > > > >
> > > > > > We have observed that after booting the system, the Base Address
> > > > > > Register (BAR0) memory of this device gets reset to 0x0 after
> > > > > > approximately one hour or more (the timing is inconsistent). This was
> > > > > > verified using the lspci output and the setpci -s 01:00.0
> > > > > > BASE_ADDRESS_0 command.
>
> > ...
> > I booted with the pcie_aspm=off kernel parameter, which means that
> > PCIe Active State Power Management (ASPM) is disabled. Given this
> > context, should I consider removing this setting to see if it affects
> > the occurrence of the Bus Check notifications and the BAR0 reset
> > issue?
>
> Doesn't seem likely to be related.  Once configured, ASPM operates
> without any software intervention.  But note that "pcie_aspm=off"
> means the kernel doesn't touch ASPM configuration at all, and any
> configuration done by firmware remains in effect.
>
> You can tell whether ASPM has been enabled by firmware with "sudo
> lspci -vv" before the problem occurs.
>
> > > > During the ACPI_NOTIFY_BUS_CHECK event, the lspci output initially
> > > > showed all FF's, and then the next run of the same command showed
> > > > BASE_ADDRESS_0 reset to zero:
> > > > $ sudo lspci -xxx -s 01:00.0 | grep "10:"
> > > > 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > >
> > > Looks like the device isn't responding at all here.  Could happen if
> > > the device is reset or powered down.
> >
> > From the kernel driver or user space tools, is it possible to
> > determine whether the device has been reset or powered down?  Are
> > there any power management settings or configurations that could be
> > causing the device to reset or power down unexpectedly?
>
> Not really.  By "powered down", I meant D3cold, where the main power
> is removed.  Config space is readable in all other power states.
>
> > > What is this device?  What driver is bound to it?  I don't see
> > > anything in dmesg that identifies a driver.
> >
> > The PCIe device in question is a Xilinx FPGA endpoint, which is
> > flashed with RTL code to expose several host interfaces to the system
> > via the PCIe link.
> >
> > We have an out-of-tree driver for this device, but to eliminate the
> > driver's role in this issue, I renamed the driver to prevent it from
> > loading automatically after rebooting the machine. Despite not using
> > the driver, the issue still occurred.
>
> Oh, right, I forgot that you mentioned this before.
>
> > > You're seeing the problem on v5.4 (Nov 2019), which is much newer than
> > > v4.4 (Jan 2016).  But v5.4 is still really too old to spend a lot of
> > > time on unless the problem still happens on a current kernel.
>
> This part is important.  We don't want to spend a lot of time
> debugging an issue that may have already been fixed upstream.
Sure. I have started building a 6.13 kernel and will post more
information if the issue reproduces there.
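
To catch the moment it happens on the new kernel, I am thinking of
polling config space with something like the sketch below. It only
distinguishes the two states seen so far (all FF's, then BAR0 reading
zero). The "0000" PCI domain in the sysfs path, the function names and
the treatment of BAR0 as a memory BAR are my assumptions, not something
confirmed in this thread.

```python
from pathlib import Path


def classify_config(header: bytes) -> str:
    """Classify the start of a PCI config space header.

    Returns "no-response" when the vendor ID reads as 0xFFFF (the
    all-FF's case), "bar0-cleared" when the BAR0 address bits are zero,
    and "ok" otherwise.
    """
    if len(header) < 0x14:
        raise ValueError("need at least the first 0x14 bytes of config space")
    vendor_id = int.from_bytes(header[0x00:0x02], "little")
    if vendor_id == 0xFFFF:
        return "no-response"
    bar0 = int.from_bytes(header[0x10:0x14], "little")
    # Assumption: BAR0 is a memory BAR, so the low 4 bits are flags.
    if (bar0 & 0xFFFFFFF0) == 0:
        return "bar0-cleared"
    return "ok"


def read_header(bdf: str = "0000:01:00.0") -> bytes:
    """Read the standard 64-byte header through sysfs (domain assumed)."""
    return Path(f"/sys/bus/pci/devices/{bdf}/config").read_bytes()[:64]
```

Calling classify_config(read_header()) from a loop with a timestamp
should show whether the all-FF's state always precedes the zeroed BAR0.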

Regarding CommClk: in the lspci output below, "CommClk-" means the
Common Clock Configuration bit in the Link Control register is clear,
i.e. common clock configuration is disabled.

LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

For my device, I noticed that the common clock configuration is
disabled. Could this be causing the BAR reset issue?

How is the CommClk bit determined (set or cleared)? And is it okay to
set this bit after the kernel has booted?
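
For reference, the same bits can be cross-checked without lspci by
walking the capability list in config space. This is only a sketch: the
constant and function names are mine, and it assumes the caller passes
in config space bytes read as root (sysfs exposes only the first 64
bytes to unprivileged users). It reports the ASPM Control and Common
Clock Configuration bits of Link Control, plus the Slot Clock
Configuration bit of Link Status.

```python
PCI_STATUS = 0x06            # Status register offset
PCI_STATUS_CAP_LIST = 0x10   # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # Capabilities pointer offset
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKCTL = 0x10        # Link Control offset within the capability
PCI_EXP_LNKSTA = 0x12        # Link Status offset within the capability


def find_capability(cfg: bytes, cap_id: int):
    """Return the offset of capability cap_id, or None if not found."""
    status = int.from_bytes(cfg[PCI_STATUS:PCI_STATUS + 2], "little")
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    for _ in range(48):  # bound the walk in case the list is corrupt
        if pos < 0x40 or pos + 1 >= len(cfg):
            return None
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return None


def decode_link_control(cfg: bytes):
    """Decode selected Link Control / Link Status bits, or return None."""
    cap = find_capability(cfg, PCI_CAP_ID_EXP)
    if cap is None or cap + PCI_EXP_LNKSTA + 2 > len(cfg):
        return None
    lnkctl = int.from_bytes(cfg[cap + PCI_EXP_LNKCTL:cap + PCI_EXP_LNKCTL + 2], "little")
    lnksta = int.from_bytes(cfg[cap + PCI_EXP_LNKSTA:cap + PCI_EXP_LNKSTA + 2], "little")
    return {
        "aspm_l0s": bool(lnkctl & 0x0001),      # ASPM Control bit 0
        "aspm_l1": bool(lnkctl & 0x0002),       # ASPM Control bit 1
        "common_clock": bool(lnkctl & 0x0040),  # Common Clock Configuration
        "slot_clock": bool(lnksta & 0x1000),    # Slot Clock Configuration
    }
```

With the LnkCtl line above, I would expect common_clock, aspm_l0s and
aspm_l1 to all come back False for 01:00.0.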

>
> Bjorn




