Re: [RFC] PCI/AER: Block runtime suspend when handling errors

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Feb 06, 2024 at 04:37:36PM +0100, Stanislaw Gruszka wrote:
> > > > If this is a real possibility (I mean, device in a low-power state and
> > > > in an error state at the same time), it would be better to call
> > > > __pm_runtime_disable(dev, false) instead of pm_runtime_get_sync(), as
> > > > that would prevent runtime PM from changing the device state.
> > >
> > > __pm_runtime_disable(dev, false) does not work as reliable in my
> > > test as pm_runtime_get_sync(), the
> > >
> > > igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
> > >
> > > message disappears, but sill have:
> > >
> > > igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
> > > pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
> > > pcieport 0000:00:1c.2: AER: subordinate device reset failed
> > > pcieport 0000:00:1c.2: AER: device recovery fail
> > 
> > But what exactly do you do?
> > 
> > (1) __pm_runtime_disable(dev, false)
> > (2) Check power state
> >      (a) If D0 (and device runtime-active), proceed
> >      (b) If > D0, remove power (if possible) and put into D0
> > 
> > or something else?
> 
> I just did point (1), did not check power state (2).
> I tested below patch with replaced:
> 
>   pm_runtime_get_sync -> __pm_runtime_disable(false)
>   pm_runtime_put -> pm_runtime_enable()
> 
> I could try to test with (1)+(2), but now sure how do do step (2b),
> by:
> 
> pci_set_power_state(D3cold)
> pci_set_power_state(D0)

The problematic case is indeed when after __pm_runtime_disable(), device
is in D3hot state. In such case we need to do the same operations as
pci_pm_runtime_resume() does, otherwise AER code is not able to work.
I think, just doing pm_runtime_get_sync() is better.

While I'm able to reproduce D3hot & error state using aer_inject
on the same smae device, more practical case is recovery running on all
devices connected to a port (else case in pcie_do_recovery 
type == PCIE_EXP_TYPE* check). On such case some devices can be suspend,
and from AER code comments I conclude they have to be reset.

> +static int pci_pm_runtime_put(struct pci_dev *pdev, void *data)
> +{
> +	pm_runtime_put(&pdev->dev);
> +	return 0;
> +}
> +
<snip>
> +
> +	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);

This can happen after driver is unbind from device. I had concern about
that, but after drivers/base/ code examination, seems to be fine do
do pm_runtime_put() after unbind.

Regards
Stanislaw




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux