Re: [RFC] PCI/AER: Block runtime suspend when handling errors

On Tue, Jan 23, 2024 at 10:00:16AM +0100, Stanislaw Gruszka wrote:
> PM runtime can be done simultaneously with AER error handling.
> Avoid that by using pm_runtime_get_sync() just after pci_dev_get()
> and pm_runtime_put() just before pci_dev_put() in AER recovery
> procedures.

I guess there must be a general rule here, like "PCI core must use
pm_runtime_get_sync() whenever touching the device asynchronously,
i.e., when it's doing something unrelated to a call from the driver"?

Probably would apply to all subsystem cores, not just PCI.
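
To make the rule I'm guessing at concrete, here is a small userspace
model of it, not kernel code.  Everything named mock_* is hypothetical
and only mimics the usage-count behavior described in the commit log:
a core path that touches the device on its own initiative holds a
reference for the whole time, so a suspend request arriving in the
middle is refused.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: none of the mock_* names are kernel APIs. */
struct mock_dev {
	int usage_count;	/* stands in for dev->power.usage_count */
	bool suspended;		/* true while "runtime suspended" */
};

/* Models pm_runtime_get_sync(): take a reference, then resume. */
static void mock_rpm_get_sync(struct mock_dev *d)
{
	d->usage_count++;
	d->suspended = false;
}

/* Models pm_runtime_put(): drop the reference taken above. */
static void mock_rpm_put(struct mock_dev *d)
{
	assert(d->usage_count > 0);
	d->usage_count--;
}

/* Models a runtime suspend request; refused while a reference is held. */
static bool mock_rpm_try_suspend(struct mock_dev *d)
{
	if (d->usage_count > 0)
		return false;
	d->suspended = true;
	return true;
}

/*
 * Models an asynchronous core path such as error recovery.  The
 * reference is held across the whole access.  Returns whether a suspend
 * attempted in the middle of recovery got through.
 */
static bool mock_async_recovery(struct mock_dev *d)
{
	bool suspended_during_recovery;

	mock_rpm_get_sync(d);
	suspended_during_recovery = mock_rpm_try_suspend(d);
	mock_rpm_put(d);

	return suspended_during_recovery;
}
```

In this model, mock_async_recovery() on a suspended device returns
false and leaves usage_count at 0, after which a later
mock_rpm_try_suspend() succeeds.  The model ignores locking and failure
paths; the real pm_runtime_get_sync() returns a value that callers may
need to check.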

> I'm not sure about DPC case since I do not see get/put there. It
> just call pci_do_recovery() from threaded irq dcd_handler().
> I think pm_runtime* should be added to this handler as well.

s/dcd_handler/dpc_handler/

I'm guessing the "threaded" part really doesn't matter; just the fact
that this is in response to an interrupt, not something directly
called by a driver?

> pm_runtime_get_sync() will increase dev->power.usage_count counter to
> prevent any rpm actives. When there is suspending pending, it will wait
> for it and do the rpm resume. Not sure if that problem, on my testing
> I did not encounter issues with that.

Sorry, I didn't catch your meaning here.  IIUC, you can reproduce the
problem with the simultaneous aer_inject and rpm suspend/resume, and
this patch fixes it.

But there's some other scenario where you *don't* see the problem?

> I tested with igc device by doing simultaneous aer_inject and 
> rpm suspend/resume via /sys/bus/pci/devices/PCI_ID/power/control
> and can reproduce: 
> 
> [  853.253938] igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
> [  853.253973] pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
> [  853.253996] pcieport 0000:00:1c.2: AER: subordinate device reset failed
> [  853.254099] pcieport 0000:00:1c.2: AER: device recovery failed
> [  853.254178] igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible

Drop the timestamps; they don't add to understanding the problem.

> The problem disappears when applied this patch.
> 
> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@xxxxxxxxxxxxxxx>
> ---
>  drivers/pci/pcie/aer.c | 8 ++++++++
>  drivers/pci/pcie/edr.c | 3 +++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 42a3bd35a3e1..9b56460edc76 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -23,6 +23,7 @@
>  #include <linux/kernel.h>
>  #include <linux/errno.h>
>  #include <linux/pm.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/init.h>
>  #include <linux/interrupt.h>
>  #include <linux/delay.h>
> @@ -813,6 +814,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
>  {
>  	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
>  		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> +		pm_runtime_get_sync(&dev->dev);
>  		e_info->error_dev_num++;
>  		return 0;
>  	}
> @@ -1111,6 +1113,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	cxl_rch_handle_error(dev, info);
>  	pci_aer_handle_error(dev, info);
> +
> +	pm_runtime_put(&dev->dev);
>  	pci_dev_put(dev);
>  }
>  
> @@ -1143,6 +1147,8 @@ static void aer_recover_work_func(struct work_struct *work)
>  			       PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
>  			continue;
>  		}
> +		pm_runtime_get_sync(&pdev->dev);
> +
>  		pci_print_aer(pdev, entry.severity, entry.regs);
>  		/*
>  		 * Memory for aer_capability_regs(entry.regs) is being allocated from the
> @@ -1159,6 +1165,8 @@ static void aer_recover_work_func(struct work_struct *work)
>  		else if (entry.severity == AER_FATAL)
>  			pcie_do_recovery(pdev, pci_channel_io_frozen,
>  					 aer_root_reset);
> +
> +		pm_runtime_put(&pdev->dev);
>  		pci_dev_put(pdev);
>  	}
>  }
> diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
> index 5f4914d313a1..bd96babd7249 100644
> --- a/drivers/pci/pcie/edr.c
> +++ b/drivers/pci/pcie/edr.c
> @@ -10,6 +10,7 @@
>  
>  #include <linux/pci.h>
>  #include <linux/pci-acpi.h>
> +#include <linux/pm_runtime.h>
>  
>  #include "portdrv.h"
>  #include "../pci.h"
> @@ -169,6 +170,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>  		return;
>  	}
>  
> +	pm_runtime_get_sync(&edev->dev);
>  	pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(edev));
>  
>  	/* If port does not support DPC, just send the OST */
> @@ -209,6 +211,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>  		acpi_send_edr_status(pdev, edev, EDR_OST_FAILED);
>  	}
>  
> +	pm_runtime_put(&edev->dev);
>  	pci_dev_put(edev);
>  }
>  
> -- 
> 2.34.1
> 



