[+cc Rafael] On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote: > PCI core allows users to configure the D3Cold state for each PCI device > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users > to allow/disallow the PCI devices to enter D3Cold during system suspend. > > So make use of this flag in the NVMe driver to shutdown the NVMe device > during system suspend if the user has allowed D3Cold for the device. > Existing checks in the NVMe driver decide whether to shut down the device > (based on platform/device limitations), so use this flag as the last resort > to keep the existing behavior. > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow > D3Cold and the users can disallow it through sysfs if they want. What problem does this solve? I guess there must be a case where suspend leaves NVMe in a higher power state than you want? What does it mean that this is the "last resort to keep the existing behavior"? All the checks are OR'd together and none have side effects, so the order doesn't really matter. It changes the existing behavior *unless* the user has explicitly cleared d3cold_allowed via sysfs. pdev->d3cold_allowed is set by default, so I guess this change means that unless the user clears d3cold_allowed, we let the PCI core decide the suspend state instead of managing it directly in nvme_suspend()? > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx> > --- > drivers/nvme/host/pci.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 4b9fda0b1d9a..a4d4687854bf 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev) > */ > if (pm_suspend_via_firmware() || !ctrl->npss || > !pcie_aspm_enabled(pdev) || > - (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND)) > + (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) || > + pdev->d3cold_allowed) > return nvme_disable_prepare_reset(ndev, true); I guess your intent is that if users prevent use of D3cold via sysfs, we'll use the NVMe-specific power states, and otherwise, the PCI core will decide? I think returning 0 here means the PCI core decides what state to use in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path. This could be any state from D0 to D3cold depending on platform support and wakeup considerations (see pci_target_state()). I'm not sure the use of pdev->d3cold_allowed here really expresses your underlying intent. It suggests that you're really hoping for D3cold, but that's only a possibility if firmware supports it, and we have no visibility into that here. It also seems contrary to the earlier comment that suggests we prefer host managed nvme power settings: * The platform does not remove power for a kernel managed suspend so * use host managed nvme power settings for lowest idle power if * possible. This should have quicker resume latency than a full device * shutdown. But if the firmware is involved after the suspend or the * device does not support any non-default power states, shut down the * device fully. Bjorn