Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user

Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx> · Wed, 27 Nov 2024 11:19:31 +0530

On Tue, Nov 26, 2024 at 11:11:38AM -0600, Bjorn Andersson wrote:
> On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> > + Ulf (also interested in this topic)
> > 
> > On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > > [+cc Rafael]
> > > 
> > > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > > PCI core allows users to configure the D3Cold state for each PCI device
> > > > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> > > > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> > > > to allow/disallow the PCI devices to enter D3Cold during system suspend.
> > > >
> > > > So make use of this flag in the NVMe driver to shutdown the NVMe device
> > > > during system suspend if the user has allowed D3Cold for the device.
> > > > Existing checks in the NVMe driver decide whether to shut down the device
> > > > (based on platform/device limitations), so use this flag as the last resort
> > > > to keep the existing behavior.
> > > > 
> > > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
> > > > D3Cold and the users can disallow it through sysfs if they want.
> > > 
> > > What problem does this solve?  I guess there must be a case where
> > > suspend leaves NVMe in a higher power state than you want?
> > > 
> > 
> > Yeah, this is the case for all systems that doesn't fit into the existing checks
> > in the NVMe suspend path:
> > 
> > 1. ACPI based platforms
> > 2. Controller doesn't support NPSS (hardware issue/limitation)
> > 3. ASPM not enabled
> > 4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag
> > 
> > In my case, all the Qualcomm SoCs using Devicetree doesn't fall into the above
> > checks. Hence, they were not fully powered down during system suspend and always
> > in low power state. This means, I cannot achieve 'CX power collapse', a Qualcomm
> > specific SoC powered down state that consumes just enough power to wake up the
> > SoC. Since the controller driver keeps the PCI resource vote because of NVMe,
> > the firmware in the Qualcomm SoCs cannot put the SoC into above mentioned low
> > power state.
> > 
> 
> Per our previous discussions, I think we concluded that we have two
> cases:
> a) of_machine_is_compatible("qcom,sc8280xp")
> b) Everything else
> 
> In #a we have to power down NVMe as the link doesn't survive the CX
> collapse, is this your #2?. For #b it's primarily a policy decision
> about the tradeoff between battery and flash life (and in some cases, as
> already seen in the nvme driver, some device-specific policy).
> 

With this proposed usage of the 'd3cold_allowed' sysfs attribute, all devices
will enter D3Cold by default, unless overridden by userspace. So #a will be
covered by default, and #b is left to userspace.

> > > What does it mean that this is the "last resort to keep the existing
> > > behavior"?  All the checks are OR'd together and none have side
> > > effects, so the order doesn't really matter.  It changes the existing
> > > behavior *unless* the user has explicitly cleared d3cold_allowed via
> > > sysfs.
> > > 
> > 
> > Since the checks are ORed, this new check is not going to change the existing
> > behavior for systems satisfying above checks i.e., even if the user changes the
> > flag to forbid D3Cold, it won't affect them and it *shoudn't*.
> > 
> > > pdev->d3cold_allowed is set by default, so I guess this change means
> > > that unless the user clears d3cold_allowed, we let the PCI core decide
> > > the suspend state instead of managing it directly in nvme_suspend()?
> > > 
> > 
> > If 'pdev->d3cold_allowed' is set, then NVMe driver will shutdown the device and
> > the PCI controller driver can turn off all bus specific resources. Otherwise,
> > NVMe driver will put the device into low power mode and the controller driver
> > has to do something similar to retain the device power.
> > 
> > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx>
> > > > ---
> > > >  drivers/nvme/host/pci.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > > index 4b9fda0b1d9a..a4d4687854bf 100644
> > > > --- a/drivers/nvme/host/pci.c
> > > > +++ b/drivers/nvme/host/pci.c
> > > > @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
> > > >  	 */
> > > >  	if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > >  	    !pcie_aspm_enabled(pdev) ||
> > > > -	    (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
> > > > +	    (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
> > > > +	    pdev->d3cold_allowed)
> > > >  		return nvme_disable_prepare_reset(ndev, true);
> > > 
> > > I guess your intent is that if users prevent use of D3cold via sysfs,
> > > we'll use the NVMe-specific power states, and otherwise, the PCI core
> > > will decide?
> > > 
> > 
> > Not PCI core, but the controller drivers actually which takes care of powering
> > down the PCI bus resources.
> > 
> > > I think returning 0 here means the PCI core decides what state to use
> > > in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path.  This
> > > could be any state from D0 to D3cold depending on platform support and
> > > wakeup considerations (see pci_target_state()).
> > > 
> > > I'm not sure the use of pdev->d3cold_allowed here really expresses
> > > your underlying intent.  It suggests that you're really hoping for
> > > D3cold, but that's only a possibility if firmware supports it, and we
> > > have no visibility into that here.
> > > 
> > 
> > I'm not relying on firmware to do anything here. If firmware has to decide the
> > suspend state, it should already satisfy the pm_suspend_via_firmware() check in
> > nvme_suspend(). Here, the controller driver takes care of putting the device
> > into D3Cold. Currently, the controller drivers cannot do it (on DT platforms)
> > because of NVMe driver's behavior.
> > 
> > > It also seems contrary to the earlier comment that suggests we prefer
> > > host managed nvme power settings:
> > > 
> > >   * The platform does not remove power for a kernel managed suspend so
> > >   * use host managed nvme power settings for lowest idle power if
> > >   * possible. This should have quicker resume latency than a full device
> > >   * shutdown.  But if the firmware is involved after the suspend or the
> > >   * device does not support any non-default power states, shut down the
> > >   * device fully.
> > 
> > This above comment satisfies the ACPI platforms as the firmware controls the
> > suspend state. On DT platforms, even though the firmware takes care of suspend
> > state, it still relies on the controller driver to relinquish the votes for PCI
> > resources. Only then, the firmware will put the whole SoC in power down mode
> > a.k.a CX power collapse mode in Qcom SoCs.
> > 
> > We did attempt so solve this problem in multiple ways, but the lesson learned
> > was, kernel cannot decide the power mode without help from userspace. That's the
> > reason I wanted to make use of this 'd3cold_allowed' sysfs attribute to allow
> > userspace to override the D3Cold if it wants based on platform requirement.
> > 
> 
> What do you mean "without help from userspace". Which entity in
> userspace is going to help make this decision?
> 

The entities controlling the similar UFS 'spm_lvl' attributes (Android userspace
is one such example). Right now, most of the platforms (not just Qcom) require
the PCIe devices to be powered down during system suspend. So no userspace
intervention is required. But if someone puts the same chipsets on Android
form factor devices, then they need to teach the Android userspace to control
this attribute similar to how it is done for UFS.

> > This is similar to how UFS allows users to configure power states of both the
> > device and controller:
> > 
> 
> "Allow users to configure" is an optimization, is the purpose of this
> patch to allow similar kind of optimization? Or is it supposed to allow
> userspace to help solve case #a above (hardware doesn't survive CX
> collapse)?
> 

Both as explained above.

- Mani

-- 
மணிவண்ணன் சதாசிவம்