On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote: > + Ulf (also interested in this topic) > > On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote: > > [+cc Rafael] > > > > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote: > > > PCI core allows users to configure the D3Cold state for each PCI device > > > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This > > > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users > > > to allow/disallow the PCI devices to enter D3Cold during system suspend. > > > > > > So make use of this flag in the NVMe driver to shutdown the NVMe device > > > during system suspend if the user has allowed D3Cold for the device. > > > Existing checks in the NVMe driver decide whether to shut down the device > > > (based on platform/device limitations), so use this flag as the last resort > > > to keep the existing behavior. > > > > > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow > > > D3Cold and the users can disallow it through sysfs if they want. > > > > What problem does this solve? I guess there must be a case where > > suspend leaves NVMe in a higher power state than you want? > > > > Yeah, this is the case for all systems that doesn't fit into the existing checks > in the NVMe suspend path: > > 1. ACPI based platforms > 2. Controller doesn't support NPSS (hardware issue/limitation) > 3. ASPM not enabled > 4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag > > In my case, all the Qualcomm SoCs using Devicetree doesn't fall into the above > checks. Hence, they were not fully powered down during system suspend and always > in low power state. This means, I cannot achieve 'CX power collapse', a Qualcomm > specific SoC powered down state that consumes just enough power to wake up the > SoC. Since the controller driver keeps the PCI resource vote because of NVMe, > the firmware in the Qualcomm SoCs cannot put the SoC into above mentioned low > power state. > Per our previous discussions, I think we concluded that we have two cases: a) of_machine_is_compatible("qcom,sc8280xp") b) Everything else In #a we have to power down NVMe as the link doesn't survive the CX collapse, is this your #2?. For #b it's primarily a policy decision about the tradeoff between battery and flash life (and in some cases, as already seen in the nvme driver, some device-specific policy). > > What does it mean that this is the "last resort to keep the existing > > behavior"? All the checks are OR'd together and none have side > > effects, so the order doesn't really matter. It changes the existing > > behavior *unless* the user has explicitly cleared d3cold_allowed via > > sysfs. > > > > Since the checks are ORed, this new check is not going to change the existing > behavior for systems satisfying above checks i.e., even if the user changes the > flag to forbid D3Cold, it won't affect them and it *shoudn't*. > > > pdev->d3cold_allowed is set by default, so I guess this change means > > that unless the user clears d3cold_allowed, we let the PCI core decide > > the suspend state instead of managing it directly in nvme_suspend()? > > > > If 'pdev->d3cold_allowed' is set, then NVMe driver will shutdown the device and > the PCI controller driver can turn off all bus specific resources. Otherwise, > NVMe driver will put the device into low power mode and the controller driver > has to do something similar to retain the device power. > > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx> > > > --- > > > drivers/nvme/host/pci.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > > > index 4b9fda0b1d9a..a4d4687854bf 100644 > > > --- a/drivers/nvme/host/pci.c > > > +++ b/drivers/nvme/host/pci.c > > > @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev) > > > */ > > > if (pm_suspend_via_firmware() || !ctrl->npss || > > > !pcie_aspm_enabled(pdev) || > > > - (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND)) > > > + (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) || > > > + pdev->d3cold_allowed) > > > return nvme_disable_prepare_reset(ndev, true); > > > > I guess your intent is that if users prevent use of D3cold via sysfs, > > we'll use the NVMe-specific power states, and otherwise, the PCI core > > will decide? > > > > Not PCI core, but the controller drivers actually which takes care of powering > down the PCI bus resources. > > > I think returning 0 here means the PCI core decides what state to use > > in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path. This > > could be any state from D0 to D3cold depending on platform support and > > wakeup considerations (see pci_target_state()). > > > > I'm not sure the use of pdev->d3cold_allowed here really expresses > > your underlying intent. It suggests that you're really hoping for > > D3cold, but that's only a possibility if firmware supports it, and we > > have no visibility into that here. > > > > I'm not relying on firmware to do anything here. If firmware has to decide the > suspend state, it should already satisfy the pm_suspend_via_firmware() check in > nvme_suspend(). Here, the controller driver takes care of putting the device > into D3Cold. Currently, the controller drivers cannot do it (on DT platforms) > because of NVMe driver's behavior. > > > It also seems contrary to the earlier comment that suggests we prefer > > host managed nvme power settings: > > > > * The platform does not remove power for a kernel managed suspend so > > * use host managed nvme power settings for lowest idle power if > > * possible. This should have quicker resume latency than a full device > > * shutdown. But if the firmware is involved after the suspend or the > > * device does not support any non-default power states, shut down the > > * device fully. > > This above comment satisfies the ACPI platforms as the firmware controls the > suspend state. On DT platforms, even though the firmware takes care of suspend > state, it still relies on the controller driver to relinquish the votes for PCI > resources. Only then, the firmware will put the whole SoC in power down mode > a.k.a CX power collapse mode in Qcom SoCs. > > We did attempt so solve this problem in multiple ways, but the lesson learned > was, kernel cannot decide the power mode without help from userspace. That's the > reason I wanted to make use of this 'd3cold_allowed' sysfs attribute to allow > userspace to override the D3Cold if it wants based on platform requirement. > What do you mean "without help from userspace". Which entity in userspace is going to help make this decision? > This is similar to how UFS allows users to configure power states of both the > device and controller: > "Allow users to configure" is an optimization, is the purpose of this patch to allow similar kind of optimization? Or is it supposed to allow userspace to help solve case #a above (hardware doesn't survive CX collapse)? Regards, Bjorn > /sys/bus/platform/drivers/ufshcd/*/spm_lvl > /sys/bus/platform/devices/*.ufs/spm_lvl > > If the 'd3cold_allowed' attribute is not a good fit for this usecase, then I'd > like to introduce a new attribute similar to UFS. > > - Mani > > -- > மணிவண்ணன் சதாசிவம்