Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user

"Rafael J. Wysocki" <rafael@xxxxxxxxxx> · Tue, 17 Dec 2024 20:45:55 +0100

On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@xxxxxxxxxx> wrote:
>
> On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
>
> [...]
>
> > > There is also a case where some devices like
> > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > > powered down in order for the SoC to reach its low power state (CX power
> > > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > > leading to battery draining quickly. This platform is supported in upstream and
> > > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > > expecting the device to retain its state during resume. Because of this
> > > requirement, this platform is not reaching SoC level low power state with
> > > upstream kernel.
> >
> > OK, now all of this makes sense and that's why you really want NVMe
> > devices to end up in some form of PCI D3 in suspend-to-idle.
> >
> > Would D3hot be sufficient for this platform or does it need to be
> > D3cold?  If the latter, what exactly is the method by which they are
> > put into D3cold?
> >
>
> D3Cold is what preferred. Earlier the controller driver used to transition the
> device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
> etc... Now we have a new framework called 'pwrctrl' that handles the
> clock/regulators supplied to the device. So both controller and pwrctrl drivers
> need to work in a tandem to put the device into D3Cold.
>
> Once the PCIe client driver (NVMe in this case) powers down the device, then
> controller/pwrctrl drivers will check the PCIe device state and transition the
> device into D3Cold. This is a TODO btw.
>
> But right now there is no generic D3Cold handling available for DT platforms. I
> am hoping to fix that too once this NVMe issue is handled.

There's no generic D3cold handling for PCIe devices at all AFAICS.  At
least, I'm not aware of any standard way to do it.  Yes, there are
vendor-specific conventions that may even be followed by multiple
vendors, but not much beyond that.

> > > > > > > If the platform is using DT, then there is no entity setting
> > > > > > > pm_set_suspend_via_firmware().
> > > > > >
> > > > > > That's true and so the assumption is that in this case the handling of
> > > > > > all devices will always be the same regardless of which flavor of
> > > > > > system suspend is chosen by user space.
> > > > > >
> > > > >
> > > > > Right and that's why the above concern is raised.
> > > >
> > > > And it is still unclear to me what the problem with it is.
> > > >
> > > > What exactly can go wrong?
> > > >
> > > > > > > So NVMe will be kept in low power state all the
> > > > > > > time (still draining the battery).
> > > > > >
> > > > > > So what would be the problem with powering it down unconditionally?
> > > > > >
> > > > >
> > > > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > > > > will wear out the NVMe devices if used in Android tablets etc...
> > > >
> > > > I understand the wear-out concern.
> > > >
> > > > Is there anything else?
> > > >
> > >
> > > No, that's the only concern.
> >
> > OK
> >
> > I think we're getting to the bottom of the issue.
> >
> > First off, there really is no requirement to avoid putting PCI devices
> > into D3hot or D3cold during suspend-to-idle.  On all modern Intel
> > platforms, PCIe devices are put into D3(hot or cold) during
> > suspend-to-idle and I don't see why this should be any different on
> > platforms from any other vendors.
> >
> > The NVMe driver is an exception and it avoids D3(hot or cold) during
> > suspend-to-idle because of problems with some hardware or platforms.
> > It might in principle allow devices to go into D3(hot or cold) during
> > suspend-to-idle, so long as it knows that this is going to work.
> >
>
> Slight correction here.
>
> NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
> state on its own, not all the time. It has some checks [1] in the suspend path
> and if the platform/device passes one of the checks, it will power down the
> device.

Yes, there is a comment in that code explaining what's going on and
that is basically "if key ingredients are missing or the firmware is
going to do its thing anyway, don't bother".

> DT platforms doesn't pass any of the checks, so the NVMe driver always manages
> the power state on its own. Unfortunately, the resultant power saving is not
> enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
> all the time. This is the first problem we are trying to solve.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448

I see.

This cannot be done by the driver itself, though, at least not in
general.  The PCI layer needs to be involved and, if we are talking
about D3cold, the platform firmware needs to be involved either.

As a rule, the PCI layer reaches out to the platform firmware for help
as needed and drivers don't take part in this directly.

The NVMe driver would need to let the PCI layer take over and set the
PCI power state of the device if it wanted to get any deeper than NVMe
protocol specific power states.

In principle, this can be addressed with some kind of a driver opt-in.
That is, the NVMe driver would continue to work the way it does, but
instead of completely preventing the PCI layer from taking over, it
would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the
platform firmware provides a mechanism to do so and (b) the DT
indicates that this mechanism should be used for the given device.

> > However, there is an additional concern that putting an NVMe device
> > into D3cold every time during system suspend on Android might cause it
> > to wear out more quickly.
> >
>
> Right, this is the second problem.

Let's set this one somewhat aside for now.  We'll get back to it when
we have clarity in the above.

> > Is there anything else?
>
> We also need to consider the fact that the firmware in some platforms doesn't
> support S2R. So we cannot take a decision based on S2I/S2R alone.

Right, the S2I/S2R thing is ACPI-specific.

On platforms using ACPI, pm_suspend_via_firmware() means that the
firmware is at least likely to power down the whole platform, so the
PCI layer may as well be allowed to do what it wants with the device.

> I think there are atleast a couple of ways to solve above mentioned problems:
>
> 1. Go extra mile, take account of all issues/limitations mentioned above and
> come up with an API. This API will provide a sane default suspend behavior to
> drivers (fixing first problem) and will also allow changing the behavior
> dynamically (to fix the second problem where kernel cannot distinguish Android
> vs other OS).

I don't quite follow TBH.

A "default suspend behavior" is there already and it is to allow the
PCI layer to set the device's power state (in collaboration with the
platform firmware).  Thing is, the NVMe driver doesn't always want to
rely on it.

> 2. Allow DT platforms to call pm_set_suspend_via_firmware() and make use of the
> existing behavior in the NVMe driver. This would only solve the first problem
> though. But would be a good start.

That would mean just letting the PCI layer always take over on the
platforms that would call pm_set_suspend_via_firmware().

There is a potential issue with doing it, which is that everybody
calling pm_suspend_via_firmware() would then assume that the platform
would be powered down by the firmware.  I'm not sure how much of an
issue that would be in practice, though.