On Tue, Dec 17, 2024 at 08:45:55PM +0100, Rafael J. Wysocki wrote: > On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@xxxxxxxxxx> wrote: > > > > On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote: > > > > [...] > > > > > > There is also a case where some devices like > > > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be > > > > powered down in order for the SoC to reach its low power state (CX power > > > > collapse in Qcom terms). If not, the SoC would continue to draw more power > > > > leading to battery draining quickly. This platform is supported in upstream and > > > > we keep the PCIe interconnect voted during suspend as the NVMe driver is > > > > expecting the device to retain its state during resume. Because of this > > > > requirement, this platform is not reaching SoC level low power state with > > > > upstream kernel. > > > > > > OK, now all of this makes sense and that's why you really want NVMe > > > devices to end up in some form of PCI D3 in suspend-to-idle. > > > > > > Would D3hot be sufficient for this platform or does it need to be > > > D3cold? If the latter, what exactly is the method by which they are > > > put into D3cold? > > > > > > > D3Cold is what preferred. Earlier the controller driver used to transition the > > device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators > > etc... Now we have a new framework called 'pwrctrl' that handles the > > clock/regulators supplied to the device. So both controller and pwrctrl drivers > > need to work in a tandem to put the device into D3Cold. > > > > Once the PCIe client driver (NVMe in this case) powers down the device, then > > controller/pwrctrl drivers will check the PCIe device state and transition the > > device into D3Cold. This is a TODO btw. > > > > But right now there is no generic D3Cold handling available for DT platforms. I > > am hoping to fix that too once this NVMe issue is handled. > > There's no generic D3cold handling for PCIe devices at all AFAICS. At > least, I'm not aware of any standard way to do it. Yes, there are > vendor-specific conventions that may even be followed by multiple > vendors, but not much beyond that. > Yeah, right. Atleast ACPI has its own way of handling D3Cold and that's what I meant. There is no such option available for DT right now. I was shoping that once this NVMe issue is resolved, then I could look into D3Cold for DT platforms. > > > > > > > > If the platform is using DT, then there is no entity setting > > > > > > > > pm_set_suspend_via_firmware(). > > > > > > > > > > > > > > That's true and so the assumption is that in this case the handling of > > > > > > > all devices will always be the same regardless of which flavor of > > > > > > > system suspend is chosen by user space. > > > > > > > > > > > > > > > > > > > Right and that's why the above concern is raised. > > > > > > > > > > And it is still unclear to me what the problem with it is. > > > > > > > > > > What exactly can go wrong? > > > > > > > > > > > > > So NVMe will be kept in low power state all the > > > > > > > > time (still draining the battery). > > > > > > > > > > > > > > So what would be the problem with powering it down unconditionally? > > > > > > > > > > > > > > > > > > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it > > > > > > will wear out the NVMe devices if used in Android tablets etc... > > > > > > > > > > I understand the wear-out concern. > > > > > > > > > > Is there anything else? > > > > > > > > > > > > > No, that's the only concern. > > > > > > OK > > > > > > I think we're getting to the bottom of the issue. > > > > > > First off, there really is no requirement to avoid putting PCI devices > > > into D3hot or D3cold during suspend-to-idle. On all modern Intel > > > platforms, PCIe devices are put into D3(hot or cold) during > > > suspend-to-idle and I don't see why this should be any different on > > > platforms from any other vendors. > > > > > > The NVMe driver is an exception and it avoids D3(hot or cold) during > > > suspend-to-idle because of problems with some hardware or platforms. > > > It might in principle allow devices to go into D3(hot or cold) during > > > suspend-to-idle, so long as it knows that this is going to work. > > > > > > > Slight correction here. > > > > NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power > > state on its own, not all the time. It has some checks [1] in the suspend path > > and if the platform/device passes one of the checks, it will power down the > > device. > > Yes, there is a comment in that code explaining what's going on and > that is basically "if key ingredients are missing or the firmware is > going to do its thing anyway, don't bother". > > > DT platforms doesn't pass any of the checks, so the NVMe driver always manages > > the power state on its own. Unfortunately, the resultant power saving is not > > enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down > > all the time. This is the first problem we are trying to solve. > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448 > > I see. > > This cannot be done by the driver itself, though, at least not in > general. The PCI layer needs to be involved and, if we are talking > about D3cold, the platform firmware needs to be involved either. > Right, but the device driver needs to have some idea about what state PCI core is going to choose for the device. I believe that's the purpose of pci_choose_state() API. More below... > As a rule, the PCI layer reaches out to the platform firmware for help > as needed and drivers don't take part in this directly. > > The NVMe driver would need to let the PCI layer take over and set the > PCI power state of the device if it wanted to get any deeper than NVMe > protocol specific power states. > > In principle, this can be addressed with some kind of a driver opt-in. > That is, the NVMe driver would continue to work the way it does, but > instead of completely preventing the PCI layer from taking over, it > would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the > platform firmware provides a mechanism to do so and (b) the DT > indicates that this mechanism should be used for the given device. > Ok, IIUC you are talking about something like this? static int nvme_suspend(struct device *dev) { ... if (pm_suspend_via_firmware() || !ctrl->npss || ... || pci_choose_state(pdev, PMSG_SUSPEND) == PCI_D3cold) return nvme_disable_prepare_reset(ndev, true); /* continue using protocol specific low power state */ ... } Here, pci_choose_state() should tell the driver if the device should enter D3Cold. ACPI already supports this API, now I need to add DT support (which is not straightforward, but doable). Since this API is already exported, I think it makes perfect sense to use it here (and other drivers for similar usecase). > > > However, there is an additional concern that putting an NVMe device > > > into D3cold every time during system suspend on Android might cause it > > > to wear out more quickly. > > > > > > > Right, this is the second problem. > > Let's set this one somewhat aside for now. We'll get back to it when > we have clarity in the above. > Ok. I believe this could be addressed in pci_choose_state() itself if required. > > > Is there anything else? > > > > We also need to consider the fact that the firmware in some platforms doesn't > > support S2R. So we cannot take a decision based on S2I/S2R alone. > > Right, the S2I/S2R thing is ACPI-specific. > > On platforms using ACPI, pm_suspend_via_firmware() means that the > firmware is at least likely to power down the whole platform, so the > PCI layer may as well be allowed to do what it wants with the device. > > > I think there are atleast a couple of ways to solve above mentioned problems: > > > > 1. Go extra mile, take account of all issues/limitations mentioned above and > > come up with an API. This API will provide a sane default suspend behavior to > > drivers (fixing first problem) and will also allow changing the behavior > > dynamically (to fix the second problem where kernel cannot distinguish Android > > vs other OS). > > I don't quite follow TBH. > > A "default suspend behavior" is there already and it is to allow the > PCI layer to set the device's power state (in collaboration with the > platform firmware). Thing is, the NVMe driver doesn't always want to > rely on it. > > > 2. Allow DT platforms to call pm_set_suspend_via_firmware() and make use of the > > existing behavior in the NVMe driver. This would only solve the first problem > > though. But would be a good start. > > That would mean just letting the PCI layer always take over on the > platforms that would call pm_set_suspend_via_firmware(). > > There is a potential issue with doing it, which is that everybody > calling pm_suspend_via_firmware() would then assume that the platform > would be powered down by the firmware. I'm not sure how much of an > issue that would be in practice, though. Yeah, that would be a concern. - Mani -- மணிவண்ணன் சதாசிவம்