On Mon, 25 Apr 2022 14:56:12 +0530 Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > Currently, there is very limited power management support > available in the upstream vfio_pci_core based drivers. If there > are no users of the device, then the PCI device will be moved into > D3hot state by writing directly into PCI PM registers. This D3hot > state help in saving power but we can achieve zero power consumption > if we go into the D3cold state. The D3cold state cannot be possible > with native PCI PM. It requires interaction with platform firmware > which is system-specific. To go into low power states (including D3cold), > the runtime PM framework can be used which internally interacts with PCI > and platform firmware and puts the device into the lowest possible > D-States. > > This patch registers vfio_pci_core based drivers with the > runtime PM framework. > > 1. The PCI core framework takes care of most of the runtime PM > related things. For enabling the runtime PM, the PCI driver needs to > decrement the usage count and needs to provide 'struct dev_pm_ops' > at least. The runtime suspend/resume callbacks are optional and needed > only if we need to do any extra handling. Now there are multiple > vfio_pci_core based drivers. Instead of assigning the > 'struct dev_pm_ops' in individual parent driver, the vfio_pci_core > itself assigns the 'struct dev_pm_ops'. There are other drivers where > the 'struct dev_pm_ops' is being assigned inside core layer > (For example, wlcore_probe() and some sound based driver, etc.). > > 2. This patch provides the stub implementation of 'struct dev_pm_ops'. > The subsequent patch will provide the runtime suspend/resume > callbacks. All the config state saving, and PCI power management > related things will be done by PCI core framework itself inside its > runtime suspend/resume callbacks (pci_pm_runtime_suspend() and > pci_pm_runtime_resume()). > > 3. Inside pci_reset_bus(), all the devices in dev_set needs to be > runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take > care of the runtime resume and its error handling. > > 4. Inside vfio_pci_core_disable(), the device usage count always needs > to be decremented which was incremented in vfio_pci_core_enable(). > > 5. Since the runtime PM framework will provide the same functionality, > so directly writing into PCI PM config register can be replaced with > the use of runtime PM routines. Also, the use of runtime PM can help > us in more power saving. > > In the systems which do not support D3cold, > > With the existing implementation: > > // PCI device > # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state > D3hot > // upstream bridge > # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state > D0 > > With runtime PM: > > // PCI device > # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state > D3hot > // upstream bridge > # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state > D3hot > > So, with runtime PM, the upstream bridge or root port will also go > into lower power state which is not possible with existing > implementation. > > In the systems which support D3cold, > > // PCI device > # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state > D3hot > // upstream bridge > # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state > D0 > > With runtime PM: > > // PCI device > # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state > D3cold > // upstream bridge > # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state > D3cold > > So, with runtime PM, both the PCI device and upstream bridge will > go into D3cold state. > > 6. If 'disable_idle_d3' module parameter is set, then also the runtime > PM will be enabled, but in this case, the usage count should not be > decremented. > > 7. vfio_pci_dev_set_try_reset() return value is unused now, so this > function return type can be changed to void. > > 8. Use the runtime PM API's in vfio_pci_core_sriov_configure(). > For preventing any runtime usage mismatch, pci_num_vf() has been > called explicitly during disable. > > Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx> > --- > drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++---------- > 1 file changed, 114 insertions(+), 55 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index 953ac33b2f5f..aee5e0cd6137 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev) > } > > struct vfio_pci_group_info; > -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); > +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); > static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > struct vfio_pci_group_info *groups); > > @@ -261,6 +261,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat > return ret; > } > > +#ifdef CONFIG_PM > +/* > + * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, > + * so use structure without any callbacks. > + * > + * The pci-driver core runtime PM routines always save the device state > + * before going into suspended state. If the device is going into low power > + * state with only with runtime PM ops, then no explicit handling is needed > + * for the devices which have NoSoftRst-. > + */ > +static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; > +#endif > + > int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > { > struct pci_dev *pdev = vdev->pdev; > @@ -268,21 +281,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > u16 cmd; > u8 msix_pos; > > - vfio_pci_set_power_state(vdev, PCI_D0); > + if (!disable_idle_d3) { > + ret = pm_runtime_resume_and_get(&pdev->dev); > + if (ret < 0) > + return ret; > + } > > /* Don't allow our initial saved state to include busmaster */ > pci_clear_master(pdev); > > ret = pci_enable_device(pdev); > if (ret) > - return ret; > + goto out_power; > > /* If reset fails because of the device lock, fail this path entirely */ > ret = pci_try_reset_function(pdev); > - if (ret == -EAGAIN) { > - pci_disable_device(pdev); > - return ret; > - } > + if (ret == -EAGAIN) > + goto out_disable_device; > > vdev->reset_works = !ret; > pci_save_state(pdev); > @@ -306,12 +321,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > } > > ret = vfio_config_init(vdev); > - if (ret) { > - kfree(vdev->pci_saved_state); > - vdev->pci_saved_state = NULL; > - pci_disable_device(pdev); > - return ret; > - } > + if (ret) > + goto out_free_state; > > msix_pos = pdev->msix_cap; > if (msix_pos) { > @@ -332,6 +343,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > > > return 0; > + > +out_free_state: > + kfree(vdev->pci_saved_state); > + vdev->pci_saved_state = NULL; > +out_disable_device: > + pci_disable_device(pdev); > +out_power: > + if (!disable_idle_d3) > + pm_runtime_put(&pdev->dev); > + return ret; > } > EXPORT_SYMBOL_GPL(vfio_pci_core_enable); > > @@ -439,8 +460,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) > out: > pci_disable_device(pdev); > > - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3) > - vfio_pci_set_power_state(vdev, PCI_D3hot); > + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set); > + > + /* Put the pm-runtime usage counter acquired during enable */ > + if (!disable_idle_d3) > + pm_runtime_put(&pdev->dev); > } > EXPORT_SYMBOL_GPL(vfio_pci_core_disable); > > @@ -1879,19 +1903,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, > > vfio_pci_probe_power_state(vdev); > > - if (!disable_idle_d3) { > - /* > - * pci-core sets the device power state to an unknown value at > - * bootup and after being removed from a driver. The only > - * transition it allows from this unknown state is to D0, which > - * typically happens when a driver calls pci_enable_device(). > - * We're not ready to enable the device yet, but we do want to > - * be able to get to D3. Therefore first do a D0 transition > - * before going to D3. > - */ > - vfio_pci_set_power_state(vdev, PCI_D0); > - vfio_pci_set_power_state(vdev, PCI_D3hot); > - } > + /* > + * pci-core sets the device power state to an unknown value at > + * bootup and after being removed from a driver. The only > + * transition it allows from this unknown state is to D0, which > + * typically happens when a driver calls pci_enable_device(). > + * We're not ready to enable the device yet, but we do want to > + * be able to get to D3. Therefore first do a D0 transition > + * before enabling runtime PM. > + */ > + vfio_pci_set_power_state(vdev, PCI_D0); > + > +#if defined(CONFIG_PM) > + dev->driver->pm = &vfio_pci_core_pm_ops, > +#endif > + > + pm_runtime_allow(dev); > + if (!disable_idle_d3) > + pm_runtime_put(dev); > > ret = vfio_register_group_dev(&vdev->vdev); > if (ret) > @@ -1900,7 +1929,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, > > out_power: > if (!disable_idle_d3) > - vfio_pci_set_power_state(vdev, PCI_D0); > + pm_runtime_get_noresume(dev); > + > + pm_runtime_forbid(dev); > out_vf: > vfio_pci_vf_uninit(vdev); > out_drvdata: > @@ -1922,8 +1953,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev) > vfio_pci_vga_uninit(vdev); > > if (!disable_idle_d3) > - vfio_pci_set_power_state(vdev, PCI_D0); > + pm_runtime_get_noresume(dev); > > + pm_runtime_forbid(dev); > dev_set_drvdata(dev, NULL); > } > EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device); > @@ -1984,18 +2016,26 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn) > > /* > * The PF power state should always be higher than the VF power > - * state. If PF is in the low power state, then change the > - * power state to D0 first before enabling SR-IOV. > + * state. If PF is in the runtime suspended state, then resume > + * it first before enabling SR-IOV. > */ > - vfio_pci_set_power_state(vdev, PCI_D0); > - ret = pci_enable_sriov(pdev, nr_virtfn); > + ret = pm_runtime_resume_and_get(&pdev->dev); > if (ret) > goto out_del; > + > + ret = pci_enable_sriov(pdev, nr_virtfn); > + if (ret) { > + pm_runtime_put(&pdev->dev); > + goto out_del; > + } > ret = nr_virtfn; > goto out_put; > } > > - pci_disable_sriov(pdev); > + if (pci_num_vf(pdev)) { > + pci_disable_sriov(pdev); > + pm_runtime_put(&pdev->dev); > + } > > out_del: > mutex_lock(&vfio_pci_sriov_pfs_mutex); > @@ -2072,6 +2112,30 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set) > return pdev; > } > > +static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set) > +{ > + struct vfio_pci_core_device *cur_pm; > + struct vfio_pci_core_device *cur; > + int ret = 0; > + > + list_for_each_entry(cur_pm, &dev_set->device_list, vdev.dev_set_list) { > + ret = pm_runtime_resume_and_get(&cur_pm->pdev->dev); > + if (ret < 0) > + break; > + } > + > + if (!ret) > + return 0; > + > + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { > + if (cur == cur_pm) > + break; > + pm_runtime_put(&cur->pdev->dev); > + } > + > + return ret; > +} The above works, but maybe could be a little cleaner taking advantage of list_for_each_entry_continue_reverse as: { struct vfio_pci_core_device *cur; int ret; list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { ret = pm_runtime_resume_and_get(&cur->pdev->dev); if (ret) goto unwind; } return 0; unwind: list_for_each_entry_continue_reverse(cur, &dev_set->device_list, vdev.dev_set_list) pm_runtime_put(&cur->pdev->dev); return ret; } Thanks, Alex > + > /* > * We need to get memory_lock for each device, but devices can share mmap_lock, > * therefore we need to zap and hold the vma_lock for each device, and only then > @@ -2178,43 +2242,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set) > * - At least one of the affected devices is marked dirty via > * needs_reset (such as by lack of FLR support) > * Then attempt to perform that bus or slot reset. > - * Returns true if the dev_set was reset. > */ > -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) > +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) > { > struct vfio_pci_core_device *cur; > struct pci_dev *pdev; > - int ret; > + bool reset_done = false; > > if (!vfio_pci_dev_set_needs_reset(dev_set)) > - return false; > + return; > > pdev = vfio_pci_dev_set_resettable(dev_set); > if (!pdev) > - return false; > + return; > > /* > - * The pci_reset_bus() will reset all the devices in the bus. > - * The power state can be non-D0 for some of the devices in the bus. > - * For these devices, the pci_reset_bus() will internally set > - * the power state to D0 without vfio driver involvement. > - * For the devices which have NoSoftRst-, the reset function can > - * cause the PCI config space reset without restoring the original > - * state (saved locally in 'vdev->pm_save'). > + * Some of the devices in the bus can be in the runtime suspended > + * state. Increment the usage count for all the devices in the dev_set > + * before reset and decrement the same after reset. > */ > - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) > - vfio_pci_set_power_state(cur, PCI_D0); > + if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set)) > + return; > > - ret = pci_reset_bus(pdev); > - if (ret) > - return false; > + if (!pci_reset_bus(pdev)) > + reset_done = true; > > list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { > - cur->needs_reset = false; > + if (reset_done) > + cur->needs_reset = false; > + > if (!disable_idle_d3) > - vfio_pci_set_power_state(cur, PCI_D3hot); > + pm_runtime_put(&cur->pdev->dev); > } > - return true; > } > > void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,