From: Abhishek Sahu <abhsahu@xxxxxxxxxx> Currently, if the runtime power management is enabled for vfio-pci device in guest OS, then guest OS will do the register write for PCI_PM_CTRL register. This write request will be handled in vfio_pm_config_write() where it will do the actual register write of PCI_PM_CTRL register. With this, the maximum D3hot state can be achieved for low power. If we can use the runtime PM framework, then we can achieve the D3cold state which will help in saving maximum power. This patch uses runtime PM framework whenever vfio-pci device will be put in the low power state. 1. If runtime PM is enabled, then instead of directly writing PCI_PM_CTRL register, decrement the device usage counter whenever the power state is non-D0. The kernel runtime PM framework will itself put the PCI device in low power state when device usage counter will become zero. Similarly, when the power state will be again changed back to D0, then increment the device usage counter and the kernel runtime PM framework will itself bring the PCI device out of low power state. 2. The guest OS will read the PCI_PM_CTRL register back to confirm the current power state so virtual register bits can be used. For this, before decrementing the usage count, read the actual PCI_PM_CTRL register, update the power state related bits, and then update the vconfig bits corresponding to PCI_PM_CTRL offset. For PCI_PM_CTRL register read, return the virtual value if runtime PM is requested. This vconfig bits will be cleared when the power state will be changed back to D0. 3. For the guest OS, the PCI power state will be still D3hot even if put the actual PCI device into D3cold state. In the D3hot state, the config space can still be read. So, if there is any request from guest OS to read the config space, then we need to move the actual PCI device again back to D0. For this, increment the device usage count before reading/writing the config space and then decrement it again after reading/writing the config space. There can be back-to-back config register read/write request, but since the auto suspend methods are being used here so only first access will wake up the PCI device and for the remaining access, the device will already be active. 4. This above-mentioned wake up is not needed if the register read/write is done completely with virtual bits. For handling this condition, the actual resume of device will only being done in vfio_user_config_read()/vfio_user_config_write(). All the config register access will come vfio_pci_config_rw(). So, in this function, use the pm_runtime_get_noresume() so that only usage count will be incremented without resuming the device. Inside, vfio_user_config_read()/vfio_user_config_write(), use the routines with pm_runtime_put_noidle() so that the PCI device won’t be suspended in the lower level functions. Again in the top level vfio_pci_config_rw(), use the pm_runtime_put_autosuspend(). Now the auto suspend timer will be started and the device will be suspended again. If the device is already runtime suspended, then this routine will return early. 5. In the host side D3cold will only be used if the platform has support for this, otherwise some other state will be used. The config space can be read if the device is not in D3cold state. So in this case, we can skip the resuming of PCI device. The wrapper function vfio_pci_config_pm_runtime_get() takes care of this condition and invoke the pm_runtime_resume() only if current power state is D3cold. 6. For vfio_pci_config_pm_runtime_get()/vfio_ pci_config_pm_runtime_put(), the reference code is taken from pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then it is modified according to vfio-pci driver requirement. 7. vfio_pci_set_power_state() will be unused after moving to runtime PM, so this function can be removed along with other related functions and structure fields. Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx> --- drivers/vfio/pci/vfio_pci_config.c | 178 ++++++++++++++++++++++++++--- drivers/vfio/pci/vfio_pci_core.c | 64 ++--------- include/linux/vfio_pci_core.h | 3 +- 3 files changed, 174 insertions(+), 71 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index fb3a503a5b99..5576eb4308b4 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -25,6 +25,7 @@ #include <linux/uaccess.h> #include <linux/vfio.h> #include <linux/slab.h> +#include <linux/pm_runtime.h> #include <linux/vfio_pci_core.h> @@ -119,12 +120,51 @@ struct perm_bits { #define NO_WRITE 0 #define ALL_WRITE 0xFFFFFFFFU -static int vfio_user_config_read(struct pci_dev *pdev, int offset, +static void vfio_pci_config_pm_runtime_get(struct pci_dev *pdev) +{ + struct device *dev = &pdev->dev; + struct device *parent = dev->parent; + + if (parent) + pm_runtime_get_sync(parent); + + pm_runtime_get_noresume(dev); + /* + * pdev->current_state is set to PCI_D3cold during suspending, + * so wait until suspending completes + */ + pm_runtime_barrier(dev); + /* + * Only need to resume devices in D3cold, because config + * registers are still accessible for devices suspended but + * not in D3cold. + */ + if (pdev->current_state == PCI_D3cold) + pm_runtime_resume(dev); +} + +static void vfio_pci_config_pm_runtime_put(struct pci_dev *pdev) +{ + struct device *dev = &pdev->dev; + struct device *parent = dev->parent; + + pm_runtime_mark_last_busy(dev); + pm_runtime_put_noidle(dev); + + if (parent) + pm_runtime_put_noidle(parent); +} + +static int vfio_user_config_read(struct vfio_pci_core_device *vdev, int offset, __le32 *val, int count) { + struct pci_dev *pdev = vdev->pdev; int ret = -EINVAL; u32 tmp_val = 0; + if (vdev->pm_runtime_suspended) + vfio_pci_config_pm_runtime_get(pdev); + switch (count) { case 1: { @@ -147,15 +187,22 @@ static int vfio_user_config_read(struct pci_dev *pdev, int offset, *val = cpu_to_le32(tmp_val); + if (vdev->pm_runtime_suspended) + vfio_pci_config_pm_runtime_put(pdev); + return ret; } -static int vfio_user_config_write(struct pci_dev *pdev, int offset, +static int vfio_user_config_write(struct vfio_pci_core_device *vdev, int offset, __le32 val, int count) { + struct pci_dev *pdev = vdev->pdev; int ret = -EINVAL; u32 tmp_val = le32_to_cpu(val); + if (vdev->pm_runtime_suspended) + vfio_pci_config_pm_runtime_get(pdev); + switch (count) { case 1: ret = pci_user_write_config_byte(pdev, offset, tmp_val); @@ -168,6 +215,9 @@ static int vfio_user_config_write(struct pci_dev *pdev, int offset, break; } + if (vdev->pm_runtime_suspended) + vfio_pci_config_pm_runtime_put(pdev); + return ret; } @@ -183,11 +233,10 @@ static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos, /* Any non-virtualized bits? */ if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) { - struct pci_dev *pdev = vdev->pdev; __le32 phys_val = 0; int ret; - ret = vfio_user_config_read(pdev, pos, &phys_val, count); + ret = vfio_user_config_read(vdev, pos, &phys_val, count); if (ret) return ret; @@ -224,18 +273,17 @@ static int vfio_default_config_write(struct vfio_pci_core_device *vdev, int pos, /* Non-virtualzed and writable bits go to hardware */ if (write & ~virt) { - struct pci_dev *pdev = vdev->pdev; __le32 phys_val = 0; int ret; - ret = vfio_user_config_read(pdev, pos, &phys_val, count); + ret = vfio_user_config_read(vdev, pos, &phys_val, count); if (ret) return ret; phys_val &= ~(write & ~virt); phys_val |= (val & (write & ~virt)); - ret = vfio_user_config_write(pdev, pos, phys_val, count); + ret = vfio_user_config_write(vdev, pos, phys_val, count); if (ret) return ret; } @@ -250,7 +298,7 @@ static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos, { int ret; - ret = vfio_user_config_read(vdev->pdev, pos, val, count); + ret = vfio_user_config_read(vdev, pos, val, count); if (ret) return ret; @@ -275,7 +323,7 @@ static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos, { int ret; - ret = vfio_user_config_write(vdev->pdev, pos, val, count); + ret = vfio_user_config_write(vdev, pos, val, count); if (ret) return ret; @@ -288,7 +336,7 @@ static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos, { int ret; - ret = vfio_user_config_read(vdev->pdev, pos, val, count); + ret = vfio_user_config_read(vdev, pos, val, count); if (ret) return ret; @@ -692,13 +740,86 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm) return 0; } +static int vfio_perform_runtime_pm(struct vfio_pci_core_device *vdev, int pos, + int count, struct perm_bits *perm, + int offset, __le32 val, pci_power_t state) +{ + /* + * If runtime PM is enabled, then instead of directly writing + * PCI_PM_CTRL register, decrement the device usage counter whenever + * the power state is non-D0. The kernel runtime PM framework will + * itself put the PCI device in the low power state when device usage + * counter will become zero. The guest OS will read the PCI_PM_CTRL + * register back to confirm the current power state so virtual + * register bits can be used. For this, read the actual PCI_PM_CTRL + * register, update the power state related bits and then update the + * vconfig bits corresponding to PCI_PM_CTRL offset. If the + * pm_runtime_suspended status is set, then return the virtual + * register value for PCI_PM_CTRL register read. All the bits + * in PCI_PM_CTRL are being returned from virtual config, so that + * this register read will not wake up the PCI device from suspended + * state. + * + * Once power state will be changed back to D0, then clear the power + * state related bits in vconfig. After this, increment the device + * usage counter which will internally wake up the PCI device from + * suspended state. + */ + if (state != PCI_D0 && !vdev->pm_runtime_suspended) { + __le32 virt_val = 0; + + count = vfio_default_config_write(vdev, pos, count, perm, + offset, val); + if (count < 0) + return count; + + vfio_default_config_read(vdev, pos, 4, perm, offset, &virt_val); + virt_val &= ~cpu_to_le32(PCI_PM_CTRL_STATE_MASK); + virt_val |= (val & cpu_to_le32(PCI_PM_CTRL_STATE_MASK)); + memcpy(vdev->vconfig + pos, &virt_val, 4); + vdev->pm_runtime_suspended = true; + pm_runtime_mark_last_busy(&vdev->pdev->dev); + pm_runtime_put_autosuspend(&vdev->pdev->dev); + return count; + } + + if (vdev->pm_runtime_suspended && state == PCI_D0) { + vdev->pm_runtime_suspended = false; + *(__le16 *)&vdev->vconfig[pos] &= + ~cpu_to_le16(PCI_PM_CTRL_STATE_MASK); + pm_runtime_get_sync(&vdev->pdev->dev); + } + + return vfio_default_config_write(vdev, pos, count, perm, offset, val); +} + +static int vfio_pm_config_read(struct vfio_pci_core_device *vdev, int pos, + int count, struct perm_bits *perm, + int offset, __le32 *val) +{ + /* + * If pm_runtime_suspended status is set, then return the virtual + * register value for PCI_PM_CTRL register read. + */ + if (vdev->pm_runtime_suspended && + offset >= PCI_PM_CTRL && offset < (PCI_PM_CTRL + 4)) { + memcpy(val, vdev->vconfig + pos, count); + return count; + } + + return vfio_default_config_read(vdev, pos, count, perm, offset, val); +} + static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 val) { - count = vfio_default_config_write(vdev, pos, count, perm, offset, val); - if (count < 0) - return count; + if (offset != PCI_PM_CTRL) { + count = vfio_default_config_write(vdev, pos, count, perm, + offset, val); + if (count < 0) + return count; + } if (offset == PCI_PM_CTRL) { pci_power_t state; @@ -718,7 +839,8 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos, break; } - vfio_pci_set_power_state(vdev, state); + count = vfio_perform_runtime_pm(vdev, pos, count, perm, + offset, val, state); } return count; @@ -731,6 +853,7 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm) return -ENOMEM; perm->writefn = vfio_pm_config_write; + perm->readfn = vfio_pm_config_read; /* * We always virtualize the next field so we can remove @@ -1921,13 +2044,31 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf, size_t done = 0; int ret = 0; loff_t pos = *ppos; + bool runtime_put_required = false; pos &= VFIO_PCI_OFFSET_MASK; + /* + * For virtualized bits read/write, the device should not be resumed. + * Increment the device usage count alone so that the device won't be + * runtime suspended during config read/write. + */ + if (vdev->pm_runtime_suspended) { + pm_runtime_get_noresume(&vdev->pdev->dev); + runtime_put_required = true; + } + while (count) { ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite); - if (ret < 0) + if (ret < 0) { + /* + * Decrement the device usage counter corresponding to + * previous pm_runtime_get_noresume(). + */ + if (runtime_put_required) + pm_runtime_put_autosuspend(&vdev->pdev->dev); return ret; + } count -= ret; done += ret; @@ -1937,5 +2078,12 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf, *ppos += done; + /* + * Decrement the device usage counter corresponding to previous + * pm_runtime_get_noresume(). + */ + if (runtime_put_required) + pm_runtime_put_autosuspend(&vdev->pdev->dev); + return done; } diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 511a52e92b32..79fa86914b6c 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -187,57 +187,6 @@ static bool vfio_pci_nointx(struct pci_dev *pdev) return false; } -static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev) -{ - struct pci_dev *pdev = vdev->pdev; - u16 pmcsr; - - if (!pdev->pm_cap) - return; - - pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr); - - vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET); -} - -/* - * pci_set_power_state() wrapper handling devices which perform a soft reset on - * D3->D0 transition. Save state prior to D0/1/2->D3, stash it on the vdev, - * restore when returned to D0. Saved separately from pci_saved_state for use - * by PM capability emulation and separately from pci_dev internal saved state - * to avoid it being overwritten and consumed around other resets. - */ -int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state) -{ - struct pci_dev *pdev = vdev->pdev; - bool needs_restore = false, needs_save = false; - int ret; - - if (vdev->needs_pm_restore) { - if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) { - pci_save_state(pdev); - needs_save = true; - } - - if (pdev->current_state >= PCI_D3hot && state <= PCI_D0) - needs_restore = true; - } - - ret = pci_set_power_state(pdev, state); - - if (!ret) { - /* D3 might be unsupported via quirk, skip unless in D3 */ - if (needs_save && pdev->current_state >= PCI_D3hot) { - vdev->pm_save = pci_store_saved_state(pdev); - } else if (needs_restore) { - pci_load_and_free_saved_state(pdev, &vdev->pm_save); - pci_restore_state(pdev); - } - } - - return ret; -} - int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) { struct pci_dev *pdev = vdev->pdev; @@ -323,6 +272,16 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) /* For needs_reset */ lockdep_assert_held(&vdev->vdev.dev_set->lock); + /* + * The vfio device user can close the device after putting the device + * into runtime suspended state so wake up the device first in + * this case. + */ + if (vdev->pm_runtime_suspended) { + vdev->pm_runtime_suspended = false; + pm_runtime_get_sync(&pdev->dev); + } + /* Stop the device from further DMA */ pci_clear_master(pdev); @@ -1809,7 +1768,6 @@ void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev) mutex_destroy(&vdev->vma_lock); vfio_uninit_group_dev(&vdev->vdev); kfree(vdev->region); - kfree(vdev->pm_save); } EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device); @@ -1855,8 +1813,6 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev) if (ret) goto out_vf; - vfio_pci_probe_power_state(vdev); - /* * pci-core sets the device power state to an unknown value at * bootup and after being removed from a driver. The only diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index aafe09c9fa64..2b1a556ce73f 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -123,9 +123,8 @@ struct vfio_pci_core_device { bool has_vga; bool needs_reset; bool nointx; - bool needs_pm_restore; + bool pm_runtime_suspended; struct pci_saved_state *pci_saved_state; - struct pci_saved_state *pm_save; int ioeventfds_nr; struct eventfd_ctx *err_trigger; struct eventfd_ctx *req_trigger; -- 2.17.1