On Fri, 1 Jul 2022 16:38:12 +0530
Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:

> Currently, if runtime power management is enabled for vfio-pci
> based devices in the guest OS, then the guest OS will do the register
> write for the PCI_PM_CTRL register. This write request will be handled in
> vfio_pm_config_write() where it will do the actual register write of
> the PCI_PM_CTRL register. With this, at most the D3hot state can be
> achieved for low power. If we can use the runtime PM framework, then
> we can achieve the D3cold state (on the supported systems), which will
> help in saving maximum power.
>
> 1. The D3cold state can't be achieved by writing the PCI standard
> PM config registers. This patch implements the newly added
> 'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature which
> can be used for putting the device into the D3cold state.
>
> 2. The hypervisors can implement virtual ACPI methods. For example,
> in the guest Linux OS, if the PCI device ACPI node has _PR3 and _PR0 power
> resources with _ON/_OFF methods, then the guest Linux OS invokes
> the _OFF method during the D3cold transition and then _ON during the D0
> transition. The hypervisor can trap these virtual ACPI calls and then
> call 'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with the respective flags.
>
> 3. The vfio-pci driver uses the runtime PM framework to achieve the
> D3cold state. For the D3cold transition, decrement the usage count, and
> for the D0 transition, increment the usage count.
>
> 4. If the D3cold state is not supported, then the device will
> still be in the D3hot state. But with runtime PM, the root port
> can now also go into the suspended state.
>
> 5. The 'pm_runtime_engaged' flag tracks entry into and exit from
> runtime PM. This flag is protected with the 'memory_lock' semaphore.
>
> 6. At exit time, the flag clearing and usage count increment
> are protected with 'memory_lock'. The actual wake-up happens
> outside 'memory_lock', since 'memory_lock' will also be needed inside the
> runtime_resume callback in subsequent patches.
>
> 7. In D3cold, all kinds of device-related access (BAR read/write,
> config read/write, etc.) need to be disabled. For BAR-related access,
> we can use the existing D3hot memory disable support. During the low power
> entry, invalidate the mmap mappings and add the check for the
> 'pm_runtime_engaged' flag.

Not disabled, just wrapped in pm-get/put. If the device is
indefinitely in low power without a wake-up eventfd, mmap faults are
fatal to the user.

> 8. For config space, ideally we need to return an error whenever
> there is any config access from the user side once the user has moved the
> device into the low power state. But adding a check for the
> 'pm_runtime_engaged' flag alone won't be sufficient due to the
> following possible scenario on the user side, where config space
> access happens in parallel with the low power entry IOCTL:
>
> a. Config space access happens and vfio_pci_config_rw() is called.
> b. The IOCTL to move into the low power state is called.
> c. The IOCTL moves the device into D3cold.
> d. vfio_pci_config_rw() returns.
>
> Now, if we just check 'pm_runtime_engaged', then in the above
> sequence the config space access happens while the device is already
> in the low power state. To prevent this situation, we increment the
> usage count before any config space access and decrement it
> after completing the access. Also, to prevent similar cases for
> other types of access, the usage count will be incremented for all
> kinds of access.

Unnecessary, just wrap in pm-get/put.

Thanks,
Alex

> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
> ---
>  drivers/vfio/pci/vfio_pci_config.c |   2 +-
>  drivers/vfio/pci/vfio_pci_core.c   | 169 +++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h      |   1 +
>  3 files changed, 164 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 9343f597182d..21a4743d011f 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
>  	 * PF SR-IOV capability, there's therefore no need to trigger
>  	 * faults based on the virtual value.
>  	 */
> -	return pdev->current_state < PCI_D3hot &&
> +	return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot &&
>  	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
>  }
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 5948d930449b..8c17ca41d156 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>
> +	down_write(&vdev->memory_lock);
> +	/*
> +	 * The user can move the device into D3hot state before invoking
> +	 * power management IOCTL. Move the device into D0 state here and then
> +	 * the pci-driver core runtime PM suspend function will move the device
> +	 * into the low power state. Also, for the devices which have
> +	 * NoSoftRst-, it will help in restoring the original state
> +	 * (saved locally in 'vdev->pm_save').
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	up_write(&vdev->memory_lock);
> +
>  	/*
>  	 * If INTx is enabled, then mask INTx before going into the runtime
>  	 * suspended state and unmask the same in the runtime resume.
> @@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	struct pci_dev *pdev = vdev->pdev;
>  	struct vfio_pci_dummy_resource *dummy_res, *tmp;
>  	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
> +	bool do_resume = false;
>  	int i, bar;
>
>  	/* For needs_reset */
> @@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>
>  	/*
>  	 * This function can be invoked while the power state is non-D0.
> +	 * This non-D0 power state can be with or without runtime PM.
> +	 * Increment the usage count corresponding to pm_runtime_put()
> +	 * called during setting of 'pm_runtime_engaged'. The device will
> +	 * wake up if it has already gone into the suspended state.
> +	 * Otherwise, the next vfio_pci_set_power_state() will change the
> +	 * device power state to D0.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_runtime_engaged) {
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		do_resume = true;
> +	}
> +	up_write(&vdev->memory_lock);
> +
> +	if (do_resume)
> +		pm_runtime_resume(&pdev->dev);
> +
> +	/*
>  	 * This function calls __pci_reset_function_locked() which internally
>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>  	 * fail if the power state is non-D0. Also, for the devices which
> @@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>
> +static int vfio_pci_pm_validate_flags(u32 flags)
> +{
> +	if (!flags)
> +		return -EINVAL;
> +	/* Only valid flags should be set */
> +	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +	/* Both enter and exit should not be set */
> +	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
> +	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> +				    void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
> +	int ret = 0;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET |
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(vfio_pm));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> +		down_read(&vdev->memory_lock);
> +		if (vdev->pm_runtime_engaged)
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> +		else
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> +		up_read(&vdev->memory_lock);
> +
> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> +			return -EFAULT;
> +
> +		return 0;
> +	}
> +
> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
> +		return -EFAULT;
> +
> +	ret = vfio_pci_pm_validate_flags(vfio_pm.flags);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) {
> +		vfio_pci_zap_and_down_write_memory_lock(vdev);
> +		if (vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = true;
> +		up_write(&vdev->memory_lock);
> +		pm_runtime_put(&pdev->dev);
> +	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> +		down_write(&vdev->memory_lock);
> +		if (!vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		up_write(&vdev->memory_lock);
> +		ret = pm_runtime_resume(&pdev->dev);
> +		if (ret < 0) {
> +			down_write(&vdev->memory_lock);
> +			if (!vdev->pm_runtime_engaged) {
> +				vdev->pm_runtime_engaged = true;
> +				pm_runtime_put_noidle(&pdev->dev);
> +			}
> +			up_write(&vdev->memory_lock);
> +			return ret;
> +		}
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  				       void __user *arg, size_t argsz)
>  {
> @@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret;
>
>  	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  		return -EINVAL;
>
> +	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
> +	if (ret < 0) {
> +		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
>  	switch (index) {
>  	case VFIO_PCI_CONFIG_REGION_INDEX:
> -		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>
>  	case VFIO_PCI_ROM_REGION_INDEX:
>  		if (iswrite)
> -			return -EINVAL;
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +			ret = -EINVAL;
> +		else
> +			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +		break;
>
>  	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>
>  	case VFIO_PCI_VGA_REGION_INDEX:
> -		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		break;
> +
>  	default:
>  		index -= VFIO_PCI_NUM_REGIONS;
> -		return vdev->region[index].ops->rw(vdev, buf,
> +		ret = vdev->region[index].ops->rw(vdev, buf,
>  						   count, ppos, iswrite);
> +		break;
>  	}
>
> -	return -EINVAL;
> +	pm_runtime_put(&vdev->pdev->dev);
> +	return ret;
>  }
>
>  ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> @@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		goto err_unlock;
>  	}
>
> +	/*
> +	 * Some of the devices in the dev_set can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
> +	 */
> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> +	if (ret)
> +		goto err_unlock;
> +
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
>  		 * Test whether all the affected devices are contained by the
> @@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		else
>  			mutex_unlock(&cur->vma_lock);
>  	}
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		pm_runtime_put(&cur->pdev->dev);
>  err_unlock:
>  	mutex_unlock(&dev_set->lock);
>  	return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index cdfd328ba6b1..bf4823b008f9 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -125,6 +125,7 @@ struct vfio_pci_core_device {
>  	bool			nointx;
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
> +	bool			pm_runtime_engaged;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;
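
A rough userspace-side sketch may help illustrate how a VMM would drive
the feature described above when it traps the guest's virtual ACPI
_OFF/_ON methods. The struct vfio_device_feature header and the
VFIO_DEVICE_FEATURE ioctl are existing VFIO UAPI;
VFIO_DEVICE_FEATURE_POWER_MANAGEMENT, VFIO_PM_LOW_POWER_ENTER/EXIT and
struct vfio_device_feature_power_management are assumed to come from this
series' uapi additions (not shown in the quoted hunks), so the snippet
builds only against headers that carry that patch.

#include <stdbool.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Enter (enter == true) or exit low power on an open VFIO device fd. */
static int vfio_pm_set_low_power(int device_fd, bool enter)
{
	/* Feature header immediately followed by the feature payload. */
	char buf[sizeof(struct vfio_device_feature) +
		 sizeof(struct vfio_device_feature_power_management)]
		__attribute__((aligned(8)));
	struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
	struct vfio_device_feature_power_management *pm =
		(struct vfio_device_feature_power_management *)feature->data;

	memset(buf, 0, sizeof(buf));
	feature->argsz = sizeof(buf);
	feature->flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_POWER_MANAGEMENT;
	pm->flags = enter ? VFIO_PM_LOW_POWER_ENTER : VFIO_PM_LOW_POWER_EXIT;

	/* Returns 0 on success, -1 with errno set on failure. */
	return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
}

On the kernel side this pairs with the pm_runtime_put() /
pm_runtime_get_noresume() calls in vfio_pci_core_feature_pm() above, while
the review comments suggest that every other access path (config, BAR,
mmap fault) is simply bracketed with pm_runtime_resume_and_get() /
pm_runtime_put(), as the vfio_pci_rw() hunk already does.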