On Fri, 1 Jul 2022 16:38:14 +0530
Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:

> If the PCI device is in a low power state and the device requires
> wake-up, then it can generate a PME (Power Management Event). Mostly
> these PME events will be propagated to the root port, and then the
> root port will generate the system interrupt. The OS should then
> identify the device which generated the PME and resume that device.
>
> We can implement a similar virtual PME framework: if the device has
> already gone into the runtime suspended state and there is then any
> wake-up on the host side, a virtual PME notification is sent to the
> guest. This virtual PME is helpful for the cases where the device
> will not be suspended again after a wake-up triggered by the host.
> The overall approach regarding the virtual PME is as follows:
>
> 1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX, named
> VFIO_PCI_PME_IRQ_INDEX, and make the required code changes to
> get/set this new IRQ.
>
> 2. On the guest side, the guest needs to enable an eventfd for the
> virtual PME notification.
>
> 3. In the vfio-pci driver, the PME support bits are currently
> virtualized and set to 0. We can set PME capability support for all
> the power states. This PME capability support is independent of the
> physical PME support.
>
> 4. The PME enable (PME_En bit in the Power Management Control/Status
> Register) and PME status (PME_Status bit in the Power Management
> Control/Status Register) are also virtualized currently. Write
> support for the PME_En bit can be enabled.
>
> 5. The PME_Status bit is a write-1-clear bit: a write with value 0
> has no effect, and a write with value 1 clears the bit. The write
> for this bit will be trapped inside vfio_pm_config_write(), similar
> to the PCI_PM_CTRL write for the PM states.
>
> 6. When the host gets a request for resuming the device other than
> through the low power exit feature IOCTL, the PME_Status bit will
> be set. According to [PCIe v5 7.5.2.2],
>
>   "PME_Status - This bit is Set when the Function would normally
>    generate a PME signal. The value of this bit is not affected by
>    the value of the PME_En bit."
>
> So even if the PME_En bit is not set, we can set the PME_Status bit.
>
> 7. If the guest has enabled PME_En and registered for PME events
> through an eventfd, then the usage count will be incremented to
> prevent the device from going into the suspended state, and the
> guest will be notified through the eventfd trigger.
>
> The virtual PME can also help in handling physical PME: when a
> physical PME arrives, runtime resume will be called as well, so if
> the guest has registered for virtual PME, it will be sent in this
> case too.
>
> * Implementation for handling the virtual PME on the hypervisor:
>
> Taking the implementation in the Linux OS as an example: at runtime
> suspend time it calls __pci_enable_wake(), which internally enables
> PME through pci_pme_active() and also enables the ACPI-side wake-up
> through platform_pci_set_wakeup(). To handle the PME, the hypervisor
> has the following two options:
>
> 1. Create a virtual root port for the VFIO device and trigger an
> interrupt when the PME comes. It will call pcie_pme_irq(), which
> will resume the device.
>
> 2. Create a virtual ACPI _PRW resource and associate it with the
> device itself. In _PRW, any GPE (General Purpose Event) can be
> assigned for the wake-up. When the PME comes, the GPE can be
> triggered by the hypervisor. The GPE interrupt will internally call
> pci_acpi_wake_dev(), which will resume the device.

Do we really need to implement PME emulation in the kernel, or is it
sufficient for userspace to simply register a one-shot eventfd when
SET'ing the low power feature, so that QEMU can provide the PME
emulation based on that signaling?
Thanks,
Alex

> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
>  drivers/vfio/pci/vfio_pci_core.c   | 43 ++++++++++++++++++++++++------
>  drivers/vfio/pci/vfio_pci_intrs.c  | 18 +++++++++++++
>  include/linux/vfio_pci_core.h      |  2 ++
>  include/uapi/linux/vfio.h          |  1 +
>  5 files changed, 87 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 21a4743d011f..a06375a03758 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
>  	if (count < 0)
>  		return count;
>  
> +	/*
> +	 * PME_STATUS is a write-1-clear bit. If PME_STATUS is written as 1,
> +	 * then clear the bit in vconfig. PME_STATUS is in the upper byte of
> +	 * the control register and the user can do a single byte write also.
> +	 */
> +	if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
> +		if (le32_to_cpu(val) &
> +		    (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
> +			__le16 *ctrl = (__le16 *)&vdev->vconfig
> +					[vdev->pm_cap_offset + PCI_PM_CTRL];
> +			*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
> +		}
> +	}
> +
>  	if (offset == PCI_PM_CTRL) {
>  		pci_power_t state;
>  
> @@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
>  	 * the user change power state, but we trap and initiate the
>  	 * change ourselves, so the state bits are read-only.
>  	 *
> -	 * The guest can't process PME from D3cold so virtualize PME_Status
> -	 * and PME_En bits. The vconfig bits will be cleared during device
> -	 * capability initialization.
> +	 * The guest can't process physical PME from D3cold so virtualize
> +	 * PME_Status and PME_En bits. These bits will be used for the
> +	 * virtual PME between host and guest. The vconfig bits will be
> +	 * updated during device capability initialization. PME_Status is
> +	 * a write-1-clear bit, so it is read-only here. We trap the write
> +	 * and update the vconfig bit manually.
>  	 */
>  	p_setd(perm, PCI_PM_CTRL,
>  	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
> -	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
> -		 PCI_PM_CTRL_STATE_MASK));
> +	       ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
>  
>  	return 0;
>  }
> @@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
>  	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
>  	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
>  
> -	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
> -	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
> +	/*
> +	 * Set the vconfig PME_Support bits. The PME_Status is being used for
> +	 * virtual PME support and is not dependent upon the physical
> +	 * PME support.
> +	 */
> +	*pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
> +	/* Clear vconfig PME_Status and PME_En bits */
>  	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
>  }
>  
> @@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
>  		if (ret)
>  			return ret;
>  
> -		if (cap == PCI_CAP_ID_PM)
> +		if (cap == PCI_CAP_ID_PM) {
> +			vdev->pm_cap_offset = pos;
>  			vfio_update_pm_vconfig_bytes(vdev, pos);
> +		}
>  
>  		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
>  		pos = next;
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 1ddaaa6ccef5..6c1225bc2aeb 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>  	 *   the low power state or closed the device.
>  	 * - If there is device access on the host side.
>  	 *
> -	 * For the second case, check if re-entry to the low power state is
> -	 * allowed. If not, then increment the usage count so that runtime PM
> -	 * framework won't suspend the device and set the 'pm_runtime_resumed'
> -	 * flag.
> +	 * For the second case:
> +	 * - The virtual PME_STATUS bit will be set. If the PME_ENABLE bit is
> +	 *   set and the user has registered for virtual PME events, then
> +	 *   send the virtual PME event.
> +	 * - Check if re-entry to the low power state is not allowed.
> +	 *
> +	 * For the above conditions, increment the usage count so that the
> +	 * runtime PM framework won't suspend the device, and set the
> +	 * 'pm_runtime_resumed' flag.
>  	 */
> -	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
> -		pm_runtime_get_noresume(dev);
> -		vdev->pm_runtime_resumed = true;
> +	if (vdev->pm_runtime_engaged) {
> +		bool pme_triggered = false;
> +		__le16 *ctrl = (__le16 *)&vdev->vconfig
> +				[vdev->pm_cap_offset + PCI_PM_CTRL];
> +
> +		*ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
> +		if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
> +			mutex_lock(&vdev->igate);
> +			if (vdev->pme_trigger) {
> +				pme_triggered = true;
> +				eventfd_signal(vdev->pme_trigger, 1);
> +			}
> +			mutex_unlock(&vdev->igate);
> +		}
> +
> +		if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
> +			pm_runtime_get_noresume(dev);
> +			vdev->pm_runtime_resumed = true;
> +		}
>  	}
>  	up_write(&vdev->memory_lock);
>  
> @@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>  		eventfd_ctx_put(vdev->req_trigger);
>  		vdev->req_trigger = NULL;
>  	}
> +	if (vdev->pme_trigger) {
> +		eventfd_ctx_put(vdev->pme_trigger);
> +		vdev->pme_trigger = NULL;
> +	}
>  	mutex_unlock(&vdev->igate);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> @@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_type)
>  	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
>  		if (pci_is_pcie(vdev->pdev))
>  			return 1;
> -	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
> +	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
> +		   irq_type == VFIO_PCI_PME_IRQ_INDEX) {
>  		return 1;
>  	}
>  
> @@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  		switch (info.index) {
>  		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
>  		case VFIO_PCI_REQ_IRQ_INDEX:
> +		case VFIO_PCI_PME_IRQ_INDEX:
>  			break;
>  		case VFIO_PCI_ERR_IRQ_INDEX:
>  			if (pci_is_pcie(vdev->pdev))
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 1a37db99df48..db4180687a74 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
>  					       count, flags, data);
>  }
>  
> +static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
> +				    unsigned index, unsigned start,
> +				    unsigned count, uint32_t flags, void *data)
> +{
> +	if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
> +		return -EINVAL;
> +
> +	return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
> +					       count, flags, data);
> +}
> +
>  int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>  			    unsigned index, unsigned start, unsigned count,
>  			    void *data)
> @@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>  			break;
>  		}
>  		break;
> +	case VFIO_PCI_PME_IRQ_INDEX:
> +		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
> +		case VFIO_IRQ_SET_ACTION_TRIGGER:
> +			func = vfio_pci_set_pme_trigger;
> +			break;
> +		}
> +		break;
>  	}
>  
>  	if (!func)
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 18cc83b767b8..ee2646d820c2 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -102,6 +102,7 @@ struct vfio_pci_core_device {
>  	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
>  	u8			*pci_config_map;
>  	u8			*vconfig;
> +	u8			pm_cap_offset;
>  	struct perm_bits	*msi_perm;
>  	spinlock_t		irqlock;
>  	struct mutex		igate;
> @@ -133,6 +134,7 @@ struct vfio_pci_core_device {
>  	int			ioeventfds_nr;
>  	struct eventfd_ctx	*err_trigger;
>  	struct eventfd_ctx	*req_trigger;
> +	struct eventfd_ctx	*pme_trigger;
>  	struct list_head	dummy_resources_list;
>  	struct mutex		ioeventfds_lock;
>  	struct list_head	ioeventfds_list;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 7e00de5c21ea..08170950d655 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -621,6 +621,7 @@ enum {
>  	VFIO_PCI_MSIX_IRQ_INDEX,
>  	VFIO_PCI_ERR_IRQ_INDEX,
>  	VFIO_PCI_REQ_IRQ_INDEX,
> +	VFIO_PCI_PME_IRQ_INDEX,
>  	VFIO_PCI_NUM_IRQS
>  };