On 7/6/2022 9:10 PM, Alex Williamson wrote: > On Fri, 1 Jul 2022 16:38:14 +0530 > Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > >> If the PCI device is in low power state and the device requires >> wake-up, then it can generate PME (Power Management Events). Mostly >> these PME events will be propagated to the root port and then the >> root port will generate the system interrupt. Then the OS should >> identify the device which generated the PME and should resume >> the device. >> >> We can implement a similar virtual PME framework where if the device >> already went into the runtime suspended state and then there is any >> wake-up on the host side, then it will send the virtual PME >> notification to the guest. This virtual PME will be helpful for the cases >> where the device will not be suspended again if there is any wake-up >> triggered by the host. Following is the overall approach regarding >> the virtual PME. >> >> 1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named >> VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set >> this new IRQ. >> >> 2. From the guest side, the guest needs to enable eventfd for the >> virtual PME notification. >> >> 3. In the vfio-pci driver, the PME support bits are currently >> virtualized and set to 0. We can set PME capability support for all >> the power states. This PME capability support is independent of the >> physical PME support. >> >> 4. The PME enable (PME_En bit in Power Management Control/Status >> Register) and PME status (PME_Status bit in Power Management >> Control/Status Register) are also virtualized currently. >> The write support for PME_En bit can be enabled. >> >> 5. The PME_Status bit is a write-1-clear bit where the write with >> zero value will have no effect and write with 1 value will clear the >> bit. The write for this bit will be trapped inside >> vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES. >> >> 6. When the host gets a request for resuming the device other than from >> low power exit feature IOCTL, then PME_Status bit will be set. >> According to [PCIe v5 7.5.2.2], >> "PME_Status - This bit is Set when the Function would normally >> generate a PME signal. The value of this bit is not affected by >> the value of the PME_En bit." >> >> So even if PME_En bit is not set, we can set PME_Status bit. >> >> 7. If the guest has enabled PME_En and registered for PME events >> through eventfd, then the usage count will be incremented to prevent >> the device to go into the suspended state and notify the guest through >> eventfd trigger. >> >> The virtual PME can help in handling physical PME also. When >> physical PME comes, then also the runtime resume will be called. If >> the guest has registered for virtual PME, then it will be sent in this >> case also. >> >> * Implementation for handling the virtual PME on the hypervisor: >> >> If we take the implementation in Linux OS, then during runtime suspend >> time, then it calls __pci_enable_wake(). It internally enables PME >> through pci_pme_active() and also enables the ACPI side wake-up >> through platform_pci_set_wakeup(). To handle the PME, the hypervisor has >> the following two options: >> >> 1. Create a virtual root port for the VFIO device and trigger >> interrupt when the PME comes. It will call pcie_pme_irq() which will >> resume the device. >> >> 2. Create a virtual ACPI _PRW resource and associate it with the device >> itself. In _PRW, any GPE (General Purpose Event) can be assigned for >> the wake-up. When PME comes, then GPE can be triggered by the >> hypervisor. GPE interrupt will call pci_acpi_wake_dev() function >> internally and it will resume the device. > > Do we really need to implement PME emulation in the kernel or is it > sufficient for userspace to simply register a one-shot eventfd when > SET'ing the low power feature and QEMU can provide the PME emulation > based on that signaling? Thanks, > > Alex > It seems it can be implemented in QEMU instead of kernel. I will drop this patch. Thanks, Abhishek >> >> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx> >> --- >> drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------ >> drivers/vfio/pci/vfio_pci_core.c | 43 ++++++++++++++++++++++++------ >> drivers/vfio/pci/vfio_pci_intrs.c | 18 +++++++++++++ >> include/linux/vfio_pci_core.h | 2 ++ >> include/uapi/linux/vfio.h | 1 + >> 5 files changed, 87 insertions(+), 16 deletions(-) >> >> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c >> index 21a4743d011f..a06375a03758 100644 >> --- a/drivers/vfio/pci/vfio_pci_config.c >> +++ b/drivers/vfio/pci/vfio_pci_config.c >> @@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos, >> if (count < 0) >> return count; >> >> + /* >> + * PME_STATUS is write-1-clear bit. If PME_STATUS is 1, then clear the >> + * bit in vconfig. The PME_STATUS is in the upper byte of the control >> + * register and user can do single byte write also. >> + */ >> + if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) { >> + if (le32_to_cpu(val) & >> + (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) { >> + __le16 *ctrl = (__le16 *)&vdev->vconfig >> + [vdev->pm_cap_offset + PCI_PM_CTRL]; >> + *ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS); >> + } >> + } >> + >> if (offset == PCI_PM_CTRL) { >> pci_power_t state; >> >> @@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm) >> * the user change power state, but we trap and initiate the >> * change ourselves, so the state bits are read-only. >> * >> - * The guest can't process PME from D3cold so virtualize PME_Status >> - * and PME_En bits. The vconfig bits will be cleared during device >> - * capability initialization. >> + * The guest can't process physical PME from D3cold so virtualize >> + * PME_Status and PME_En bits. These bits will be used for the >> + * virtual PME between host and guest. The vconfig bits will be >> + * updated during device capability initialization. PME_Status is >> + * write-1-clear bit, so it is read-only. We trap and update the >> + * vconfig bit manually during write. >> */ >> p_setd(perm, PCI_PM_CTRL, >> PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS, >> - ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS | >> - PCI_PM_CTRL_STATE_MASK)); >> + ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS)); >> >> return 0; >> } >> @@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev, >> __le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC]; >> __le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL]; >> >> - /* Clear vconfig PME_Support, PME_Status, and PME_En bits */ >> - *pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK); >> + /* >> + * Set the vconfig PME_Support bits. The PME_Status is being used for >> + * virtual PME support and is not dependent upon the physical >> + * PME support. >> + */ >> + *pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK); >> + /* Clear vconfig PME_Support and PME_En bits */ >> *ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS); >> } >> >> @@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev) >> if (ret) >> return ret; >> >> - if (cap == PCI_CAP_ID_PM) >> + if (cap == PCI_CAP_ID_PM) { >> + vdev->pm_cap_offset = pos; >> vfio_update_pm_vconfig_bytes(vdev, pos); >> + } >> >> prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT]; >> pos = next; >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >> index 1ddaaa6ccef5..6c1225bc2aeb 100644 >> --- a/drivers/vfio/pci/vfio_pci_core.c >> +++ b/drivers/vfio/pci/vfio_pci_core.c >> @@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev) >> * the low power state or closed the device. >> * - If there is device access on the host side. >> * >> - * For the second case, check if re-entry to the low power state is >> - * allowed. If not, then increment the usage count so that runtime PM >> - * framework won't suspend the device and set the 'pm_runtime_resumed' >> - * flag. >> + * For the second case: >> + * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set >> + * and user has registered for virtual PME events, then send the PME >> + * virtual PME event. >> + * - Check if re-entry to the low power state is not allowed. >> + * >> + * For the above conditions, increment the usage count so that >> + * runtime PM framework won't suspend the device and set the >> + * 'pm_runtime_resumed' flag. >> */ >> - if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) { >> - pm_runtime_get_noresume(dev); >> - vdev->pm_runtime_resumed = true; >> + if (vdev->pm_runtime_engaged) { >> + bool pme_triggered = false; >> + __le16 *ctrl = (__le16 *)&vdev->vconfig >> + [vdev->pm_cap_offset + PCI_PM_CTRL]; >> + >> + *ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS); >> + if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) { >> + mutex_lock(&vdev->igate); >> + if (vdev->pme_trigger) { >> + pme_triggered = true; >> + eventfd_signal(vdev->pme_trigger, 1); >> + } >> + mutex_unlock(&vdev->igate); >> + } >> + >> + if (!vdev->pm_runtime_reentry_allowed || pme_triggered) { >> + pm_runtime_get_noresume(dev); >> + vdev->pm_runtime_resumed = true; >> + } >> } >> up_write(&vdev->memory_lock); >> >> @@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) >> eventfd_ctx_put(vdev->req_trigger); >> vdev->req_trigger = NULL; >> } >> + if (vdev->pme_trigger) { >> + eventfd_ctx_put(vdev->pme_trigger); >> + vdev->pme_trigger = NULL; >> + } >> mutex_unlock(&vdev->igate); >> } >> EXPORT_SYMBOL_GPL(vfio_pci_core_close_device); >> @@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ >> } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) { >> if (pci_is_pcie(vdev->pdev)) >> return 1; >> - } else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) { >> + } else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX || >> + irq_type == VFIO_PCI_PME_IRQ_INDEX) { >> return 1; >> } >> >> @@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, >> switch (info.index) { >> case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX: >> case VFIO_PCI_REQ_IRQ_INDEX: >> + case VFIO_PCI_PME_IRQ_INDEX: >> break; >> case VFIO_PCI_ERR_IRQ_INDEX: >> if (pci_is_pcie(vdev->pdev)) >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c >> index 1a37db99df48..db4180687a74 100644 >> --- a/drivers/vfio/pci/vfio_pci_intrs.c >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c >> @@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev, >> count, flags, data); >> } >> >> +static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev, >> + unsigned index, unsigned start, >> + unsigned count, uint32_t flags, void *data) >> +{ >> + if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1) >> + return -EINVAL; >> + >> + return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger, >> + count, flags, data); >> +} >> + >> int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags, >> unsigned index, unsigned start, unsigned count, >> void *data) >> @@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags, >> break; >> } >> break; >> + case VFIO_PCI_PME_IRQ_INDEX: >> + switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) { >> + case VFIO_IRQ_SET_ACTION_TRIGGER: >> + func = vfio_pci_set_pme_trigger; >> + break; >> + } >> + break; >> } >> >> if (!func) >> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h >> index 18cc83b767b8..ee2646d820c2 100644 >> --- a/include/linux/vfio_pci_core.h >> +++ b/include/linux/vfio_pci_core.h >> @@ -102,6 +102,7 @@ struct vfio_pci_core_device { >> bool bar_mmap_supported[PCI_STD_NUM_BARS]; >> u8 *pci_config_map; >> u8 *vconfig; >> + u8 pm_cap_offset; >> struct perm_bits *msi_perm; >> spinlock_t irqlock; >> struct mutex igate; >> @@ -133,6 +134,7 @@ struct vfio_pci_core_device { >> int ioeventfds_nr; >> struct eventfd_ctx *err_trigger; >> struct eventfd_ctx *req_trigger; >> + struct eventfd_ctx *pme_trigger; >> struct list_head dummy_resources_list; >> struct mutex ioeventfds_lock; >> struct list_head ioeventfds_list; >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h >> index 7e00de5c21ea..08170950d655 100644 >> --- a/include/uapi/linux/vfio.h >> +++ b/include/uapi/linux/vfio.h >> @@ -621,6 +621,7 @@ enum { >> VFIO_PCI_MSIX_IRQ_INDEX, >> VFIO_PCI_ERR_IRQ_INDEX, >> VFIO_PCI_REQ_IRQ_INDEX, >> + VFIO_PCI_PME_IRQ_INDEX, >> VFIO_PCI_NUM_IRQS >> }; >> >