On 7/6/2022 9:09 PM, Alex Williamson wrote: > On Fri, 1 Jul 2022 16:38:10 +0530 > Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > >> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT >> for the power management in the header file. The implementation for the >> same will be added in the subsequent patches. >> >> With the standard registers, all power states cannot be achieved. The >> platform-based power management needs to be involved to go into the >> lowest power state. For all the platform-based power management, this >> device feature can be used. >> >> This device feature uses flags to specify the different operations. In >> the future, if any more power management functionality is needed then >> a new flag can be added to it. It supports both GET and SET operations. >> >> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx> >> --- >> include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 55 insertions(+) >> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h >> index 733a1cddde30..7e00de5c21ea 100644 >> --- a/include/uapi/linux/vfio.h >> +++ b/include/uapi/linux/vfio.h >> @@ -986,6 +986,61 @@ enum vfio_device_mig_state { >> VFIO_DEVICE_STATE_RUNNING_P2P = 5, >> }; >> >> +/* >> + * Perform power management-related operations for the VFIO device. >> + * >> + * The low power feature uses platform-based power management to move the >> + * device into the low power state. This low power state is device-specific. >> + * >> + * This device feature uses flags to specify the different operations. >> + * It supports both the GET and SET operations. >> + * >> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power >> + * state with platform-based power management. This low power state will be >> + * internal to the VFIO driver and the user will not come to know which power >> + * state is chosen. Once the user has moved the VFIO device into the low >> + * power state, then the user should not do any device access without moving >> + * the device out of the low power state. > > Except we're wrapping device accesses to make this possible. This > should probably describe how any discrete access will wake the device > but ongoing access through mmaps will generate user faults. > Sure. I will add that details also. >> + * >> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power >> + * state. This flag should only be set if the user has previously put the >> + * device into low power state with the VFIO_PM_LOW_POWER_ENTER flag. > > Indenting. > I will fix this. >> + * >> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive. >> + * >> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with >> + * VFIO_PM_LOW_POWER_ENTER. If there is any access for the VFIO device on >> + * the host side, then the device will be moved out of the low power state >> + * without the user's guest driver involvement. Some devices require the >> + * user's guest driver involvement for each low-power entry. If this flag is >> + * set, then the re-entry to the low power state will be disabled, and the >> + * host kernel will not move the device again into the low power state. >> + * The VFIO driver internally maintains a list of devices for which low >> + * power re-entry is disabled by default and for those devices, the >> + * re-entry will be disabled even if the user has not set this flag >> + * explicitly. > > Wrong polarity. The kernel should not maintain the policy. By default > every wakeup, whether from host kernel accesses or via user accesses > that do a pm-get should signal a wakeup to userspace. Userspace needs > to opt-out of that wakeup to let the kernel automatically re-enter low > power and userspace needs to maintain the policy for which devices it > wants that to occur. > Okay. So that means, in the kernel side, we don’t have to maintain the list which currently contains NVIDIA device ID. Also, in our updated approach, this opt-out of that wake-up means that user has not provided eventfd in the feature SET ioctl. Correct ? >> + * >> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET: >> + * >> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into >> + * the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set. >> + * >> + * - If the device is in a normal power state currently, then >> + * VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low >> + * power re-entry is disabled by default. If the device is in the low power >> + * state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set >> + * according to the current transition. > > Very confusing semantics. > > What if the feature SET ioctl took an eventfd and that eventfd was one > time use. Calling the ioctl would setup the eventfd to notify the user > on wakeup and call pm-put. Any access to the device via host, ioctl, > or region would be wrapped in pm-get/put and the pm-resume handler > would perform the matching pm-get to balance the feature SET and signal > the eventfd. This seems a better option. It will help in making the ioctl simpler and we don’t have to add a separate index for PME which I added in patch 6. > If the user opts-out by not providing a wakeup eventfd, > then the pm-resume handler does not perform a pm-get. Possibly we > could even allow mmap access if a wake-up eventfd is provided. Sorry. I am not clear on this mmap part. We currently invalidates mapping before going into runtime-suspend. Now, if use tries do mmap then do we need some extra handling in the fault handler ? Need your help in understanding this part. > The > feature GET ioctl would be used to exit low power behavior and would be > a no-op if the wakeup eventfd had already been signaled. Thanks, > I will use the GET ioctl for low power exit instead of returning the current status. Regards, Abhishek > Alex > >> + */ >> +struct vfio_device_feature_power_management { >> + __u32 flags; >> +#define VFIO_PM_LOW_POWER_ENTER (1 << 0) >> +#define VFIO_PM_LOW_POWER_EXIT (1 << 1) >> +#define VFIO_PM_LOW_POWER_REENTERY_DISABLE (1 << 2) >> + __u32 reserved; >> +}; >> + >> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT 3 >> + >> /* -------- API for Type1 VFIO IOMMU -------- */ >> >> /** >