On Mon, 11 Jul 2022 15:13:13 +0530 Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > On 7/8/2022 10:06 PM, Alex Williamson wrote: > > On Fri, 8 Jul 2022 15:09:22 +0530 > > Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > > > >> On 7/6/2022 9:09 PM, Alex Williamson wrote: > >>> On Fri, 1 Jul 2022 16:38:10 +0530 > >>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > >>> > >>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT > >>>> for the power management in the header file. The implementation for the > >>>> same will be added in the subsequent patches. > >>>> > >>>> With the standard registers, all power states cannot be achieved. The > >>>> platform-based power management needs to be involved to go into the > >>>> lowest power state. For all the platform-based power management, this > >>>> device feature can be used. > >>>> > >>>> This device feature uses flags to specify the different operations. In > >>>> the future, if any more power management functionality is needed then > >>>> a new flag can be added to it. It supports both GET and SET operations. > >>>> > >>>> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx> > >>>> --- > >>>> include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++ > >>>> 1 file changed, 55 insertions(+) > >>>> > >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > >>>> index 733a1cddde30..7e00de5c21ea 100644 > >>>> --- a/include/uapi/linux/vfio.h > >>>> +++ b/include/uapi/linux/vfio.h > >>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state { > >>>> VFIO_DEVICE_STATE_RUNNING_P2P = 5, > >>>> }; > >>>> > >>>> +/* > >>>> + * Perform power management-related operations for the VFIO device. > >>>> + * > >>>> + * The low power feature uses platform-based power management to move the > >>>> + * device into the low power state. This low power state is device-specific. > >>>> + * > >>>> + * This device feature uses flags to specify the different operations. > >>>> + * It supports both the GET and SET operations. > >>>> + * > >>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power > >>>> + * state with platform-based power management. This low power state will be > >>>> + * internal to the VFIO driver and the user will not come to know which power > >>>> + * state is chosen. Once the user has moved the VFIO device into the low > >>>> + * power state, then the user should not do any device access without moving > >>>> + * the device out of the low power state. > >>> > >>> Except we're wrapping device accesses to make this possible. This > >>> should probably describe how any discrete access will wake the device > >>> but ongoing access through mmaps will generate user faults. > >>> > >> > >> Sure. I will add that details also. > >> > >>>> + * > >>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power > >>>> + * state. This flag should only be set if the user has previously put the > >>>> + * device into low power state with the VFIO_PM_LOW_POWER_ENTER flag. > >>> > >>> Indenting. > >>> > >> > >> I will fix this. > >> > >>>> + * > >>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive. > >>>> + * > >>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with > >>>> + * VFIO_PM_LOW_POWER_ENTER. If there is any access for the VFIO device on > >>>> + * the host side, then the device will be moved out of the low power state > >>>> + * without the user's guest driver involvement. Some devices require the > >>>> + * user's guest driver involvement for each low-power entry. If this flag is > >>>> + * set, then the re-entry to the low power state will be disabled, and the > >>>> + * host kernel will not move the device again into the low power state. > >>>> + * The VFIO driver internally maintains a list of devices for which low > >>>> + * power re-entry is disabled by default and for those devices, the > >>>> + * re-entry will be disabled even if the user has not set this flag > >>>> + * explicitly. > >>> > >>> Wrong polarity. The kernel should not maintain the policy. By default > >>> every wakeup, whether from host kernel accesses or via user accesses > >>> that do a pm-get should signal a wakeup to userspace. Userspace needs > >>> to opt-out of that wakeup to let the kernel automatically re-enter low > >>> power and userspace needs to maintain the policy for which devices it > >>> wants that to occur. > >>> > >> > >> Okay. So that means, in the kernel side, we don’t have to maintain > >> the list which currently contains NVIDIA device ID. Also, in our > >> updated approach, this opt-out of that wake-up means that user > >> has not provided eventfd in the feature SET ioctl. Correct ? > > > > Yes, I'm imagining that if the user hasn't provided a one-shot wake-up > > eventfd, that's the opt-out for being notified of device wakes. For > > example, pm-resume would have something like: > > > > > > if (vdev->pm_wake_eventfd) { > > eventfd_signal(vdev->pm_wake_eventfd, 1); > > vdev->pm_wake_eventfd = NULL; > > pm_runtime_get_noresume(dev); > > } > > > > (eventfd pseudo handling substantially simplified) > > > > So w/o a wake-up eventfd, the user would need to call the pm feature > > exit ioctl to elevate the pm reference to prevent it going back to low > > power. The pm feature exit ioctl would be optional if a wake eventfd is > > provided, so some piece of the eventfd context would need to remain to > > determine whether a pm-get is necessary. > > > >>>> + * > >>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET: > >>>> + * > >>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into > >>>> + * the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set. > >>>> + * > >>>> + * - If the device is in a normal power state currently, then > >>>> + * VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low > >>>> + * power re-entry is disabled by default. If the device is in the low power > >>>> + * state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set > >>>> + * according to the current transition. > >>> > >>> Very confusing semantics. > >>> > >>> What if the feature SET ioctl took an eventfd and that eventfd was one > >>> time use. Calling the ioctl would setup the eventfd to notify the user > >>> on wakeup and call pm-put. Any access to the device via host, ioctl, > >>> or region would be wrapped in pm-get/put and the pm-resume handler > >>> would perform the matching pm-get to balance the feature SET and signal > >>> the eventfd. > >> > >> This seems a better option. It will help in making the ioctl simpler > >> and we don’t have to add a separate index for PME which I added in > >> patch 6. > >> > >>> If the user opts-out by not providing a wakeup eventfd, > >>> then the pm-resume handler does not perform a pm-get. Possibly we > >>> could even allow mmap access if a wake-up eventfd is provided. > >> > >> Sorry. I am not clear on this mmap part. We currently invalidates > >> mapping before going into runtime-suspend. Now, if use tries do > >> mmap then do we need some extra handling in the fault handler ? > >> Need your help in understanding this part. > > > > The option that I'm thinking about is if the mmap fault handler is > > wrapped in a pm-get/put then we could actually populate the mmap. In > > the case where the pm-get triggers the wake-eventfd in pm-resume, the > > device doesn't return to low power when the mmap fault handler calls > > pm-put. This possibly allows that we could actually invalidate mmaps on > > pm-suspend rather than in the pm feature enter ioctl, essentially the > > same as we're doing for intx. I wonder though if this allows the > > possibility that we just bounce between mmap fault and pm-suspend. So > > long as some work can be done, for instance the pm-suspend occurs > > asynchronously to the pm-put, this might be ok. > > > > We can do this. But in the normal use case, the situation should > never arise where user should access any mmaped region when user has > already put the device into D3 (D3hot or D3cold). This can only happen > if there is some bug in the guest driver or user is doing wrong > sequence. Do we need to add handling to officially support this part ? We cannot rely on userspace drivers to be bug free or non-malicious, but if we want to impose that an mmap access while low power is enabled always triggers a fault, that's ok. > pm-get can take more than a second for resume for some devices and > will doing this in fault handler be safe ? > > Also, we will add this support only when wake-eventfd is provided so > still w/o wake-eventfd case, the mmap access will still generate fault. > So, we will have different behavior. Will that be acceptable ? Let's keep it simple, generate a fault for all cases. > >>> The > >>> feature GET ioctl would be used to exit low power behavior and would be > >>> a no-op if the wakeup eventfd had already been signaled. Thanks, > >>> > >> > >> I will use the GET ioctl for low power exit instead of returning the > >> current status. > > > > Note that Yishai is proposing a device DMA dirty logging feature where > > the stop and start are exposed via SET on separate features, rather > > than SET/GET. We should probably maintain some consistency between > > these use cases. Possibly we might even want two separate pm enter > > ioctls, one with the wake eventfd and one without. I think this is the > > sort of thing Jason is describing for future expansion of the dirty > > tracking uAPI. Thanks, > > > > Alex > > > > Okay. So, we need to add 3 device features in total. > > VFIO_DEVICE_FEATURE_PM_ENTRY > VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP > VFIO_DEVICE_FEATURE_PM_EXIT > > And only the second one need structure which will have only one field > for eventfd and we need to return error if wakeup-eventfd is not > provided in the second feature ? Yes, we'd use eventfd_ctx and fail on a bad fileget. > Do we need to support GET operation also for these ? > We can skip GET operation since that won’t be very useful. What would they do? Thanks, Alex