Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 7 Jun 2022 15:50:31 -0600

On Fri, 3 Jun 2022 15:49:27 +0530
Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:

> On 6/2/2022 11:14 PM, Alex Williamson wrote:
> > On Thu, 2 Jun 2022 17:22:03 +0530
> > Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
> >   
> >> On 6/1/2022 9:51 PM, Alex Williamson wrote:  
> >>> On Wed, 1 Jun 2022 15:19:07 +0530
> >>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
> >>>     
> >>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:    
> >>>>> On Tue, 31 May 2022 16:43:04 -0300
> >>>>> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >>>>>       
> >>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:      
> >>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:        
> >>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
> >>>>>>>>         
> >>>>>>>>>  1. In real use case, config or any other ioctl should not come along
> >>>>>>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
> >>>>>>>>>  
> >>>>>>>>>  2. Maintain some 'access_count' which will be incremented when we
> >>>>>>>>>     do any config space access or ioctl.        
> >>>>>>>>
> >>>>>>>> Please don't open code locks - if you need a lock then write a proper
> >>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
> >>>>>>>> is appropriate.
> >>>>>>>>
> >>>>>>>> Jason        
> >>>>>>>
> >>>>>>>  Thanks Jason for providing your inputs.
> >>>>>>>
> >>>>>>>  In that case, should I introduce new rw_semaphore (For example
> >>>>>>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?        
> >>>>>>
> >>>>>> Possibly, this is better than an atomic at least
> >>>>>>      
> >>>>>>>  1. At the beginning of config space access or ioctl, we can take the
> >>>>>>>     lock
> >>>>>>>  
> >>>>>>>      down_read(&vdev->power_lock);        
> >>>>>>
> >>>>>> You can also do down_read_trylock() here and bail out as you were
> >>>>>> suggesting with the atomic.
> >>>>>>
> >>>>>> trylock doesn't have lock odering rules because it can't sleep so it
> >>>>>> gives a bit more flexability when designing the lock ordering.
> >>>>>>
> >>>>>> Though userspace has to be able to tolerate the failure, or never make
> >>>>>> the request.
> >>>>>>      
> >>>>
> >>>>  Thanks Alex and Jason for providing your inputs.
> >>>>
> >>>>  Using down_read_trylock() along with Alex suggestion seems fine.
> >>>>  In real use case, config space access should not happen when the
> >>>>  device is in low power state so returning error should not
> >>>>  cause any issue in this case.
> >>>>    
> >>>>>>>          down_write(&vdev->power_lock);
> >>>>>>>          ...
> >>>>>>>          switch (vfio_pm.low_power_state) {
> >>>>>>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
> >>>>>>>                  ...
> >>>>>>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
> >>>>>>>                          vdev->power_state_d3 = true;
> >>>>>>>                          up_write(&vdev->memory_lock);
> >>>>>>>
> >>>>>>>          ...
> >>>>>>>          up_write(&vdev->power_lock);        
> >>>>>>
> >>>>>> And something checks the power lock before allowing the memor to be
> >>>>>> re-enabled?
> >>>>>>      
> >>>>>>>  4.  For ioctl access, as mentioned previously I need to add two
> >>>>>>>      callbacks functions (one for start and one for end) in the struct
> >>>>>>>      vfio_device_ops and call the same at start and end of ioctl from
> >>>>>>>      vfio_device_fops_unl_ioctl().        
> >>>>>>
> >>>>>> Not sure I followed this..      
> >>>>>
> >>>>> I'm kinda lost here too.      
> >>>>
> >>>>
> >>>>  I have summarized the things below
> >>>>
> >>>>  1. In the current patch (v3 8/8), if config space access or ioctl was
> >>>>     being made by the user when the device is already in low power state,
> >>>>     then it was waking the device. This wake up was happening with
> >>>>     pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
> >>>>     vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
> >>>>
> >>>>  2. Now, it has been decided to return error instead of waking the
> >>>>     device if the device is already in low power state.
> >>>>
> >>>>  3. Initially I thought to add following code in config space path
> >>>>     (and similar in ioctl)
> >>>>
> >>>>         vfio_pci_config_rw() {
> >>>>             ...
> >>>>             down_read(&vdev->memory_lock);
> >>>>             if (vdev->platform_pm_engaged)
> >>>>             {
> >>>>                 up_read(&vdev->memory_lock);
> >>>>                 return -EIO;
> >>>>             }
> >>>>             ...
> >>>>         }
> >>>>
> >>>>      And then there was a possibility that the physical config happens
> >>>>      when the device in D3cold in case of race condition.
> >>>>
> >>>>  4.  So, I wanted to add some mechanism so that the low power entry
> >>>>      ioctl will be serialized with other ioctl or config space. With this
> >>>>      if low power entry gets scheduled first then config/other ioctls will
> >>>>      get failure, otherwise low power entry will wait.
> >>>>
> >>>>  5.  For serializing this access, I need to ensure that lock is held
> >>>>      throughout the operation. For config space I can add the code in
> >>>>      vfio_pci_config_rw(). But for ioctls, I was not sure what is the best
> >>>>      way since few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
> >>>>      VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE etc.) are being handled in the
> >>>>      vfio core layer itself.
> >>>>
> >>>>  The memory_lock and the variables to track low power in specific to
> >>>>  vfio-pci so I need some mechanism by which I add low power check for
> >>>>  each ioctl. For serialization, I need to call function implemented in
> >>>>  vfio-pci before vfio core layer makes the actual ioctl to grab the
> >>>>  locks. Similarly, I need to release the lock once vfio core layer
> >>>>  finished the actual ioctl. I have mentioned about this problem in the
> >>>>  above point (point 4 in my earlier mail).
> >>>>    
> >>>>> A couple replies back there was some concern
> >>>>> about race scenarios with multiple user threads accessing the device.
> >>>>> The ones concerning non-deterministic behavior if a user is
> >>>>> concurrently changing power state and performing other accesses are a
> >>>>> non-issue, imo.        
> >>>>
> >>>>  What does non-deterministic behavior here mean.
> >>>>  Is it for user side that user will see different result
> >>>>  (failure or success) during race condition or in the kernel side
> >>>>  (as explained in point 3 above where physical config access
> >>>>  happens when the device in D3cold) ? My concern here is for later
> >>>>  part where this config space access in D3cold can cause fatal error
> >>>>  on the system side as we have seen for memory disablement.    
> >>>
> >>> Yes, our only concern should be to prevent such an access.  The user
> >>> seeing non-deterministic behavior, such as during concurrent power
> >>> control and config space access, all combinations of success/failure
> >>> are possible, is par for the course when we decide to block accesses
> >>> across the life of the low power state.
> >>>      
> >>>>> I think our goal is only to expand the current
> >>>>> memory_lock to block accesses, including config space, while the device
> >>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
> >>>>>
> >>>>> I think the remaining issues is how to do that relative to the fact
> >>>>> that config space access can change the memory enable state and would
> >>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
> >>>>> For that I think we can simply drop the read-lock, acquire the
> >>>>> write-lock, and re-test the low power state.  If it has changed, that
> >>>>> suggests the user has again raced changing power state with another
> >>>>> access and we can simply drop the lock and return -EIO.
> >>>>>       
> >>>>
> >>>>  Yes. This looks better option. So, just to confirm, I can take the
> >>>>  memory_lock read-lock at the starting of vfio_pci_config_rw() and
> >>>>  release it just before returning from vfio_pci_config_rw() and
> >>>>  for memory related config access, we will release this lock and
> >>>>  re-aquiring again write version of this. Once memory write happens,
> >>>>  then we can downgrade this write lock to read lock ?    
> >>>
> >>> We only need to lock for the device access, so if you've finished that
> >>> access after acquiring the write-lock, there'd be no point to then
> >>> downgrade that to a read-lock.  The access should be finished by that
> >>> point.
> >>>    
> >>
> >>  I was planning to take memory_lock read-lock at the beginning of
> >>  vfio_pci_config_rw() and release the same just before returning from
> >>  this function. If I don't downgrade it back to read-lock, then the
> >>  release in the end will be called for the lock which has not taken.
> >>  Also, user can specify count to any number of bytes and then the
> >>  vfio_config_do_rw() will be invoked multiple times and then in
> >>  the second call, it will be without lock.  
> > 
> > Ok, yes, I can imagine how it might result in a cleaner exit path to do
> > a downgrade_write().
> >   
> >>>>  Also, what about IOCTLs. How can I take and release memory_lock for
> >>>>  ioctl. is it okay to go with Patch 7 where we call
> >>>>  pm_runtime_resume_and_get() before each ioctl or we need to do the
> >>>>  same low power check for ioctl also ?
> >>>>  In Later case, I am not sure how should I do the implementation so
> >>>>  that all other ioctl are covered from vfio core layer itself.    
> >>>
> >>> Some ioctls clearly cannot occur while the device is in low power, such
> >>> as resets and interrupt control, but even less obvious things like
> >>> getting region info require device access.  Migration also provides a
> >>> channel to device access.  Do we want to manage a list of ioctls that
> >>> are allowed in low power, or do we only want to allow the ioctl to exit
> >>> low power?
> >>>     
> >>
> >>  In previous version of this patch, you mentioned that maintaining the
> >>  safe ioctl list will be tough to maintain. So, currently we wanted to
> >>  allow the ioctl for low power exit.  
> > 
> > Yes, I'm still conflicted in how that would work.
> >    
> >>> I'm also still curious how we're going to handle devices that cannot
> >>> return to low power such as the self-refresh mode on the GPU.  We can
> >>> potentially prevent any wake-ups from the vfio device interface, but
> >>> that doesn't preclude a wake-up via an external lspci.  I think we need
> >>> to understand how we're going to handle such devices before we can
> >>> really complete the design.  AIUI, we cannot disable the self-refresh
> >>> sleep mode without imposing unreasonable latency and memory
> >>> requirements on the guest and we cannot retrigger the self-refresh
> >>> low-power mode without non-trivial device specific code.  Thanks,
> >>>
> >>> Alex
> >>>     
> >>
> >>  I am working on adding support to notify guest through virtual PME
> >>  whenever there is any wake-up triggered by the host and the guest has
> >>  already put the device into runtime suspended state. This virtual PME
> >>  will be similar to physical PME. Normally, if PCI device need power
> >>  management transition, then it sends PME event which will be
> >>  ultimately handled by host OS. In virtual PME case, if host need power
> >>  management transition, then it sends event to guest and then guest OS
> >>  handles these virtual PME events. Following is summary:
> >>
> >>  1. Add the support for one more event like VFIO_PCI_ERR_IRQ_INDEX
> >>     named VFIO_PCI_PME_IRQ_INDEX and add the required code for this
> >>     virtual PME event.
> >>
> >>  2. From the guest side, when the PME_IRQ is enabled then we will
> >>     set event_fd for PME.
> >>
> >>  3. In the vfio driver, the PME support bits are already
> >>     virtualized and currently set to 0. We can set PME capability support
> >>     for D3cold so that in guest, it looks like
> >>
> >>      Capabilities: [60] Power Management version 3
> >>      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> >>             PME(D0-,D1-,D2-,D3hot-,D3cold+)
> >>
> >>  4. From the guest side, it can do PME enable (PME_En bit in Power
> >>     Management Control/Status Register) which will be again virtualized.
> >>
> >>  5. When host gets request for resuming the device other than from
> >>     low power ioctl, then device pm usage count will be incremented, the
> >>     PME status (PME_Status bit in Power Management Control/Status Register)
> >>     will be set and then we can do the event_fd signal.
> >>
> >>  6. In the PCIe, the PME events will be handled by root port. For
> >>     using low power D3cold feature, it is required to create virtual root
> >>     port in hypervisor side and when hypervisor receives this PME event,
> >>     then it can send virtual interrupt to root port.
> >>
> >>  7. If we take example of Linux kernel, then pcie_pme_irq() will
> >>     handle this and then do the runtime resume on the guest side. Also, it
> >>     will clear the PME status bit here. Then guest can put the device
> >>     again into suspended state.
> >>
> >>  8. I did prototype changes in QEMU for above logic and was getting wake-up
> >>     in the guest whenever I do lspci on the host side.
> >>
> >>  9. Since currently only nvidia GPU has this limitation to require
> >>     driver interaction each time before going into D3cold so we can allow
> >>     the reentry for other device. We can have nvidia vendor (along with
> >>     VGA/3D controller class code). In future, if any other device also has
> >>     similar requirement then we can update this list. For other device
> >>     host can put the device into D3cold in case of any wake-up.
> >>
> >>  10. In the vfio driver, we can put all these restriction for
> >>      enabling PME and return error if user tries to make low power entry
> >>      ioctl without enabling the PME related things.
> >>
> >>  11. The virtual PME can help in handling physical PME also for all
> >>      the devices. The PME logic is not dependent upon nvidia GPU
> >>      restriction. If virtual PME is enabled by hypervisor, then when
> >>      physical PME wakes the device, then it will resume on the guest side
> >>      also.  
> > 
> > So if host accesses through things like lspci are going to wake the
> > device and we can't prevent that, and the solution to that is to notify
> > the guest to put the device back to low power, then it seems a lot less
> > important to try to prevent the user from waking the device through
> > random accesses.  In that context, maybe we do simply wrap all accesses
> > with pm_runtime_get/put() put calls, which eliminates the problem of
> > maintaining a list of safe ioctls in low power.
> >   
> 
>  So wrap all access with pm_runtime_get()/put() will only be applicable
>  for IOCTLs. Correct ?
>  For config space, we can go with the approach discussed earlier in which
>  we return error ?

If we need to handle arbitrarily induced wakes from the host, it
doesn't make much sense to restrict those same sort of accesses by the
user through the vfio-device.  It also seems a lot easier to simply do
a pm_get/put() around not only ioctls, but all region accesses to avoid
the sorts of races you previously identified.  Access through mmap
should still arguably fault given that there is no discrete end to such
an access like we have for read/write operations.  Thanks,

Alex