On 6/8/2022 3:20 AM, Alex Williamson wrote:
> On Fri, 3 Jun 2022 15:49:27 +0530
> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>
>> On 6/2/2022 11:14 PM, Alex Williamson wrote:
>>> On Thu, 2 Jun 2022 17:22:03 +0530
>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>>
>>>> On 6/1/2022 9:51 PM, Alex Williamson wrote:
>>>>> On Wed, 1 Jun 2022 15:19:07 +0530
>>>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:
>>>>>>> On Tue, 31 May 2022 16:43:04 -0300
>>>>>>> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
>>>>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:
>>>>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>>>>>>>
>>>>>>>>>>> 1. In the real use case, config or any other ioctl should not come
>>>>>>>>>>> along with a VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>>>>>>>
>>>>>>>>>>> 2. Maintain some 'access_count' which will be incremented when we
>>>>>>>>>>> do any config space access or ioctl.
>>>>>>>>>>
>>>>>>>>>> Please don't open code locks - if you need a lock then write a proper
>>>>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>>>>>>>> is appropriate.
>>>>>>>>>>
>>>>>>>>>> Jason
>>>>>>>>>
>>>>>>>>> Thanks Jason for providing your inputs.
>>>>>>>>>
>>>>>>>>> In that case, should I introduce a new rw_semaphore (for example,
>>>>>>>>> power_lock) and move 'platform_pm_engaged' under 'power_lock'?
>>>>>>>>
>>>>>>>> Possibly, this is better than an atomic at least.
>>>>>>>>
>>>>>>>>> 1. At the beginning of a config space access or ioctl, we can take
>>>>>>>>> the lock:
>>>>>>>>>
>>>>>>>>> down_read(&vdev->power_lock);
>>>>>>>>
>>>>>>>> You can also do down_read_trylock() here and bail out as you were
>>>>>>>> suggesting with the atomic.
>>>>>>>>
>>>>>>>> trylock doesn't have lock ordering rules because it can't sleep, so it
>>>>>>>> gives a bit more flexibility when designing the lock ordering.
>>>>>>>>
>>>>>>>> Though userspace has to be able to tolerate the failure, or never make
>>>>>>>> the request.
>>>>>>>>
>>>>>>
>>>>>> Thanks Alex and Jason for providing your inputs.
>>>>>>
>>>>>> Using down_read_trylock() along with Alex's suggestion seems fine.
>>>>>> In the real use case, config space access should not happen when the
>>>>>> device is in a low power state, so returning an error should not
>>>>>> cause any issue in this case.
>>>>>>
>>>>>>>>> down_write(&vdev->power_lock);
>>>>>>>>> ...
>>>>>>>>> switch (vfio_pm.low_power_state) {
>>>>>>>>> case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>>>>>>> ...
>>>>>>>>> vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>>>>>>> vdev->power_state_d3 = true;
>>>>>>>>> up_write(&vdev->memory_lock);
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>> up_write(&vdev->power_lock);
>>>>>>>>
>>>>>>>> And something checks the power lock before allowing the memory to be
>>>>>>>> re-enabled?
>>>>>>>>
>>>>>>>>> 4. For ioctl access, as mentioned previously I need to add two
>>>>>>>>> callback functions (one for start and one for end) in struct
>>>>>>>>> vfio_device_ops and call them at the start and end of the ioctl from
>>>>>>>>> vfio_device_fops_unl_ioctl().
>>>>>>>>
>>>>>>>> Not sure I followed this..
>>>>>>>
>>>>>>> I'm kinda lost here too.
>>>>>>
>>>>>>
>>>>>> I have summarized the things below.
>>>>>>
>>>>>> 1. In the current patch (v3 8/8), if a config space access or ioctl was
>>>>>> being made by the user when the device was already in a low power state,
>>>>>> then it was waking the device. This wake-up was happening with the
>>>>>> pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>>>>>> vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
>>>>>>
>>>>>> 2. Now, it has been decided to return an error instead of waking the
>>>>>> device if the device is already in a low power state.
>>>>>>
>>>>>> 3.
>>>>>> Initially I thought to add the following code in the config space path
>>>>>> (and similar in the ioctl path):
>>>>>>
>>>>>> vfio_pci_config_rw() {
>>>>>> ...
>>>>>> down_read(&vdev->memory_lock);
>>>>>> if (vdev->platform_pm_engaged)
>>>>>> {
>>>>>> up_read(&vdev->memory_lock);
>>>>>> return -EIO;
>>>>>> }
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>> And then there was a possibility that the physical config access happens
>>>>>> when the device is in D3cold in case of a race condition.
>>>>>>
>>>>>> 4. So, I wanted to add some mechanism so that the low power entry
>>>>>> ioctl will be serialized with other ioctls or config space accesses.
>>>>>> With this, if low power entry gets scheduled first then config/other
>>>>>> ioctls will get a failure, otherwise low power entry will wait.
>>>>>>
>>>>>> 5. For serializing this access, I need to ensure that the lock is held
>>>>>> throughout the operation. For config space I can add the code in
>>>>>> vfio_pci_config_rw(). But for ioctls, I was not sure what the best way
>>>>>> is, since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>>>>>> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE, etc.) are being handled in the
>>>>>> vfio core layer itself.
>>>>>>
>>>>>> The memory_lock and the variables to track low power are specific to
>>>>>> vfio-pci, so I need some mechanism by which I add the low power check
>>>>>> for each ioctl. For serialization, I need to call a function implemented
>>>>>> in vfio-pci before the vfio core layer makes the actual ioctl to grab
>>>>>> the locks. Similarly, I need to release the lock once the vfio core
>>>>>> layer finishes the actual ioctl. I have mentioned this problem in the
>>>>>> above point (point 4 in my earlier mail).
>>>>>>
>>>>>>> A couple replies back there was some concern
>>>>>>> about race scenarios with multiple user threads accessing the device.
>>>>>>> The ones concerning non-deterministic behavior if a user is
>>>>>>> concurrently changing power state and performing other accesses are a
>>>>>>> non-issue, imo.
>>>>>>
>>>>>> What does non-deterministic behavior mean here?
>>>>>> Is it for the user side, that the user will see a different result
>>>>>> (failure or success) during a race condition, or for the kernel side
>>>>>> (as explained in point 3 above, where a physical config access
>>>>>> happens when the device is in D3cold)? My concern here is for the
>>>>>> latter part, where this config space access in D3cold can cause a
>>>>>> fatal error on the system side, as we have seen for memory disablement.
>>>>>
>>>>> Yes, our only concern should be to prevent such an access. The user
>>>>> seeing non-deterministic behavior, such as during concurrent power
>>>>> control and config space access, where all combinations of
>>>>> success/failure are possible, is par for the course when we decide to
>>>>> block accesses across the life of the low power state.
>>>>>
>>>>>>> I think our goal is only to expand the current
>>>>>>> memory_lock to block accesses, including config space, while the device
>>>>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
>>>>>>>
>>>>>>> I think the remaining issue is how to do that relative to the fact
>>>>>>> that config space access can change the memory enable state and would
>>>>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
>>>>>>> For that I think we can simply drop the read-lock, acquire the
>>>>>>> write-lock, and re-test the low power state. If it has changed, that
>>>>>>> suggests the user has again raced changing power state with another
>>>>>>> access and we can simply drop the lock and return -EIO.
>>>>>>>
>>>>>>
>>>>>> Yes. This looks like the better option. So, just to confirm, I can take
>>>>>> the memory_lock read-lock at the start of vfio_pci_config_rw() and
>>>>>> release it just before returning from vfio_pci_config_rw(), and
>>>>>> for memory related config access, we will release this lock and
>>>>>> re-acquire the write version of it.
>>>>>> Once the memory write happens,
>>>>>> can we then downgrade this write lock to a read lock?
>>>>>
>>>>> We only need to lock for the device access, so if you've finished that
>>>>> access after acquiring the write-lock, there'd be no point to then
>>>>> downgrade that to a read-lock. The access should be finished by that
>>>>> point.
>>>>>
>>>>
>>>> I was planning to take the memory_lock read-lock at the beginning of
>>>> vfio_pci_config_rw() and release the same just before returning from
>>>> this function. If I don't downgrade it back to a read-lock, then the
>>>> release at the end will be called for a lock which has not been taken.
>>>> Also, the user can specify a count of any number of bytes, and then
>>>> vfio_config_do_rw() will be invoked multiple times, and the second
>>>> call would be without the lock.
>>>
>>> Ok, yes, I can imagine how it might result in a cleaner exit path to do
>>> a downgrade_write().
>>>
>>>>>> Also, what about ioctls? How can I take and release memory_lock for
>>>>>> an ioctl? Is it okay to go with patch 7, where we call
>>>>>> pm_runtime_resume_and_get() before each ioctl, or do we need to do
>>>>>> the same low power check for ioctls also?
>>>>>> In the latter case, I am not sure how I should do the implementation
>>>>>> so that all other ioctls are covered from the vfio core layer itself.
>>>>>
>>>>> Some ioctls clearly cannot occur while the device is in low power, such
>>>>> as resets and interrupt control, but even less obvious things like
>>>>> getting region info require device access. Migration also provides a
>>>>> channel to device access. Do we want to manage a list of ioctls that
>>>>> are allowed in low power, or do we only want to allow the ioctl to exit
>>>>> low power?
>>>>>
>>>>
>>>> In the previous version of this patch, you mentioned that a safe ioctl
>>>> list will be tough to maintain. So, currently we want to
>>>> allow the ioctl only for low power exit.
>>>
>>> Yes, I'm still conflicted on how that would work.
>>>
>>>>> I'm also still curious how we're going to handle devices that cannot
>>>>> return to low power, such as the self-refresh mode on the GPU. We can
>>>>> potentially prevent any wake-ups from the vfio device interface, but
>>>>> that doesn't preclude a wake-up via an external lspci. I think we need
>>>>> to understand how we're going to handle such devices before we can
>>>>> really complete the design. AIUI, we cannot disable the self-refresh
>>>>> sleep mode without imposing unreasonable latency and memory
>>>>> requirements on the guest, and we cannot retrigger the self-refresh
>>>>> low-power mode without non-trivial device specific code. Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>
>>>> I am working on adding support to notify the guest through a virtual
>>>> PME whenever there is any wake-up triggered by the host and the guest
>>>> has already put the device into the runtime suspended state. This
>>>> virtual PME will be similar to a physical PME. Normally, if a PCI
>>>> device needs a power management transition, then it sends a PME event
>>>> which will ultimately be handled by the host OS. In the virtual PME
>>>> case, if the host needs a power management transition, then it sends an
>>>> event to the guest and the guest OS handles these virtual PME events.
>>>> Following is a summary:
>>>>
>>>> 1. Add support for one more event like VFIO_PCI_ERR_IRQ_INDEX,
>>>> named VFIO_PCI_PME_IRQ_INDEX, and add the required code for this
>>>> virtual PME event.
>>>>
>>>> 2. From the guest side, when the PME_IRQ is enabled, we will
>>>> set an eventfd for PME.
>>>>
>>>> 3. In the vfio driver, the PME support bits are already
>>>> virtualized and currently set to 0. We can set PME capability support
>>>> for D3cold so that in the guest, it looks like
>>>>
>>>> Capabilities: [60] Power Management version 3
>>>> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
>>>> PME(D0-,D1-,D2-,D3hot-,D3cold+)
>>>>
>>>> 4.
>>>> From the guest side, it can do PME enable (the PME_En bit in the Power
>>>> Management Control/Status Register), which will again be virtualized.
>>>>
>>>> 5. When the host gets a request for resuming the device other than from
>>>> the low power ioctl, then the device pm usage count will be incremented,
>>>> the PME status (the PME_Status bit in the Power Management
>>>> Control/Status Register) will be set, and then we can signal the
>>>> eventfd.
>>>>
>>>> 6. In PCIe, PME events will be handled by the root port. For
>>>> using the low power D3cold feature, it is required to create a virtual
>>>> root port on the hypervisor side, and when the hypervisor receives this
>>>> PME event, it can send a virtual interrupt to the root port.
>>>>
>>>> 7. If we take the example of the Linux kernel, then pcie_pme_irq() will
>>>> handle this and then do the runtime resume on the guest side. Also, it
>>>> will clear the PME status bit here. Then the guest can put the device
>>>> into the suspended state again.
>>>>
>>>> 8. I did prototype changes in QEMU for the above logic and was getting
>>>> a wake-up in the guest whenever I did lspci on the host side.
>>>>
>>>> 9. Since currently only the nvidia GPU has this limitation of requiring
>>>> driver interaction each time before going into D3cold, we can allow
>>>> re-entry for other devices. We can match on the nvidia vendor ID (along
>>>> with the VGA/3D controller class code). In the future, if any other
>>>> device also has a similar requirement, then we can update this list.
>>>> For other devices, the host can put the device into D3cold in case of
>>>> any wake-up.
>>>>
>>>> 10. In the vfio driver, we can put all these restrictions on
>>>> enabling PME and return an error if the user tries to make the low
>>>> power entry ioctl without enabling the PME related things.
>>>>
>>>> 11. The virtual PME can help in handling physical PME also for all
>>>> devices. The PME logic is not dependent upon the nvidia GPU
>>>> restriction.
>>>> If virtual PME is enabled by the hypervisor, then when
>>>> physical PME wakes the device, it will resume on the guest side
>>>> also.
>>>
>>> So if host accesses through things like lspci are going to wake the
>>> device and we can't prevent that, and the solution to that is to notify
>>> the guest to put the device back to low power, then it seems a lot less
>>> important to try to prevent the user from waking the device through
>>> random accesses. In that context, maybe we do simply wrap all accesses
>>> with pm_runtime_get/put() calls, which eliminates the problem of
>>> maintaining a list of safe ioctls in low power.
>>>
>>
>> So wrapping all accesses with pm_runtime_get()/put() will only be
>> applicable for ioctls, correct?
>> For config space, can we go with the approach discussed earlier in which
>> we return an error?
>
> If we need to handle arbitrarily induced wakes from the host, it
> doesn't make much sense to restrict those same sorts of accesses by the
> user through the vfio-device. It also seems a lot easier to simply do
> a pm_get/put() around not only ioctls, but all region accesses to avoid
> the sorts of races you previously identified. Access through mmap
> should still arguably fault, given that there is no discrete end to such
> an access like we have for read/write operations. Thanks,
>
> Alex
>

Thanks Alex for confirming. I will do the same.

Regards,
Abhishek
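[The direction settled on above, bracketing ioctls and region accesses with a runtime PM get/put pair, can be sketched as a small userspace model. All names here are invented for the sketch; the kernel side would call pm_runtime_resume_and_get() and pm_runtime_put() on the real device, with the PM core doing the actual resume/suspend.]

```c
#include <errno.h>
#include <pthread.h>

/* Toy stand-in for the runtime PM core's usage counter and device state. */
struct pm_model {
    pthread_mutex_t lock;
    int usage_count;     /* models the runtime PM usage counter */
    int is_suspended;    /* 1 while the device sits in low power */
};

/* Models pm_runtime_resume_and_get(): bump the count and resume. */
static int pm_get(struct pm_model *pm)
{
    pthread_mutex_lock(&pm->lock);
    pm->usage_count++;
    pm->is_suspended = 0;            /* first get resumes the device */
    pthread_mutex_unlock(&pm->lock);
    return 0;
}

/* Models pm_runtime_put(): drop the count; idle allows suspend again. */
static void pm_put(struct pm_model *pm)
{
    pthread_mutex_lock(&pm->lock);
    if (--pm->usage_count == 0)
        pm->is_suspended = 1;
    pthread_mutex_unlock(&pm->lock);
}

/* Region access wrapper: the device is guaranteed awake in the body,
 * which removes the need for a safe-ioctl list in low power. */
static int region_access(struct pm_model *pm)
{
    int ret = pm_get(pm);

    if (ret)
        return ret;
    ret = pm->is_suspended ? -EIO : 0;   /* must be awake here */
    pm_put(pm);
    return ret;
}
```

An mmap access has no discrete end, which is why it would still fault rather than being wrapped this way.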