On 6/2/2022 11:14 PM, Alex Williamson wrote:
> On Thu, 2 Jun 2022 17:22:03 +0530
> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>
>> On 6/1/2022 9:51 PM, Alex Williamson wrote:
>>> On Wed, 1 Jun 2022 15:19:07 +0530
>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>>
>>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:
>>>>> On Tue, 31 May 2022 16:43:04 -0300
>>>>> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
>>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:
>>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>>>>>
>>>>>>>>> 1. In the real use case, config or any other ioctl should not come
>>>>>>>>> along with the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>>>>>
>>>>>>>>> 2. Maintain some 'access_count' which will be incremented when we
>>>>>>>>> do any config space access or ioctl.
>>>>>>>>
>>>>>>>> Please don't open code locks - if you need a lock then write a proper
>>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>>>>>> is appropriate.
>>>>>>>>
>>>>>>>> Jason
>>>>>>>
>>>>>>> Thanks Jason for providing your inputs.
>>>>>>>
>>>>>>> In that case, should I introduce a new rw_semaphore (for example,
>>>>>>> power_lock) and move 'platform_pm_engaged' under 'power_lock'?
>>>>>>
>>>>>> Possibly, this is better than an atomic at least
>>>>>>
>>>>>>> 1. At the beginning of config space access or ioctl, we can take the
>>>>>>> lock
>>>>>>>
>>>>>>> 	down_read(&vdev->power_lock);
>>>>>>
>>>>>> You can also do down_read_trylock() here and bail out as you were
>>>>>> suggesting with the atomic.
>>>>>>
>>>>>> trylock doesn't have lock ordering rules because it can't sleep so it
>>>>>> gives a bit more flexibility when designing the lock ordering.
>>>>>>
>>>>>> Though userspace has to be able to tolerate the failure, or never make
>>>>>> the request.
>>>>>>
>>>>
>>>> Thanks Alex and Jason for providing your inputs.
>>>>
>>>> Using down_read_trylock() along with Alex's suggestion seems fine.
>>>> In the real use case, config space access should not happen when the
>>>> device is in low power state, so returning an error should not
>>>> cause any issue in this case.
>>>>
>>>>>>> 	down_write(&vdev->power_lock);
>>>>>>> 	...
>>>>>>> 	switch (vfio_pm.low_power_state) {
>>>>>>> 	case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>>>>> 		...
>>>>>>> 		vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>>>>> 		vdev->power_state_d3 = true;
>>>>>>> 		up_write(&vdev->memory_lock);
>>>>>>>
>>>>>>> 	...
>>>>>>> 	up_write(&vdev->power_lock);
>>>>>>
>>>>>> And something checks the power lock before allowing the memory to be
>>>>>> re-enabled?
>>>>>>
>>>>>>> 4. For ioctl access, as mentioned previously I need to add two
>>>>>>> callback functions (one for start and one for end) in the struct
>>>>>>> vfio_device_ops and call the same at the start and end of the ioctl
>>>>>>> from vfio_device_fops_unl_ioctl().
>>>>>>
>>>>>> Not sure I followed this..
>>>>>
>>>>> I'm kinda lost here too.
>>>>
>>>>
>>>> I have summarized the things below.
>>>>
>>>> 1. In the current patch (v3 8/8), if a config space access or ioctl
>>>> was made by the user when the device was already in low power state,
>>>> then it was waking the device. This wake-up was happening with the
>>>> pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>>>> vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
>>>>
>>>> 2. Now, it has been decided to return an error instead of waking the
>>>> device if the device is already in low power state.
>>>>
>>>> 3. Initially I thought to add the following code in the config space
>>>> path (and similar in ioctl)
>>>>
>>>> vfio_pci_config_rw() {
>>>> 	...
>>>> 	down_read(&vdev->memory_lock);
>>>> 	if (vdev->platform_pm_engaged) {
>>>> 		up_read(&vdev->memory_lock);
>>>> 		return -EIO;
>>>> 	}
>>>> 	...
>>>> }
>>>>
>>>> And then there was a possibility that the physical config access
>>>> happens when the device is in D3cold in case of a race condition.
>>>>
>>>> 4. So, I wanted to add some mechanism so that the low power entry
>>>> ioctl will be serialized with other ioctls or config space access.
>>>> With this, if low power entry gets scheduled first then config/other
>>>> ioctls will get a failure, otherwise low power entry will wait.
>>>>
>>>> 5. For serializing this access, I need to ensure that the lock is
>>>> held throughout the operation. For config space I can add the code in
>>>> vfio_pci_config_rw(). But for ioctls, I was not sure what is the best
>>>> way since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>>>> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE etc.) are being handled in the
>>>> vfio core layer itself.
>>>>
>>>> The memory_lock and the variables to track low power are specific to
>>>> vfio-pci, so I need some mechanism by which I add a low power check
>>>> for each ioctl. For serialization, I need to call a function
>>>> implemented in vfio-pci before the vfio core layer makes the actual
>>>> ioctl, to grab the locks. Similarly, I need to release the lock once
>>>> the vfio core layer finishes the actual ioctl. I have mentioned this
>>>> problem in the above point (point 4 in my earlier mail).
>>>>
>>>>> A couple replies back there was some concern
>>>>> about race scenarios with multiple user threads accessing the device.
>>>>> The ones concerning non-deterministic behavior if a user is
>>>>> concurrently changing power state and performing other accesses are a
>>>>> non-issue, imo.
>>>>
>>>> What does non-deterministic behavior mean here?
>>>> Is it on the user side, that the user will see a different result
>>>> (failure or success) during a race condition, or on the kernel side
>>>> (as explained in point 3 above, where the physical config access
>>>> happens when the device is in D3cold)?
>>>> My concern here is for the latter
>>>> part, where this config space access in D3cold can cause a fatal error
>>>> on the system side, as we have seen for memory disablement.
>>>
>>> Yes, our only concern should be to prevent such an access. The user
>>> seeing non-deterministic behavior, such as during concurrent power
>>> control and config space access, where all combinations of
>>> success/failure are possible, is par for the course when we decide to
>>> block accesses across the life of the low power state.
>>>
>>>>> I think our goal is only to expand the current
>>>>> memory_lock to block accesses, including config space, while the device
>>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
>>>>>
>>>>> I think the remaining issue is how to do that relative to the fact
>>>>> that config space access can change the memory enable state and would
>>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
>>>>> For that I think we can simply drop the read-lock, acquire the
>>>>> write-lock, and re-test the low power state. If it has changed, that
>>>>> suggests the user has again raced changing power state with another
>>>>> access and we can simply drop the lock and return -EIO.
>>>>>
>>>>
>>>> Yes, this looks like the better option. So, just to confirm: I can
>>>> take the memory_lock read-lock at the start of vfio_pci_config_rw()
>>>> and release it just before returning from vfio_pci_config_rw(), and
>>>> for memory-related config access, we will release this lock and
>>>> re-acquire the write version of it. Once the memory write happens,
>>>> can we then downgrade this write lock back to a read lock?
>>>
>>> We only need to lock for the device access, so if you've finished that
>>> access after acquiring the write-lock, there'd be no point to then
>>> downgrade that to a read-lock. The access should be finished by that
>>> point.
>>>
>>
>> I was planning to take the memory_lock read-lock at the beginning of
>> vfio_pci_config_rw() and release it just before returning from this
>> function. If I don't downgrade it back to a read-lock, then the
>> release at the end will be called on a lock which has not been taken.
>> Also, the user can specify a count of any number of bytes, in which
>> case vfio_config_do_rw() will be invoked multiple times, and the
>> second call would then run without the lock.
>
> Ok, yes, I can imagine how it might result in a cleaner exit path to do
> a downgrade_write().
>
>>>> Also, what about ioctls? How can I take and release memory_lock for
>>>> an ioctl? Is it okay to go with patch 7, where we call
>>>> pm_runtime_resume_and_get() before each ioctl, or do we need to do
>>>> the same low power check for ioctls also?
>>>> In the latter case, I am not sure how I should do the implementation
>>>> so that all other ioctls are covered from the vfio core layer itself.
>>>
>>> Some ioctls clearly cannot occur while the device is in low power, such
>>> as resets and interrupt control, but even less obvious things like
>>> getting region info require device access. Migration also provides a
>>> channel to device access. Do we want to manage a list of ioctls that
>>> are allowed in low power, or do we only want to allow the ioctl to exit
>>> low power?
>>>
>>
>> In a previous version of this patch, you mentioned that a safe ioctl
>> list will be tough to maintain. So, currently we wanted to only allow
>> the ioctl to exit low power.
>
> Yes, I'm still conflicted on how that would work.
>
>>> I'm also still curious how we're going to handle devices that cannot
>>> return to low power such as the self-refresh mode on the GPU. We can
>>> potentially prevent any wake-ups from the vfio device interface, but
>>> that doesn't preclude a wake-up via an external lspci. I think we need
>>> to understand how we're going to handle such devices before we can
>>> really complete the design.
>>> AIUI, we cannot disable the self-refresh
>>> sleep mode without imposing unreasonable latency and memory
>>> requirements on the guest and we cannot retrigger the self-refresh
>>> low-power mode without non-trivial device specific code. Thanks,
>>>
>>> Alex
>>>
>>
>> I am working on adding support to notify the guest through a virtual
>> PME whenever there is any wake-up triggered by the host while the guest
>> has already put the device into runtime suspended state. This virtual
>> PME will be similar to a physical PME. Normally, if a PCI device needs
>> a power management transition, then it sends a PME event, which will
>> ultimately be handled by the host OS. In the virtual PME case, if the
>> host needs a power management transition, then it sends an event to the
>> guest and the guest OS handles these virtual PME events. Following is a
>> summary:
>>
>> 1. Add support for one more event like VFIO_PCI_ERR_IRQ_INDEX,
>> named VFIO_PCI_PME_IRQ_INDEX, and add the required code for this
>> virtual PME event.
>>
>> 2. From the guest side, when the PME_IRQ is enabled, we will set an
>> event_fd for PME.
>>
>> 3. In the vfio driver, the PME support bits are already virtualized
>> and currently set to 0. We can set PME capability support for D3cold
>> so that in the guest, it looks like
>>
>> Capabilities: [60] Power Management version 3
>> 	Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
>> 	PME(D0-,D1-,D2-,D3hot-,D3cold+)
>>
>> 4. From the guest side, it can do PME enable (the PME_En bit in the
>> Power Management Control/Status Register), which will again be
>> virtualized.
>>
>> 5. When the host gets a request for resuming the device other than
>> from the low power ioctl, then the device pm usage count will be
>> incremented, the PME status (the PME_Status bit in the Power Management
>> Control/Status Register) will be set, and then we can signal the
>> event_fd.
>>
>> 6. In PCIe, PME events will be handled by the root port.
>> For
>> using the low power D3cold feature, it is required to create a virtual
>> root port on the hypervisor side, and when the hypervisor receives this
>> PME event, it can send a virtual interrupt to the root port.
>>
>> 7. If we take the example of the Linux kernel, then pcie_pme_irq()
>> will handle this and then do the runtime resume on the guest side.
>> Also, it will clear the PME status bit there. Then the guest can put
>> the device into suspended state again.
>>
>> 8. I did prototype changes in QEMU for the above logic and was
>> getting a wake-up in the guest whenever I did lspci on the host side.
>>
>> 9. Since currently only the NVIDIA GPU has this limitation of
>> requiring driver interaction each time before going into D3cold, we can
>> allow re-entry for other devices. We can have an NVIDIA vendor check
>> (along with the VGA/3D controller class code). In the future, if any
>> other device also has a similar requirement, then we can update this
>> list. For other devices, the host can put the device into D3cold again
>> in case of any wake-up.
>>
>> 10. In the vfio driver, we can put all these restrictions on
>> enabling PME and return an error if the user tries to make the low
>> power entry ioctl without enabling the PME related things.
>>
>> 11. The virtual PME can help in handling physical PME also, for all
>> devices. The PME logic is not dependent upon the NVIDIA GPU
>> restriction. If virtual PME is enabled by the hypervisor, then when a
>> physical PME wakes the device, it will resume on the guest side as
>> well.
>
> So if host accesses through things like lspci are going to wake the
> device and we can't prevent that, and the solution to that is to notify
> the guest to put the device back to low power, then it seems a lot less
> important to try to prevent the user from waking the device through
> random accesses. In that context, maybe we do simply wrap all accesses
> with pm_runtime_get/put() calls, which eliminates the problem of
> maintaining a list of safe ioctls in low power.
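The eventfd plumbing in steps 2 and 5-6 above can be sketched in userspace C. This is a hypothetical illustration, not actual vfio-pci or QEMU code; the function names are made up, while the two register constants match the PM capability layout in include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define PCI_PM_CTRL_PME_ENABLE	0x0100	/* PME_En */
#define PCI_PM_CTRL_PME_STATUS	0x8000	/* PME_Status (RW1C) */

/* Host side (step 5): on a resume that did not come from the low power
 * ioctl, latch the virtualized PME_Status bit and, if the guest enabled
 * PME (step 4), kick the eventfd it registered (step 2). Returns 1 if
 * the eventfd was signalled, 0 otherwise. */
static int host_resume_virtual_pme(uint16_t *pmcsr, int pme_eventfd)
{
	uint64_t one = 1;

	*pmcsr |= PCI_PM_CTRL_PME_STATUS;
	if (!(*pmcsr & PCI_PM_CTRL_PME_ENABLE))
		return 0;			/* guest did not enable PME */
	if (write(pme_eventfd, &one, sizeof(one)) != sizeof(one))
		return 0;
	return 1;
}

/* Hypervisor side (step 6): drain the eventfd and inject a virtual
 * interrupt into the emulated root port; the guest then resumes the
 * device and clears PME_Status, as pcie_pme_irq() does (step 7). */
static uint64_t drain_virtual_pme(int pme_eventfd)
{
	uint64_t count = 0;

	if (read(pme_eventfd, &count, sizeof(count)) != sizeof(count))
		return 0;
	return count;	/* number of pending virtual PME signals */
}
```

The PME_En gate mirrors step 10's restriction: with PME disabled by the guest, a host-side wake-up latches status but produces no guest notification.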
>

So wrapping all accesses with pm_runtime_get()/put() will only be
applicable for ioctls, correct? For config space, can we go with the
approach discussed earlier in which we return an error?

> I'd probably argue that whether to allow the kernel to put the device
> back to low power directly is a policy decision and should therefore be
> directed by userspace. For example the low power entry ioctl would
> have a flag to indicate the desired behavior and QEMU might have an
> on/off/[auto] vfio-pci device option which allows configuration of that
> behavior. The default auto policy might direct for automatic low-power
> re-entry except for NVIDIA VGA/3D class codes and other devices we
> discover that need it. This lets us have an immediate workaround for
> devices requiring guest support without a new kernel.
>

Yes, that is the better option. I will do the changes.

> This PME notification to the guest is really something that needs to be
> part of the base specification for user managed low power access due to
> these sorts of design decisions. Thanks,
>
> Alex
>

Yes. I will include this in my next patch series.

Regards,
Abhishek