Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state

On 6/1/2022 4:22 AM, Alex Williamson wrote:
> On Tue, 31 May 2022 16:43:04 -0300
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> 
>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:  
>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>   
>>>>>  1. In real use case, config or any other ioctl should not come along
>>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>  
>>>>>  2. Maintain some 'access_count' which will be incremented when we
>>>>>     do any config space access or ioctl.  
>>>>
>>>> Please don't open code locks - if you need a lock then write a proper
>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>> is appropriate.
>>>>
>>>> Jason  
>>>
>>>  Thanks Jason for providing your inputs.
>>>
>>>  In that case, should I introduce new rw_semaphore (For example
>>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?  
>>
>> Possibly, this is better than an atomic at least
>>
>>>  1. At the beginning of config space access or ioctl, we can take the
>>>     lock
>>>  
>>>      down_read(&vdev->power_lock);  
>>
>> You can also do down_read_trylock() here and bail out as you were
>> suggesting with the atomic.
>>
>> trylock doesn't have lock ordering rules because it can't sleep so it
>> gives a bit more flexibility when designing the lock ordering.
>>
>> Though userspace has to be able to tolerate the failure, or never make
>> the request.
>>

 Thanks Alex and Jason for providing your inputs.

 Using down_read_trylock() along with Alex's suggestion seems fine.
 In a real use case, config space access should not happen while the
 device is in a low power state, so returning an error should not
 cause any issue in this case.

>>>          down_write(&vdev->power_lock);
>>>          ...
>>>          switch (vfio_pm.low_power_state) {
>>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>                  ...
>>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>                          vdev->power_state_d3 = true;
>>>                          up_write(&vdev->memory_lock);
>>>
>>>          ...
>>>          up_write(&vdev->power_lock);  
>>
>> And something checks the power lock before allowing the memory to be
>> re-enabled?
>>
>>>  4.  For ioctl access, as mentioned previously I need to add two
>>>      callbacks functions (one for start and one for end) in the struct
>>>      vfio_device_ops and call the same at start and end of ioctl from
>>>      vfio_device_fops_unl_ioctl().  
>>
>> Not sure I followed this..
> 
> I'm kinda lost here too.


 I have summarized things below.

 1. In the current patch (v3 8/8), if the user made a config space access
    or ioctl while the device was already in a low power state, the
    device was woken up. This wake-up happened through the
    pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
    vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).

 2. Now, it has been decided to return an error instead of waking the
    device if the device is already in a low power state.

 3. Initially I thought of adding the following code in the config space
    path (and similar code in the ioctl path)

        vfio_pci_config_rw() {
            ...
            down_read(&vdev->memory_lock);
            if (vdev->platform_pm_engaged)
            {
                up_read(&vdev->memory_lock);
                return -EIO;
            }
            ...
        }

     But then there was a possibility that, in case of a race condition,
     the physical config access happens while the device is in D3cold.

 4.  So, I wanted to add some mechanism so that the low power entry
     ioctl is serialized with other ioctls and config space accesses.
     With this, if low power entry gets scheduled first, then config/other
     ioctls will fail; otherwise low power entry will wait.

 5.  For serializing this access, I need to ensure that the lock is held
     throughout the operation. For config space, I can add the code in
     vfio_pci_config_rw(). But for ioctls, I was not sure what the best
     way is, since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
     VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE etc.) are handled in the
     vfio core layer itself.

 The memory_lock and the variables that track low power are specific to
 vfio-pci, so I need some mechanism by which I can add the low power check
 for each ioctl. For serialization, I need to call a function implemented
 in vfio-pci to grab the locks before the vfio core layer performs the
 actual ioctl. Similarly, I need to release the lock once the vfio core
 layer has finished the actual ioctl. I mentioned this problem in the
 point above (point 4 in my earlier mail).
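
 To make the start/end callback idea more concrete, here is a small
 userspace sketch. Everything in it is hypothetical: ioctl_begin and
 ioctl_end are not existing members of struct vfio_device_ops, and the
 lock_depth counter only stands in for taking and releasing the real
 lock. It just shows the core layer bracketing every ioctl with hooks
 that the driver implements.

```c
#include <errno.h>
#include <stddef.h>

struct fake_device;

/* Hypothetical ops table; the begin/end hooks are an assumption. */
struct fake_device_ops {
	int  (*ioctl_begin)(struct fake_device *dev); /* take locks, check low power */
	void (*ioctl_end)(struct fake_device *dev);   /* release locks */
	long (*ioctl)(struct fake_device *dev, unsigned int cmd);
};

struct fake_device {
	const struct fake_device_ops *ops;
	int low_power;   /* non-zero while the device is in low power */
	int lock_depth;  /* stands in for a held lock, illustration only */
};

/* Core-layer dispatcher: brackets every ioctl with the driver hooks. */
static long fake_core_ioctl(struct fake_device *dev, unsigned int cmd)
{
	long ret;

	if (dev->ops->ioctl_begin) {
		ret = dev->ops->ioctl_begin(dev);
		if (ret)
			return ret;  /* hook refused: the ioctl is not run */
	}

	ret = dev->ops->ioctl(dev, cmd);

	if (dev->ops->ioctl_end)
		dev->ops->ioctl_end(dev);
	return ret;
}

/* Driver-side hooks, standing in for what vfio-pci would implement. */
static int fake_pci_ioctl_begin(struct fake_device *dev)
{
	if (dev->low_power)
		return -EIO;
	dev->lock_depth++;
	return 0;
}

static void fake_pci_ioctl_end(struct fake_device *dev)
{
	dev->lock_depth--;
}

static long fake_pci_ioctl(struct fake_device *dev, unsigned int cmd)
{
	(void)dev;
	return (long)cmd;  /* echo the command so the caller can see it ran */
}

static const struct fake_device_ops fake_pci_ops = {
	.ioctl_begin = fake_pci_ioctl_begin,
	.ioctl_end   = fake_pci_ioctl_end,
	.ioctl       = fake_pci_ioctl,
};
```

 With this shape, ioctls handled entirely in the core layer would still
 pass through the driver's low power check.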

> A couple replies back there was some concern
> about race scenarios with multiple user threads accessing the device.
> The ones concerning non-deterministic behavior if a user is
> concurrently changing power state and performing other accesses are a
> non-issue, imo.  

 What does non-deterministic behavior mean here?
 Is it on the user side, where the user will see different results
 (failure or success) during a race condition, or on the kernel side
 (as explained in point 3 above, where the physical config access
 happens while the device is in D3cold)? My concern here is the latter
 part, where this config space access in D3cold can cause a fatal error
 on the system side, as we have seen for memory disablement.

> I think our goal is only to expand the current
> memory_lock to block accesses, including config space, while the device
> is in low power, or some approximation bounded by the entry/exit ioctl.
> 
> I think the remaining issues is how to do that relative to the fact
> that config space access can change the memory enable state and would
> therefore need to upgrade the memory_lock read-lock to a write-lock.
> For that I think we can simply drop the read-lock, acquire the
> write-lock, and re-test the low power state.  If it has changed, that
> suggests the user has again raced changing power state with another
> access and we can simply drop the lock and return -EIO.
> 

 Yes. This looks like the better option. So, just to confirm, I can take
 the memory_lock read-lock at the start of vfio_pci_config_rw() and
 release it just before returning from vfio_pci_config_rw(), and
 for memory related config access, we will release this lock and
 re-acquire it as a write-lock. Once the memory write happens,
 we can then downgrade this write-lock to a read-lock?

 Also, what about ioctls? How can I take and release memory_lock for
 an ioctl? Is it okay to go with patch 7, where we call
 pm_runtime_resume_and_get() before each ioctl, or do we need to do the
 same low power check for ioctls also?
 In the latter case, I am not sure how I should do the implementation so
 that all other ioctls are covered from the vfio core layer itself.

 Thanks,
 Abhishek

> If I'm still misunderstanding, please let me know.  Thanks,
> 
> Alex
> 



