On 6/8/2022 3:20 AM, Alex Williamson wrote:
> On Fri, 3 Jun 2022 15:49:27 +0530
> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>
>> On 6/2/2022 11:14 PM, Alex Williamson wrote:
>>> On Thu, 2 Jun 2022 17:22:03 +0530
>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>>
>>>> On 6/1/2022 9:51 PM, Alex Williamson wrote:
>>>>> On Wed, 1 Jun 2022 15:19:07 +0530
>>>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:
>>>>>>> On Tue, 31 May 2022 16:43:04 -0300
>>>>>>> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
>>>>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:
>>>>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>>>>>>>
>>>>>>>>>>> 1. In the real use case, config or any other ioctl should not come
>>>>>>>>>>> along with a VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>>>>>>>
>>>>>>>>>>> 2. Maintain some 'access_count' which will be incremented when we
>>>>>>>>>>> do any config space access or ioctl.
>>>>>>>>>>
>>>>>>>>>> Please don't open code locks - if you need a lock then write a proper
>>>>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>>>>>>>> is appropriate.
>>>>>>>>>>
>>>>>>>>>> Jason
>>>>>>>>>
>>>>>>>>> Thanks Jason for providing your inputs.
>>>>>>>>>
>>>>>>>>> In that case, should I introduce a new rw_semaphore (for example,
>>>>>>>>> power_lock) and move 'platform_pm_engaged' under 'power_lock'?
>>>>>>>>
>>>>>>>> Possibly, this is better than an atomic at least.
>>>>>>>>
>>>>>>>>> 1. At the beginning of a config space access or ioctl, we can take
>>>>>>>>> the lock:
>>>>>>>>>
>>>>>>>>> down_read(&vdev->power_lock);
>>>>>>>>
>>>>>>>> You can also do down_read_trylock() here and bail out as you were
>>>>>>>> suggesting with the atomic.
>>>>>>>>
>>>>>>>> trylock doesn't have lock ordering rules because it can't sleep, so it
>>>>>>>> gives a bit more flexibility when designing the lock ordering.
>>>>>>>>
>>>>>>>> Though userspace has to be able to tolerate the failure, or never make
>>>>>>>> the request.
>>>>>>>>
>>>>>>
>>>>>> Thanks Alex and Jason for providing your inputs.
>>>>>>
>>>>>> Using down_read_trylock() along with Alex's suggestion seems fine.
>>>>>> In the real use case, config space access should not happen when the
>>>>>> device is in a low power state, so returning an error should not
>>>>>> cause any issue in this case.
>>>>>>
>>>>>>>>> down_write(&vdev->power_lock);
>>>>>>>>> ...
>>>>>>>>> switch (vfio_pm.low_power_state) {
>>>>>>>>> case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>>>>>>> ...
>>>>>>>>> vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>>>>>>> vdev->power_state_d3 = true;
>>>>>>>>> up_write(&vdev->memory_lock);
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>> up_write(&vdev->power_lock);
>>>>>>>>
>>>>>>>> And something checks the power lock before allowing the memory to be
>>>>>>>> re-enabled?
>>>>>>>>
>>>>>>>>> 4. For ioctl access, as mentioned previously I need to add two
>>>>>>>>> callback functions (one for start and one for end) in struct
>>>>>>>>> vfio_device_ops and call them at the start and end of the ioctl from
>>>>>>>>> vfio_device_fops_unl_ioctl().
>>>>>>>>
>>>>>>>> Not sure I followed this..
>>>>>>>
>>>>>>> I'm kinda lost here too.
>>>>>>
>>>>>>
>>>>>> I have summarized the things below.
>>>>>>
>>>>>> 1. In the current patch (v3 8/8), if a config space access or ioctl was
>>>>>> being made by the user when the device was already in a low power state,
>>>>>> then it was waking the device. This wake-up was happening with the
>>>>>> pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>>>>>> vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
>>>>>>
>>>>>> 2. Now, it has been decided to return an error instead of waking the
>>>>>> device if the device is already in a low power state.
>>>>>>
>>>>>> 3.
>>>>>> Initially I thought to add the following code in the config space path
>>>>>> (and similar in the ioctl path):
>>>>>>
>>>>>> vfio_pci_config_rw() {
>>>>>> ...
>>>>>> down_read(&vdev->memory_lock);
>>>>>> if (vdev->platform_pm_engaged)
>>>>>> {
>>>>>> up_read(&vdev->memory_lock);
>>>>>> return -EIO;
>>>>>> }
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>> And then there was a possibility that the physical config access happens
>>>>>> when the device is in D3cold in case of a race condition.
>>>>>>
>>>>>> 4. So, I wanted to add some mechanism so that the low power entry
>>>>>> ioctl will be serialized with other ioctls or config space accesses.
>>>>>> With this, if low power entry gets scheduled first then config/other
>>>>>> ioctls will get a failure, otherwise low power entry will wait.
>>>>>>
>>>>>> 5. For serializing this access, I need to ensure that the lock is held
>>>>>> throughout the operation. For config space I can add the code in
>>>>>> vfio_pci_config_rw(). But for ioctls, I was not sure what the best way
>>>>>> is, since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>>>>>> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE, etc.) are being handled in the
>>>>>> vfio core layer itself.
>>>>>>
>>>>>> The memory_lock and the variables to track low power are specific to
>>>>>> vfio-pci, so I need some mechanism by which I add the low power check
>>>>>> for each ioctl. For serialization, I need to call a function implemented
>>>>>> in vfio-pci before the vfio core layer makes the actual ioctl to grab
>>>>>> the locks. Similarly, I need to release the lock once the vfio core
>>>>>> layer finishes the actual ioctl. I have mentioned this problem in the
>>>>>> above point (point 4 in my earlier mail).
>>>>>>
>>>>>>> A couple replies back there was some concern
>>>>>>> about race scenarios with multiple user threads accessing the device.
>>>>>>> The ones concerning non-deterministic behavior if a user is
>>>>>>> concurrently changing power state and performing other accesses are a
>>>>>>> non-issue, imo.
>>>>>>
>>>>>> What does non-deterministic behavior mean here?
>>>>>> Is it for the user side, that the user will see a different result
>>>>>> (failure or success) during a race condition, or for the kernel side
>>>>>> (as explained in point 3 above, where a physical config access
>>>>>> happens when the device is in D3cold)? My concern here is for the
>>>>>> latter part, where this config space access in D3cold can cause a
>>>>>> fatal error on the system side, as we have seen for memory disablement.
>>>>>
>>>>> Yes, our only concern should be to prevent such an access. The user
>>>>> seeing non-deterministic behavior, such as during concurrent power
>>>>> control and config space access, where all combinations of
>>>>> success/failure are possible, is par for the course when we decide to
>>>>> block accesses across the life of the low power state.
>>>>>
>>>>>>> I think our goal is only to expand the current
>>>>>>> memory_lock to block accesses, including config space, while the device
>>>>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
>>>>>>>
>>>>>>> I think the remaining issue is how to do that relative to the fact
>>>>>>> that config space access can change the memory enable state and would
>>>>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
>>>>>>> For that I think we can simply drop the read-lock, acquire the
>>>>>>> write-lock, and re-test the low power state. If it has changed, that
>>>>>>> suggests the user has again raced changing power state with another
>>>>>>> access and we can simply drop the lock and return -EIO.
>>>>>>>
>>>>>>
>>>>>> Yes. This looks like the better option. So, just to confirm, I can take
>>>>>> the memory_lock read-lock at the start of vfio_pci_config_rw() and
>>>>>> release it just before returning from vfio_pci_config_rw(), and
>>>>>> for memory related config access, we will release this lock and
>>>>>> re-acquire the write version of it.
>>>>>> Once the memory write happens,
>>>>>> can we then downgrade this write lock to a read lock?
>>>>>
>>>>> We only need to lock for the device access, so if you've finished that
>>>>> access after acquiring the write-lock, there'd be no point to then
>>>>> downgrade that to a read-lock. The access should be finished by that
>>>>> point.
>>>>>
>>>>
>>>> I was planning to take the memory_lock read-lock at the beginning of
>>>> vfio_pci_config_rw() and release the same just before returning from
>>>> this function. If I don't downgrade it back to a read-lock, then the
>>>> release at the end will be called for a lock which has not been taken.
>>>> Also, the user can specify a count of any number of bytes, and then
>>>> vfio_config_do_rw() will be invoked multiple times, and the second
>>>> call would be without the lock.
>>>
>>> Ok, yes, I can imagine how it might result in a cleaner exit path to do
>>> a downgrade_write().
>>>
>>>>>> Also, what about ioctls? How can I take and release memory_lock for
>>>>>> an ioctl? Is it okay to go with patch 7, where we call
>>>>>> pm_runtime_resume_and_get() before each ioctl, or do we need to do
>>>>>> the same low power check for ioctls also?
>>>>>> In the latter case, I am not sure how I should do the implementation
>>>>>> so that all other ioctls are covered from the vfio core layer itself.
>>>>>
>>>>> Some ioctls clearly cannot occur while the device is in low power, such
>>>>> as resets and interrupt control, but even less obvious things like
>>>>> getting region info require device access. Migration also provides a
>>>>> channel to device access. Do we want to manage a list of ioctls that
>>>>> are allowed in low power, or do we only want to allow the ioctl to exit
>>>>> low power?
>>>>>
>>>>
>>>> In the previous version of this patch, you mentioned that a safe ioctl
>>>> list will be tough to maintain. So, currently we want to
>>>> allow the ioctl only for low power exit.
>>>
>>> Yes, I'm still conflicted on how that would work.
>>>
>>>>> I'm also still curious how we're going to handle devices that cannot
>>>>> return to low power, such as the self-refresh mode on the GPU. We can
>>>>> potentially prevent any wake-ups from the vfio device interface, but
>>>>> that doesn't preclude a wake-up via an external lspci. I think we need
>>>>> to understand how we're going to handle such devices before we can
>>>>> really complete the design. AIUI, we cannot disable the self-refresh
>>>>> sleep mode without imposing unreasonable latency and memory
>>>>> requirements on the guest, and we cannot retrigger the self-refresh
>>>>> low-power mode without non-trivial device specific code. Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>
>>>> I am working on adding support to notify the guest through a virtual
>>>> PME whenever there is any wake-up triggered by the host and the guest
>>>> has already put the device into the runtime suspended state. This
>>>> virtual PME will be similar to a physical PME. Normally, if a PCI
>>>> device needs a power management transition, then it sends a PME event
>>>> which will ultimately be handled by the host OS. In the virtual PME
>>>> case, if the host needs a power management transition, then it sends an
>>>> event to the guest and the guest OS handles these virtual PME events.
>>>> Following is a summary:
>>>>
>>>> 1. Add support for one more event like VFIO_PCI_ERR_IRQ_INDEX,
>>>> named VFIO_PCI_PME_IRQ_INDEX, and add the required code for this
>>>> virtual PME event.
>>>>
>>>> 2. From the guest side, when the PME_IRQ is enabled, we will
>>>> set an eventfd for PME.
>>>>
>>>> 3. In the vfio driver, the PME support bits are already
>>>> virtualized and currently set to 0. We can set PME capability support
>>>> for D3cold so that in the guest, it looks like
>>>>
>>>> Capabilities: [60] Power Management version 3
>>>> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
>>>> PME(D0-,D1-,D2-,D3hot-,D3cold+)
>>>>
>>>> 4.
>>>> From the guest side, it can do PME enable (the PME_En bit in the Power
>>>> Management Control/Status Register), which will again be virtualized.
>>>>
>>>> 5. When the host gets a request for resuming the device other than from
>>>> the low power ioctl, then the device pm usage count will be incremented,
>>>> the PME status (the PME_Status bit in the Power Management
>>>> Control/Status Register) will be set, and then we can signal the
>>>> eventfd.
>>>>
>>>> 6. In PCIe, PME events will be handled by the root port. For
>>>> using the low power D3cold feature, it is required to create a virtual
>>>> root port on the hypervisor side, and when the hypervisor receives this
>>>> PME event, it can send a virtual interrupt to the root port.
>>>>
>>>> 7. If we take the example of the Linux kernel, then pcie_pme_irq() will
>>>> handle this and then do the runtime resume on the guest side. Also, it
>>>> will clear the PME status bit here. Then the guest can put the device
>>>> into the suspended state again.
>>>>
>>>> 8. I did prototype changes in QEMU for the above logic and was getting
>>>> a wake-up in the guest whenever I did lspci on the host side.
>>>>
>>>> 9. Since currently only the nvidia GPU has this limitation of requiring
>>>> driver interaction each time before going into D3cold, we can allow
>>>> re-entry for other devices. We can match on the nvidia vendor ID (along
>>>> with the VGA/3D controller class code). In the future, if any other
>>>> device also has a similar requirement, then we can update this list.
>>>> For other devices, the host can put the device into D3cold in case of
>>>> any wake-up.
>>>>
>>>> 10. In the vfio driver, we can put all these restrictions on
>>>> enabling PME and return an error if the user tries to make the low
>>>> power entry ioctl without enabling the PME related things.
>>>>
>>>> 11. The virtual PME can help in handling physical PME also for all
>>>> devices. The PME logic is not dependent upon the nvidia GPU
>>>> restriction.
>>>> If virtual PME is enabled by the hypervisor, then when
>>>> physical PME wakes the device, it will resume on the guest side
>>>> also.
>>>
>>> So if host accesses through things like lspci are going to wake the
>>> device and we can't prevent that, and the solution to that is to notify
>>> the guest to put the device back to low power, then it seems a lot less
>>> important to try to prevent the user from waking the device through
>>> random accesses. In that context, maybe we do simply wrap all accesses
>>> with pm_runtime_get/put() calls, which eliminates the problem of
>>> maintaining a list of safe ioctls in low power.
>>>
>>
>> So wrapping all accesses with pm_runtime_get()/put() will only be
>> applicable for ioctls, correct?
>> For config space, can we go with the approach discussed earlier in which
>> we return an error?
>
> If we need to handle arbitrarily induced wakes from the host, it
> doesn't make much sense to restrict those same sorts of accesses by the
> user through the vfio-device. It also seems a lot easier to simply do
> a pm_get/put() around not only ioctls, but all region accesses to avoid
> the sorts of races you previously identified. Access through mmap
> should still arguably fault, given that there is no discrete end to such
> an access like we have for read/write operations. Thanks,
>
> Alex
>

Thanks Alex for confirming. I will do the same.

Regards,
Abhishek
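[The direction settled on above, bracketing ioctls and region accesses with a runtime PM get/put pair, can be sketched as a small userspace model. All names here are invented for the sketch; the kernel side would call pm_runtime_resume_and_get() and pm_runtime_put() on the real device, with the PM core doing the actual resume/suspend.]

```c
#include <errno.h>
#include <pthread.h>

/* Toy stand-in for the runtime PM core's usage counter and device state. */
struct pm_model {
    pthread_mutex_t lock;
    int usage_count;     /* models the runtime PM usage counter */
    int is_suspended;    /* 1 while the device sits in low power */
};

/* Models pm_runtime_resume_and_get(): bump the count and resume. */
static int pm_get(struct pm_model *pm)
{
    pthread_mutex_lock(&pm->lock);
    pm->usage_count++;
    pm->is_suspended = 0;            /* first get resumes the device */
    pthread_mutex_unlock(&pm->lock);
    return 0;
}

/* Models pm_runtime_put(): drop the count; idle allows suspend again. */
static void pm_put(struct pm_model *pm)
{
    pthread_mutex_lock(&pm->lock);
    if (--pm->usage_count == 0)
        pm->is_suspended = 1;
    pthread_mutex_unlock(&pm->lock);
}

/* Region access wrapper: the device is guaranteed awake in the body,
 * which removes the need for a safe-ioctl list in low power. */
static int region_access(struct pm_model *pm)
{
    int ret = pm_get(pm);

    if (ret)
        return ret;
    ret = pm->is_suspended ? -EIO : 0;   /* must be awake here */
    pm_put(pm);
    return ret;
}
```

An mmap access has no discrete end, which is why it would still fault rather than being wrapped this way.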