On 6/2/2022 11:14 PM, Alex Williamson wrote:
> On Thu, 2 Jun 2022 17:22:03 +0530
> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>
>> On 6/1/2022 9:51 PM, Alex Williamson wrote:
>>> On Wed, 1 Jun 2022 15:19:07 +0530
>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>>
>>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:
>>>>> On Tue, 31 May 2022 16:43:04 -0300
>>>>> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
>>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:
>>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>>>>>
>>>>>>>>> 1. In the real use case, config or any other ioctl should not come
>>>>>>>>> along with the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>>>>>
>>>>>>>>> 2. Maintain some 'access_count' which will be incremented when we
>>>>>>>>> do any config space access or ioctl.
>>>>>>>>
>>>>>>>> Please don't open code locks - if you need a lock then write a proper
>>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>>>>>> is appropriate.
>>>>>>>>
>>>>>>>> Jason
>>>>>>>
>>>>>>> Thanks Jason for providing your inputs.
>>>>>>>
>>>>>>> In that case, should I introduce a new rw_semaphore (for example,
>>>>>>> power_lock) and move 'platform_pm_engaged' under 'power_lock'?
>>>>>>
>>>>>> Possibly, this is better than an atomic at least
>>>>>>
>>>>>>> 1. At the beginning of config space access or ioctl, we can take the
>>>>>>> lock
>>>>>>>
>>>>>>> 	down_read(&vdev->power_lock);
>>>>>>
>>>>>> You can also do down_read_trylock() here and bail out as you were
>>>>>> suggesting with the atomic.
>>>>>>
>>>>>> trylock doesn't have lock ordering rules because it can't sleep so it
>>>>>> gives a bit more flexibility when designing the lock ordering.
>>>>>>
>>>>>> Though userspace has to be able to tolerate the failure, or never make
>>>>>> the request.
>>>>>>
>>>>
>>>> Thanks Alex and Jason for providing your inputs.
>>>>
>>>> Using down_read_trylock() along with Alex's suggestion seems fine.
>>>> In the real use case, config space access should not happen when the
>>>> device is in low power state, so returning an error should not
>>>> cause any issue in this case.
>>>>
>>>>>>> 	down_write(&vdev->power_lock);
>>>>>>> 	...
>>>>>>> 	switch (vfio_pm.low_power_state) {
>>>>>>> 	case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>>>>> 		...
>>>>>>> 		vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>>>>> 		vdev->power_state_d3 = true;
>>>>>>> 		up_write(&vdev->memory_lock);
>>>>>>>
>>>>>>> 	...
>>>>>>> 	up_write(&vdev->power_lock);
>>>>>>
>>>>>> And something checks the power lock before allowing the memory to be
>>>>>> re-enabled?
>>>>>>
>>>>>>> 4. For ioctl access, as mentioned previously I need to add two
>>>>>>> callback functions (one for start and one for end) in the struct
>>>>>>> vfio_device_ops and call the same at the start and end of the ioctl
>>>>>>> from vfio_device_fops_unl_ioctl().
>>>>>>
>>>>>> Not sure I followed this..
>>>>>
>>>>> I'm kinda lost here too.
>>>>
>>>>
>>>> I have summarized the things below.
>>>>
>>>> 1. In the current patch (v3 8/8), if a config space access or ioctl
>>>> was made by the user when the device was already in low power state,
>>>> then it was waking the device. This wake-up was happening with the
>>>> pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>>>> vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
>>>>
>>>> 2. Now, it has been decided to return an error instead of waking the
>>>> device if the device is already in low power state.
>>>>
>>>> 3. Initially I thought to add the following code in the config space
>>>> path (and similar in ioctl)
>>>>
>>>> vfio_pci_config_rw() {
>>>> 	...
>>>> 	down_read(&vdev->memory_lock);
>>>> 	if (vdev->platform_pm_engaged) {
>>>> 		up_read(&vdev->memory_lock);
>>>> 		return -EIO;
>>>> 	}
>>>> 	...
>>>> }
>>>>
>>>> And then there was a possibility that the physical config access
>>>> happens when the device is in D3cold in case of a race condition.
>>>>
>>>> 4. So, I wanted to add some mechanism so that the low power entry
>>>> ioctl will be serialized with other ioctls or config space access.
>>>> With this, if low power entry gets scheduled first then config/other
>>>> ioctls will get a failure, otherwise low power entry will wait.
>>>>
>>>> 5. For serializing this access, I need to ensure that the lock is
>>>> held throughout the operation. For config space I can add the code in
>>>> vfio_pci_config_rw(). But for ioctls, I was not sure what is the best
>>>> way since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>>>> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE etc.) are being handled in the
>>>> vfio core layer itself.
>>>>
>>>> The memory_lock and the variables to track low power are specific to
>>>> vfio-pci, so I need some mechanism by which I add a low power check
>>>> for each ioctl. For serialization, I need to call a function
>>>> implemented in vfio-pci before the vfio core layer makes the actual
>>>> ioctl, to grab the locks. Similarly, I need to release the lock once
>>>> the vfio core layer finishes the actual ioctl. I have mentioned this
>>>> problem in the above point (point 4 in my earlier mail).
>>>>
>>>>> A couple replies back there was some concern
>>>>> about race scenarios with multiple user threads accessing the device.
>>>>> The ones concerning non-deterministic behavior if a user is
>>>>> concurrently changing power state and performing other accesses are a
>>>>> non-issue, imo.
>>>>
>>>> What does non-deterministic behavior mean here?
>>>> Is it on the user side, that the user will see a different result
>>>> (failure or success) during a race condition, or on the kernel side
>>>> (as explained in point 3 above, where the physical config access
>>>> happens when the device is in D3cold)?
>>>> My concern here is for the latter
>>>> part, where this config space access in D3cold can cause a fatal error
>>>> on the system side, as we have seen for memory disablement.
>>>
>>> Yes, our only concern should be to prevent such an access. The user
>>> seeing non-deterministic behavior, such as during concurrent power
>>> control and config space access, where all combinations of
>>> success/failure are possible, is par for the course when we decide to
>>> block accesses across the life of the low power state.
>>>
>>>>> I think our goal is only to expand the current
>>>>> memory_lock to block accesses, including config space, while the device
>>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
>>>>>
>>>>> I think the remaining issue is how to do that relative to the fact
>>>>> that config space access can change the memory enable state and would
>>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
>>>>> For that I think we can simply drop the read-lock, acquire the
>>>>> write-lock, and re-test the low power state. If it has changed, that
>>>>> suggests the user has again raced changing power state with another
>>>>> access and we can simply drop the lock and return -EIO.
>>>>>
>>>>
>>>> Yes, this looks like the better option. So, just to confirm: I can
>>>> take the memory_lock read-lock at the start of vfio_pci_config_rw()
>>>> and release it just before returning from vfio_pci_config_rw(), and
>>>> for memory-related config access, we will release this lock and
>>>> re-acquire the write version of it. Once the memory write happens,
>>>> can we then downgrade this write lock back to a read lock?
>>>
>>> We only need to lock for the device access, so if you've finished that
>>> access after acquiring the write-lock, there'd be no point to then
>>> downgrade that to a read-lock. The access should be finished by that
>>> point.
>>>
>>
>> I was planning to take the memory_lock read-lock at the beginning of
>> vfio_pci_config_rw() and release it just before returning from this
>> function. If I don't downgrade it back to a read-lock, then the
>> release at the end will be called on a lock which has not been taken.
>> Also, the user can specify a count of any number of bytes, in which
>> case vfio_config_do_rw() will be invoked multiple times, and the
>> second call would then run without the lock.
>
> Ok, yes, I can imagine how it might result in a cleaner exit path to do
> a downgrade_write().
>
>>>> Also, what about ioctls? How can I take and release memory_lock for
>>>> an ioctl? Is it okay to go with patch 7, where we call
>>>> pm_runtime_resume_and_get() before each ioctl, or do we need to do
>>>> the same low power check for ioctls also?
>>>> In the latter case, I am not sure how I should do the implementation
>>>> so that all other ioctls are covered from the vfio core layer itself.
>>>
>>> Some ioctls clearly cannot occur while the device is in low power, such
>>> as resets and interrupt control, but even less obvious things like
>>> getting region info require device access. Migration also provides a
>>> channel to device access. Do we want to manage a list of ioctls that
>>> are allowed in low power, or do we only want to allow the ioctl to exit
>>> low power?
>>>
>>
>> In a previous version of this patch, you mentioned that a safe ioctl
>> list will be tough to maintain. So, currently we wanted to only allow
>> the ioctl to exit low power.
>
> Yes, I'm still conflicted on how that would work.
>
>>> I'm also still curious how we're going to handle devices that cannot
>>> return to low power such as the self-refresh mode on the GPU. We can
>>> potentially prevent any wake-ups from the vfio device interface, but
>>> that doesn't preclude a wake-up via an external lspci. I think we need
>>> to understand how we're going to handle such devices before we can
>>> really complete the design.
>>> AIUI, we cannot disable the self-refresh
>>> sleep mode without imposing unreasonable latency and memory
>>> requirements on the guest and we cannot retrigger the self-refresh
>>> low-power mode without non-trivial device specific code. Thanks,
>>>
>>> Alex
>>>
>>
>> I am working on adding support to notify the guest through a virtual
>> PME whenever there is any wake-up triggered by the host while the guest
>> has already put the device into runtime suspended state. This virtual
>> PME will be similar to a physical PME. Normally, if a PCI device needs
>> a power management transition, then it sends a PME event, which will
>> ultimately be handled by the host OS. In the virtual PME case, if the
>> host needs a power management transition, then it sends an event to the
>> guest and the guest OS handles these virtual PME events. Following is a
>> summary:
>>
>> 1. Add support for one more event like VFIO_PCI_ERR_IRQ_INDEX,
>> named VFIO_PCI_PME_IRQ_INDEX, and add the required code for this
>> virtual PME event.
>>
>> 2. From the guest side, when the PME_IRQ is enabled, we will set an
>> event_fd for PME.
>>
>> 3. In the vfio driver, the PME support bits are already virtualized
>> and currently set to 0. We can set PME capability support for D3cold
>> so that in the guest, it looks like
>>
>> Capabilities: [60] Power Management version 3
>> 	Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
>> 	PME(D0-,D1-,D2-,D3hot-,D3cold+)
>>
>> 4. From the guest side, it can do PME enable (the PME_En bit in the
>> Power Management Control/Status Register), which will again be
>> virtualized.
>>
>> 5. When the host gets a request for resuming the device other than
>> from the low power ioctl, then the device pm usage count will be
>> incremented, the PME status (the PME_Status bit in the Power Management
>> Control/Status Register) will be set, and then we can signal the
>> event_fd.
>>
>> 6. In PCIe, PME events will be handled by the root port.
>> For
>> using the low power D3cold feature, it is required to create a virtual
>> root port on the hypervisor side, and when the hypervisor receives this
>> PME event, it can send a virtual interrupt to the root port.
>>
>> 7. If we take the example of the Linux kernel, then pcie_pme_irq()
>> will handle this and then do the runtime resume on the guest side.
>> Also, it will clear the PME status bit there. Then the guest can put
>> the device into suspended state again.
>>
>> 8. I did prototype changes in QEMU for the above logic and was
>> getting a wake-up in the guest whenever I did lspci on the host side.
>>
>> 9. Since currently only the NVIDIA GPU has this limitation of
>> requiring driver interaction each time before going into D3cold, we can
>> allow re-entry for other devices. We can have an NVIDIA vendor check
>> (along with the VGA/3D controller class code). In the future, if any
>> other device also has a similar requirement, then we can update this
>> list. For other devices, the host can put the device into D3cold again
>> in case of any wake-up.
>>
>> 10. In the vfio driver, we can put all these restrictions on
>> enabling PME and return an error if the user tries to make the low
>> power entry ioctl without enabling the PME related things.
>>
>> 11. The virtual PME can help in handling physical PME also, for all
>> devices. The PME logic is not dependent upon the NVIDIA GPU
>> restriction. If virtual PME is enabled by the hypervisor, then when a
>> physical PME wakes the device, it will resume on the guest side as
>> well.
>
> So if host accesses through things like lspci are going to wake the
> device and we can't prevent that, and the solution to that is to notify
> the guest to put the device back to low power, then it seems a lot less
> important to try to prevent the user from waking the device through
> random accesses. In that context, maybe we do simply wrap all accesses
> with pm_runtime_get/put() calls, which eliminates the problem of
> maintaining a list of safe ioctls in low power.
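The eventfd plumbing in steps 2 and 5-6 above can be sketched in userspace C. This is a hypothetical illustration, not actual vfio-pci or QEMU code; the function names are made up, while the two register constants match the PM capability layout in include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define PCI_PM_CTRL_PME_ENABLE	0x0100	/* PME_En */
#define PCI_PM_CTRL_PME_STATUS	0x8000	/* PME_Status (RW1C) */

/* Host side (step 5): on a resume that did not come from the low power
 * ioctl, latch the virtualized PME_Status bit and, if the guest enabled
 * PME (step 4), kick the eventfd it registered (step 2). Returns 1 if
 * the eventfd was signalled, 0 otherwise. */
static int host_resume_virtual_pme(uint16_t *pmcsr, int pme_eventfd)
{
	uint64_t one = 1;

	*pmcsr |= PCI_PM_CTRL_PME_STATUS;
	if (!(*pmcsr & PCI_PM_CTRL_PME_ENABLE))
		return 0;			/* guest did not enable PME */
	if (write(pme_eventfd, &one, sizeof(one)) != sizeof(one))
		return 0;
	return 1;
}

/* Hypervisor side (step 6): drain the eventfd and inject a virtual
 * interrupt into the emulated root port; the guest then resumes the
 * device and clears PME_Status, as pcie_pme_irq() does (step 7). */
static uint64_t drain_virtual_pme(int pme_eventfd)
{
	uint64_t count = 0;

	if (read(pme_eventfd, &count, sizeof(count)) != sizeof(count))
		return 0;
	return count;	/* number of pending virtual PME signals */
}
```

The PME_En gate mirrors step 10's restriction: with PME disabled by the guest, a host-side wake-up latches status but produces no guest notification.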
>

So wrapping all accesses with pm_runtime_get()/put() will only be
applicable for ioctls, correct? For config space, can we go with the
approach discussed earlier in which we return an error?

> I'd probably argue that whether to allow the kernel to put the device
> back to low power directly is a policy decision and should therefore be
> directed by userspace. For example the low power entry ioctl would
> have a flag to indicate the desired behavior and QEMU might have an
> on/off/[auto] vfio-pci device option which allows configuration of that
> behavior. The default auto policy might direct for automatic low-power
> re-entry except for NVIDIA VGA/3D class codes and other devices we
> discover that need it. This lets us have an immediate workaround for
> devices requiring guest support without a new kernel.
>

Yes, that is the better option. I will do the changes.

> This PME notification to the guest is really something that needs to be
> part of the base specification for user managed low power access due to
> these sorts of design decisions. Thanks,
>
> Alex
>

Yes. I will include this in my next patch series.

Regards,
Abhishek