On 2/9/2022 2:56 AM, Alex Williamson wrote: > On Tue, 8 Feb 2022 00:50:08 +0530 > Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > >> On 2/2/2022 5:01 AM, Alex Williamson wrote: >>> On Tue, 1 Feb 2022 17:06:43 +0530 >>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: >>> >>>> On 2/1/2022 1:41 AM, Alex Williamson wrote: >>>>> On Mon, 31 Jan 2022 16:54:50 +0530 >>>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: >>>>> >>>>>> If needs_pm_restore is set (PCI device does not have support for no >>>>>> soft reset), then the current PCI state will be saved during D0->D3hot >>>>>> transition and same will be restored back during D3hot->D0 transition. >>>>>> For saving the PCI state locally, pci_store_saved_state() is being >>>>>> used and the pci_load_and_free_saved_state() will free the allocated >>>>>> memory. >>>>>> >>>>>> But for reset related IOCTLs, vfio driver calls PCI reset related >>>>>> API's which will internally change the PCI power state back to D0. So, >>>>>> when the guest resumes, then it will get the current state as D0 and it >>>>>> will skip the call to vfio_pci_set_power_state() for changing the >>>>>> power state to D0 explicitly. In this case, the memory pointed by >>>>>> pm_save will never be freed. In a malicious sequence, the state changing >>>>>> to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be >>>>>> run in a loop and it can cause an OOM situation. >>>>>> >>>>>> Also, pci_pm_reset() returns -EINVAL if we try to reset a device that >>>>>> isn't currently in D0. Therefore any path where we're triggering a >>>>>> function reset that could use a PM reset and we don't know if the device >>>>>> is in D0, should wake up the device before we try that reset. >>>>>> >>>>>> This patch changes the device power state to D0 by invoking >>>>>> vfio_pci_set_power_state() before calling reset related API's. >>>>>> It will help in fixing the mentioned memory leak and making sure >>>>>> that the device is in D0 during reset. Also, to prevent any similar >>>>>> memory leak for future development, this patch frees memory first >>>>>> before overwriting 'pm_save'. >>>>>> >>>>>> Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition") >>>>>> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx> >>>>>> --- >>>>>> >>>>>> * Changes in v2 >>>>>> >>>>>> - Add the Fixes tag and sent this patch independently. >>>>>> - Invoke vfio_pci_set_power_state() before invoking reset related API's. >>>>>> - Removed saving of power state locally. >>>>>> - Removed warning before 'kfree(vdev->pm_save)'. >>>>>> - Updated comments and commit message according to updated changes. >>>>>> >>>>>> * v1 of this patch was sent in >>>>>> https://lore.kernel.org/lkml/20220124181726.19174-4-abhsahu@xxxxxxxxxx/ >>>>>> >>>>>> drivers/vfio/pci/vfio_pci_core.c | 27 +++++++++++++++++++++++++++ >>>>>> 1 file changed, 27 insertions(+) >>>>>> >>>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >>>>>> index f948e6cd2993..d6dd4f7c4b2c 100644 >>>>>> --- a/drivers/vfio/pci/vfio_pci_core.c >>>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c >>>>>> @@ -228,6 +228,13 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat >>>>>> if (!ret) { >>>>>> /* D3 might be unsupported via quirk, skip unless in D3 */ >>>>>> if (needs_save && pdev->current_state >= PCI_D3hot) { >>>>>> + /* >>>>>> + * If somehow, the vfio driver was not able to free the >>>>>> + * memory allocated in pm_save, then free the earlier >>>>>> + * memory first before overwriting pm_save to prevent >>>>>> + * memory leak. >>>>>> + */ >>>>>> + kfree(vdev->pm_save); >>>>>> vdev->pm_save = pci_store_saved_state(pdev); >>>>>> } else if (needs_restore) { >>>>>> pci_load_and_free_saved_state(pdev, &vdev->pm_save); >>>>>> @@ -322,6 +329,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) >>>>>> /* For needs_reset */ >>>>>> lockdep_assert_held(&vdev->vdev.dev_set->lock); >>>>>> >>>>>> + /* >>>>>> + * This function can be invoked while the power state is non-D0, >>>>>> + * Change the device power state to D0 first. >>>>> >>>>> I think we need to describe more why we're doing this than what we're >>>>> doing. We need to make sure the device is in D0 in case we have a >>>>> reset method that depends on that directly, ex. pci_pm_reset(), or >>>>> possibly device specific resets that may access device BAR resources. >>>>> I think it's placed here in the function so that the config space >>>>> changes below aren't overwritten by restoring the saved state and maybe >>>>> also because the set_irqs_ioctl() call might access device MMIO space. >>>>> >>>> >>>> Thanks Alex. >>>> I will add more details here in the comment. >>>> >>>>>> + */ >>>>>> + vfio_pci_set_power_state(vdev, PCI_D0); >>>>>> + >>>>>> /* Stop the device from further DMA */ >>>>>> pci_clear_master(pdev); >>>>>> >>>>>> @@ -921,6 +934,13 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, >>>>>> return -EINVAL; >>>>>> >>>>>> vfio_pci_zap_and_down_write_memory_lock(vdev); >>>>>> + >>>>>> + /* >>>>>> + * This function can be invoked while the power state is non-D0, >>>>>> + * Change the device power state to D0 before doing reset. >>>>>> + */ >>>>> >>>>> See below, reconsidering this... >>>>> >>>>>> + vfio_pci_set_power_state(vdev, PCI_D0); >>>>>> + >>>>>> ret = pci_try_reset_function(vdev->pdev); >>>>>> up_write(&vdev->memory_lock); >>>>>> >>>>>> @@ -2055,6 +2075,13 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, >>>>>> } >>>>>> cur_mem = NULL; >>>>>> >>>>>> + /* >>>>>> + * This function can be invoked while the power state is non-D0. >>>>>> + * Change power state of all devices to D0 before doing reset. >>>>>> + */ >>>>> >>>>> Here I have trouble convincing myself exactly what we're doing. As you >>>>> note in patch 1/ of the RFC series, pci_reset_bus(), or more precisely >>>>> pci_dev_save_and_disable(), wakes the device to D0 before reset, so we >>>>> can't be doing this only to get the device into D0. The function level >>>>> resets do the same. >>>>> >>>>> Actually, now I'm remembering and debugging where I got myself confused >>>>> previously with pci_pm_reset(). The scenario was a Windows guest with >>>>> an assigned Intel 82574L NIC. When doing a shutdown from the guest the >>>>> device is placed in D3hot and we enter vfio_pci_core_disable() in that >>>>> state. That function however uses __pci_reset_function_locked(), which >>>>> skips the pci_dev_save_and_disable() since much of it is redundant for >>>>> that call path (I think I generalized this to all flavors of >>>>> pci_reset_function() in my head). >>>> >>>> Thanks for providing the background related with the original issue. >>>> >>>>> >>>>> The standard call to pci_try_reset_function(), as in the previous >>>>> chunk, will make use of pci_dev_save_and_disable(), so for either of >>>>> these latter cases the concern cannot be simply having the device in D0, >>>>> we need a reason that we want the previously saved state restored on the >>>>> device before the reset, and thus restored to the device after the >>>>> reset as the rationale for the change. >>>>> >>>> >>>> I will add this as a comment. >>>> >>>>>> + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) >>>>>> + vfio_pci_set_power_state(cur, PCI_D0); >>>>>> + >>>>>> ret = pci_reset_bus(pdev); >>>>>> >>>>>> err_undo: >>>>> >>>>> >>>>> We also call pci_reset_bus() in vfio_pci_dev_set_try_reset(). In that >>>>> case, none of the other devices can be in use by the user, but they can >>>>> certainly be in D3hot with previous device state saved off into our >>>>> pm_save cache. If we don't have a good reason to restore in that case, >>>>> I'm wondering if we really have a good reason to restore in the above >>>>> two cases. >>>>> >>>>> Perhaps we just need the first chunk above to resolve the memory leak, >>>> >>>> First chunk means only the changes done in vfio_pci_set_power_state() >>>> which is calling kfree() before calling pci_store_saved_state(). >>>> Or I need to include more things in the first patch ? >>> >>> Correct, first chunk as is the first change in the patch. Patch chunks >>> are delineated by the @@ offset lines. >>> >> >> Thanks for confirming this. >> >>>> >>>> With the kfree(), the original memory leak issue should be solved. >>>> >>>>> and the second chunk as a separate patch to resolve the issue with >>>>> devices entering vfio_pci_core_disable() in non-D0 state. Sorry if I >>>> >>>> And this second patch will contain rest of the things where >>>> we will call vfio_pci_set_power_state() explicitly for moving to >>>> D0 state ? >>> >>> At least the first one in vfio_pci_core_disable(), the others need >>> justification. >>> >> >> Yes. First one is needed. >> >>>> Also, We need to explore if setting to D0 state is really required at >>>> all these places and If it is not required, then we don't need second >>>> patch ? >>> >>> We need a second patch, I'm convinced that we don't otherwise wake the >>> device to D0 before we potentially get to pci_pm_reset() in >>> vfio_pci_core_disable(). It's the remaining cases of setting D0 that >>> I'm less clear on. If it's the case that we need to restore config >>> space any time a NoSoftRst- device is woken from D3hot and the state >>> saved and restored around the reset is meaningless otherwise, that's a >>> valid justification, but is it accurate? If so, we should recheck the >>> other case of calling pci_reset_bus() too. Thanks, >>> >>> Alex >>> >> >> I was analyzing this part in detail and added some debug prints and >> made user space program to understand it better. Also, I have gone >> through the patch 51ef3a004b1e (“vfio/pci: Restore device state on >> PM transition”). >> >> We have 2 cases here: >> >> 1. The devices which has NoSoftRst+ (needs_pm_restore is false). >> This case should work fine for all the cases (Apart from vfio_pci_core_disable()) >> without waking-up the device explicitly. >> >> 2. The devices which has NoSoftRst- (needs_pm_restore is true). >> >> For case 2, let’s consider following example: >> >> a. The device is in D3hot. >> b. User made VFIO_DEVICE_RESET ioctl. >> c. pci_try_reset_function() will be called which internally >> invokes pci_dev_save_and_disable(). >> d. pci_set_power_state(dev, PCI_D0) will be called first. >> e. pci_save_state() will happen then. >> >> Now, for the devices which has NoSoftRst-, >> the pci_set_power_state() should trigger soft reset and >> we may lose the original state at step (d) and this state >> cannot be restored. >> >> For example, lets assume the case, where SBIOS or host >> linux kernel (In the aspm.c) enables PCIe LTR setting for the >> PCIe device. When this soft reset will be triggered, then this >> LTR setting may be reset, and the device state saved at step (e) >> will also have this setting cleared so it cannot be restored. >> Same thing can be happened for other PCIe capabilities. Since the >> vfio driver only exposes limited enhanced capabilities to its user >> So, the vfio-driver user also won’t have option to save and >> restore these capabilities state and these original >> settings will be permanently lost. > > Yes, this is my concern, thanks for confirming. > >> So, it seems we need to always move the device explicitly to >> D0 state by calling vfio_pci_set_power_state() before >> any reset for the reset triggered by IOCTLs. This is mainly to >> preserve the state around soft reset. > > The other option would be to test for vdev->pm_save and if found do a > load-and-free + restore-state after reset. For simplicity, I'd tend to > favor your approach to wake the device with vfio_pci_set_power_state() > before reset. Either approach seems roughly equal to me. > >> For vfio_pci_dev_set_try_reset() also, we can have the above >> mentioned situation. The other functions/devices can be in D3hot >> state and the D0 transition can cause soft reset there also. >> For example, in my case, NVIDIA GPU has VGA (func 0) and audio >> (func 1) function. I added debug print to dump the current state >> before and after pci_reset_bus(). Before pci_reset_bus() the func 1 >> state was D3hot and after pci_reset_bus() the func 1 state got >> changed to D0. This pci_reset_bus() was called during closing of >> func 0 device so there are chances of soft reset for other >> function/devices. > > Ok, so we'll always wakeup devices for both the pci_reset_function() > class of resets and bus resets. This seems to logically fit with the > fix to wakeup the device on release, so shall we do patch 1 of the > fixes series includes the kfree only and patch 2 resolves all the cases > of waking devices before reset? Thanks, > > Alex > Thanks. I will make 2 patches and will send for review. Regards, Abhishek