On 2/1/2022 1:41 AM, Alex Williamson wrote: > On Mon, 31 Jan 2022 16:54:50 +0530 > Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > >> If needs_pm_restore is set (PCI device does not have support for no >> soft reset), then the current PCI state will be saved during D0->D3hot >> transition and same will be restored back during D3hot->D0 transition. >> For saving the PCI state locally, pci_store_saved_state() is being >> used and the pci_load_and_free_saved_state() will free the allocated >> memory. >> >> But for reset related IOCTLs, vfio driver calls PCI reset related >> API's which will internally change the PCI power state back to D0. So, >> when the guest resumes, then it will get the current state as D0 and it >> will skip the call to vfio_pci_set_power_state() for changing the >> power state to D0 explicitly. In this case, the memory pointed by >> pm_save will never be freed. In a malicious sequence, the state changing >> to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be >> run in a loop and it can cause an OOM situation. >> >> Also, pci_pm_reset() returns -EINVAL if we try to reset a device that >> isn't currently in D0. Therefore any path where we're triggering a >> function reset that could use a PM reset and we don't know if the device >> is in D0, should wake up the device before we try that reset. >> >> This patch changes the device power state to D0 by invoking >> vfio_pci_set_power_state() before calling reset related API's. >> It will help in fixing the mentioned memory leak and making sure >> that the device is in D0 during reset. Also, to prevent any similar >> memory leak for future development, this patch frees memory first >> before overwriting 'pm_save'. >> >> Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition") >> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx> >> --- >> >> * Changes in v2 >> >> - Add the Fixes tag and sent this patch independently. >> - Invoke vfio_pci_set_power_state() before invoking reset related API's. >> - Removed saving of power state locally. >> - Removed warning before 'kfree(vdev->pm_save)'. >> - Updated comments and commit message according to updated changes. >> >> * v1 of this patch was sent in >> https://lore.kernel.org/lkml/20220124181726.19174-4-abhsahu@xxxxxxxxxx/ >> >> drivers/vfio/pci/vfio_pci_core.c | 27 +++++++++++++++++++++++++++ >> 1 file changed, 27 insertions(+) >> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >> index f948e6cd2993..d6dd4f7c4b2c 100644 >> --- a/drivers/vfio/pci/vfio_pci_core.c >> +++ b/drivers/vfio/pci/vfio_pci_core.c >> @@ -228,6 +228,13 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat >> if (!ret) { >> /* D3 might be unsupported via quirk, skip unless in D3 */ >> if (needs_save && pdev->current_state >= PCI_D3hot) { >> + /* >> + * If somehow, the vfio driver was not able to free the >> + * memory allocated in pm_save, then free the earlier >> + * memory first before overwriting pm_save to prevent >> + * memory leak. >> + */ >> + kfree(vdev->pm_save); >> vdev->pm_save = pci_store_saved_state(pdev); >> } else if (needs_restore) { >> pci_load_and_free_saved_state(pdev, &vdev->pm_save); >> @@ -322,6 +329,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) >> /* For needs_reset */ >> lockdep_assert_held(&vdev->vdev.dev_set->lock); >> >> + /* >> + * This function can be invoked while the power state is non-D0, >> + * Change the device power state to D0 first. > > I think we need to describe more why we're doing this than what we're > doing. We need to make sure the device is in D0 in case we have a > reset method that depends on that directly, ex. pci_pm_reset(), or > possibly device specific resets that may access device BAR resources. > I think it's placed here in the function so that the config space > changes below aren't overwritten by restoring the saved state and maybe > also because the set_irqs_ioctl() call might access device MMIO space. > Thanks Alex. I will add more details here in the comment. >> + */ >> + vfio_pci_set_power_state(vdev, PCI_D0); >> + >> /* Stop the device from further DMA */ >> pci_clear_master(pdev); >> >> @@ -921,6 +934,13 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, >> return -EINVAL; >> >> vfio_pci_zap_and_down_write_memory_lock(vdev); >> + >> + /* >> + * This function can be invoked while the power state is non-D0, >> + * Change the device power state to D0 before doing reset. >> + */ > > See below, reconsidering this... > >> + vfio_pci_set_power_state(vdev, PCI_D0); >> + >> ret = pci_try_reset_function(vdev->pdev); >> up_write(&vdev->memory_lock); >> >> @@ -2055,6 +2075,13 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, >> } >> cur_mem = NULL; >> >> + /* >> + * This function can be invoked while the power state is non-D0. >> + * Change power state of all devices to D0 before doing reset. >> + */ > > Here I have trouble convincing myself exactly what we're doing. As you > note in patch 1/ of the RFC series, pci_reset_bus(), or more precisely > pci_dev_save_and_disable(), wakes the device to D0 before reset, so we > can't be doing this only to get the device into D0. The function level > resets do the same. > > Actually, now I'm remembering and debugging where I got myself confused > previously with pci_pm_reset(). The scenario was a Windows guest with > an assigned Intel 82574L NIC. When doing a shutdown from the guest the > device is placed in D3hot and we enter vfio_pci_core_disable() in that > state. That function however uses __pci_reset_function_locked(), which > skips the pci_dev_save_and_disable() since much of it is redundant for > that call path (I think I generalized this to all flavors of > pci_reset_function() in my head). Thanks for providing the background related with the original issue. > > The standard call to pci_try_reset_function(), as in the previous > chunk, will make use of pci_dev_save_and_disable(), so for either of > these latter cases the concern cannot be simply having the device in D0, > we need a reason that we want the previously saved state restored on the > device before the reset, and thus restored to the device after the > reset as the rationale for the change. > I will add this as a comment. >> + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) >> + vfio_pci_set_power_state(cur, PCI_D0); >> + >> ret = pci_reset_bus(pdev); >> >> err_undo: > > > We also call pci_reset_bus() in vfio_pci_dev_set_try_reset(). In that > case, none of the other devices can be in use by the user, but they can > certainly be in D3hot with previous device state saved off into our > pm_save cache. If we don't have a good reason to restore in that case, > I'm wondering if we really have a good reason to restore in the above > two cases. > > Perhaps we just need the first chunk above to resolve the memory leak, First chunk means only the changes done in vfio_pci_set_power_state() which is calling kfree() before calling pci_store_saved_state(). Or I need to include more things in the first patch ? With the kfree(), the original memory leak issue should be solved. > and the second chunk as a separate patch to resolve the issue with > devices entering vfio_pci_core_disable() in non-D0 state. Sorry if I And this second patch will contain rest of the things where we will call vfio_pci_set_power_state() explicitly for moving to D0 state ? Also, We need to explore if setting to D0 state is really required at all these places and If it is not required, then we don't need second patch ? Thanks, Abhishek > mislead you that we had a more widespread issue with pci_pm_reset(), it > wasn't clear in my head until now that it was only the > __pci_reset_function_locked() caller that had the issue. Thanks, > > Alex >