On 1/28/2022 5:35 AM, Alex Williamson wrote: > External email: Use caution opening links or attachments > > > On Mon, 24 Jan 2022 23:47:24 +0530 > Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote: > >> If needs_pm_restore is set (PCI device does not have support for no >> soft reset), then the current PCI state will be saved during D0->D3hot >> transition and same will be restored back during D3hot->D0 transition. >> For saving the PCI state locally, pci_store_saved_state() is being >> used and the pci_load_and_free_saved_state() will free the allocated >> memory. >> >> But for reset related IOCTLs, vfio driver calls PCI reset related >> API's which will internally change the PCI power state back to D0. So, >> when the guest resumes, then it will get the current state as D0 and it >> will skip the call to vfio_pci_set_power_state() for changing the >> power state to D0 explicitly. In this case, the memory pointed by >> pm_save will never be freed. >> >> Also, in malicious sequence, the state changing to D3hot followed by >> VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and >> it can cause an OOM situation. This patch stores the power state locally >> and uses the same for comparing the current power state. For the >> places where D0 transition can happen, call vfio_pci_set_power_state() >> to transition to D0 state. Since the vfio power state is still D3hot, >> so this D0 transition will help in running the logic required >> from D3hot->D0 transition. Also, to prevent any miss during >> future development to detect this condition, this patch puts a >> check and frees the memory after printing warning. >> >> This locally saved power state will help in subsequent patches >> also. > > Ideally let's put fixes patches at the start of the series, or better > yet send them separately, and don't include changes that only make > sense in the context of a subsequent patch. > > Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition") > Thanks Alex for reviewing this patch. I have added Fixes tag and sent this patch separately. Should I update this patch series or you are planning to review the other patches first of this patch series first. >> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx> >> --- >> drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++-- >> include/linux/vfio_pci_core.h | 1 + >> 2 files changed, 51 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >> index c6e4fe9088c3..ee2fb8af57fa 100644 >> --- a/drivers/vfio/pci/vfio_pci_core.c >> +++ b/drivers/vfio/pci/vfio_pci_core.c >> @@ -206,6 +206,14 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev) >> * restore when returned to D0. Saved separately from pci_saved_state for use >> * by PM capability emulation and separately from pci_dev internal saved state >> * to avoid it being overwritten and consumed around other resets. >> + * >> + * There are few cases where the PCI power state can be changed to D0 >> + * without the involvement of this API. So, cache the power state locally >> + * and call this API to update the D0 state. It will help in running the >> + * logic that is needed for transitioning to the D0 state. For example, >> + * if needs_pm_restore is set, then the PCI state will be saved locally. >> + * The memory taken for saving this PCI state needs to be freed to >> + * prevent memory leak. >> */ >> int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state) >> { >> @@ -214,20 +222,34 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat >> int ret; >> >> if (vdev->needs_pm_restore) { >> - if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) { >> + if (vdev->power_state < PCI_D3hot && state >= PCI_D3hot) { >> pci_save_state(pdev); >> needs_save = true; >> } >> >> - if (pdev->current_state >= PCI_D3hot && state <= PCI_D0) >> + if (vdev->power_state >= PCI_D3hot && state <= PCI_D0) >> needs_restore = true; >> } >> >> ret = pci_set_power_state(pdev, state); >> >> if (!ret) { >> + vdev->power_state = pdev->current_state; >> + >> /* D3 might be unsupported via quirk, skip unless in D3 */ >> - if (needs_save && pdev->current_state >= PCI_D3hot) { >> + if (needs_save && vdev->power_state >= PCI_D3hot) { >> + /* >> + * If somehow, the vfio driver was not able to free the >> + * memory allocated in pm_save, then free the earlier >> + * memory first before overwriting pm_save to prevent >> + * memory leak. >> + */ >> + if (vdev->pm_save) { >> + pci_warn(pdev, >> + "Overwriting saved PCI state pointer so freeing the earlier memory\n"); >> + kfree(vdev->pm_save); >> + } > > The minimal fix for the described issue would simply be: > > kfree(vdev->pm_save); > > It seems like the only purpose of the warning is try to make sure we > haven't missed any wake-up calls, where this would be a pretty small > breadcrumb to actually debug such an issue. > I have removed the warning in the updated patch. >> + >> vdev->pm_save = pci_store_saved_state(pdev); >> } else if (needs_restore) { >> pci_load_and_free_saved_state(pdev, &vdev->pm_save); >> @@ -326,6 +348,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) >> /* For needs_reset */ >> lockdep_assert_held(&vdev->vdev.dev_set->lock); >> >> + /* >> + * If disable has been called while the power state is other than D0, >> + * then set the power state in vfio driver to D0. It will help >> + * in running the logic needed for D0 power state. The subsequent >> + * runtime PM API's will put the device into the low power state again. >> + */ >> + vfio_pci_set_power_state(vdev, PCI_D0); >> + > > I do think we have an issue here, but the reason is that pci_pm_reset() > returns -EINVAL if we try to reset a device that isn't currently in D0. > Therefore any path where we're triggering a function reset that could > use a PM reset and we don't know if the device is in D0, should wake up > the device before we try that reset. > > We're about to load the initial state of the device that was saved when > it was opened, so I don't think pdev->current_state vs > vdev->power_state matters here, we only care that the device is in D0. > I have added this point in the commit message of the updated patch. >> /* Stop the device from further DMA */ >> pci_clear_master(pdev); >> >> @@ -929,6 +959,15 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, >> >> vfio_pci_zap_and_down_write_memory_lock(vdev); >> ret = pci_try_reset_function(vdev->pdev); >> + >> + /* >> + * If pci_try_reset_function() has been called while the power >> + * state is other than D0, then pci_try_reset_function() will >> + * internally set the device state to D0 without vfio driver >> + * interaction. Update the power state in vfio driver to perform >> + * the logic needed for D0 power state. >> + */ >> + vfio_pci_set_power_state(vdev, PCI_D0); > > For the case where pci_try_reset_function() might use a PM reset, we > should set D0 before that call. In doing so, the pdev->current_state > should match the actual device power state, so we still don't need to > stash power state on the vdev. > I have set D0 state before calling pci_try_reset_function() in the updated patch. >> up_write(&vdev->memory_lock); >> >> return ret; >> @@ -2071,6 +2110,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, >> >> err_undo: >> list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { >> + /* >> + * If pci_reset_bus() has been called while the power >> + * state is other than D0, then pci_reset_bus() will >> + * internally set the device state to D0 without vfio driver >> + * interaction. Update the power state in vfio driver to perform >> + * the logic needed for D0 power state. >> + */ >> + vfio_pci_set_power_state(cur, PCI_D0); > > Here pci_reset_bus() will wakeup the device and I think the concern is > that around that bus reset we'll save and restore the device state, but > that's potentially bogus device state if waking it triggers a soft > reset. We could again wake devices before the reset to make the state > correct, or we could test pm_save and perform the load and restore if > it exists. Either of those would avoid needing to cache the power > state on the vdev. Thanks, > I have made the changes to wake-up the devices. Thanks Abhishek > Alex > >> if (cur == cur_mem) >> is_mem = false; >> if (cur == cur_vma) >> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h >> index aafe09c9fa64..05db838e72cc 100644 >> --- a/include/linux/vfio_pci_core.h >> +++ b/include/linux/vfio_pci_core.h >> @@ -124,6 +124,7 @@ struct vfio_pci_core_device { >> bool needs_reset; >> bool nointx; >> bool needs_pm_restore; >> + pci_power_t power_state; >> struct pci_saved_state *pci_saved_state; >> struct pci_saved_state *pm_save; >> int ioeventfds_nr; >