> From: Anthony DeRossi <ajderossi@xxxxxxxxx> > Sent: Sunday, November 6, 2022 6:45 AM > > vfio_pci_dev_set_needs_reset() inspects the open_count of every device > in the set to determine whether a reset is allowed. The current device > always has open_count == 1 within vfio_pci_core_disable(), effectively > disabling the reset logic. This field is also documented as private in > vfio_device, so it should not be used to determine whether other devices > in the set are open. > > Checking for vfio_device_set_open_count() > 1 on the device set fixes > both issues. > > After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set > infrastructure"), failure to create a new file for a device would cause > the reset to be skipped due to open_count being decremented after > calling close_device() in the error path. > > After commit eadd86f835c6 ("vfio: Remove calls to > vfio_group_add_container_user()"), releasing a device would always skip > the reset due to an ordering change in vfio_device_fops_release(). > > Failing to reset the device leaves it in an unknown state, potentially > causing errors when it is accessed later or bound to a different driver. > > This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3) > assigned to a Windows guest. After shutting down the guest, unbinding > the device from vfio-pci, and binding the device to amdgpu: > > [ 548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed! > [ 548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading > failed > [ 548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init > of IP block <psp> failed -22 > [ 548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed > [ 548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init > > Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure") > Fixes: eadd86f835c6 ("vfio: Remove calls to > vfio_group_add_container_user()") > Signed-off-by: Anthony DeRossi <ajderossi@xxxxxxxxx> Reviewed-by: Kevin Tian <kevin.tian@xxxxxxxxx>