In SRIOV, when host driver performs MODE 1 reset and notifies FLR to guest driver, there is a small chance that there is no job running on hw but the driver has not updated the pending list yet, causing the driver not respond the FLR request. Modify the has_job_running function to make sure if there is still running job. v2: Use amdgpu_fence_count_emitted to determine job running status. Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx> Signed-off-by: Shikang Fan <shikang.fan@xxxxxxx> --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index b3ca911e55d6..ea756eacebdc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -100,6 +100,7 @@ MODULE_FIRMWARE("amdgpu/navi12_gpu_info.bin"); #define AMDGPU_PCIE_INDEX_FALLBACK (0x38 >> 2) #define AMDGPU_PCIE_INDEX_HI_FALLBACK (0x44 >> 2) #define AMDGPU_PCIE_DATA_FALLBACK (0x3C >> 2) +#define AMDGPU_PENDING_JOB_TIMEOUT (1000000) static const struct drm_driver amdgpu_kms_driver; @@ -5222,15 +5223,19 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, } /** - * amdgpu_device_has_job_running - check if there is any job in mirror list + * amdgpu_device_has_job_running - check if there is any unfinished job * * @adev: amdgpu_device pointer * - * check if there is any job in mirror list + * check if there is any job running on the device when guest driver receives + * FLR notification from host driver. If there are still jobs running and not + * signaled after 1s, the hardware is most likely hung already, then the guest + * driver will not respond the FLR reset. Instead, let the job hit the timeout + * and guest driver then issue the reset request. */ bool amdgpu_device_has_job_running(struct amdgpu_device *adev) { - int i; + int i, j; struct drm_sched_job *job; for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { @@ -5239,11 +5244,12 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev) if (!amdgpu_ring_sched_ready(ring)) continue; - spin_lock(&ring->sched.job_list_lock); - job = list_first_entry_or_null(&ring->sched.pending_list, - struct drm_sched_job, list); - spin_unlock(&ring->sched.job_list_lock); - if (job) + for (j = 0; j < AMDGPU_PENDING_JOB_TIMEOUT; j++) { + if (!amdgpu_fence_count_emitted(ring)) + break; + udelay(1); + } + if (j == AMDGPU_PENDING_JOB_TIMEOUT) return true; } return false; -- 2.34.1