On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
> DPC recovery involves ASIC reset just as normal GPU recovery so blosk

Again, typo: "blosk" --> "blocks".

> SW GPU schedulers and wait on all concurrent GPU resets.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
> Acked-by: Alex Deucher <alexander.deucher@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
>  1 file changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 43ce473..c569523 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4743,6 +4743,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
>  	return 0;
>  }
>
> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> +{
> +	int i;
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
> +	}
> +}
> +
>  /**
>   * amdgpu_pci_error_detected - Called when a PCI error is detected.
>   * @pdev: PCI device struct
> @@ -4756,15 +4770,37 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int i;
>
>  	DRM_INFO("PCI error: detected callback, state(%d)!!\n", state);
>
>  	switch (state) {
>  	case pci_channel_io_normal:
>  		return PCI_ERS_RESULT_CAN_RECOVER;
> -	case pci_channel_io_frozen:
> -		/* Fatal error, prepare for slot reset */
> -		amdgpu_device_lock_adev(adev);
> +	/* Fatal error, prepare for slot reset */
> +	case pci_channel_io_frozen:
> +		/*
> +		 * Cancel and wait for all TDRs in progress if failing to
> +		 * set adev->in_gpu_reset in amdgpu_device_lock_adev
> +		 *
> +		 * Locking adev->reset_sem will prevent any external access
> +		 * to GPU during PCI error recovery
> +		 */
> +		while (!amdgpu_device_lock_adev(adev, NULL))
> +			amdgpu_cancel_all_tdr(adev);

As in v1, I don't see how this is protected from polling forever.
"amdgpu_cancel_all_tdr()", defined above, doesn't actively cancel
anything--it just waits. "amdgpu_device_lock_adev()" similarly only
modifies a few variables. Neither of those two functions seems to
escalate or otherwise reset the hardware. If the reset/hw blocks for
whatever reason, this loop will hang forever.
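Even a bounded retry would be better than an unbounded spin. A rough,
untested sketch of what I mean, reusing the patch's
amdgpu_device_lock_adev(adev, NULL) and amdgpu_cancel_all_tdr() (the
retry count and the DISCONNECT escalation below are made up for
illustration only, not something this patch defines):

	int retries = 10;	/* arbitrary bound, illustration only */

	/*
	 * Retry the lock while waiting out in-flight TDR work, but give
	 * up eventually instead of spinning forever if a concurrent
	 * reset never releases adev->reset_sem.
	 */
	while (!amdgpu_device_lock_adev(adev, NULL)) {
		if (--retries == 0) {
			DRM_ERROR("PCI error: timed out waiting for GPU resets\n");
			return PCI_ERS_RESULT_DISCONNECT;
		}
		amdgpu_cancel_all_tdr(adev);
	}

A time-based bound would work just as well--the point is only that the
wait needs some exit path if the reset never completes.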
Regards,
Luben

> +
> +		/*
> +		 * Block any work scheduling as we do for regular GPU reset
> +		 * for the duration of the recovery
> +		 */
> +		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +			struct amdgpu_ring *ring = adev->rings[i];
> +
> +			if (!ring || !ring->sched.thread)
> +				continue;
> +
> +			drm_sched_stop(&ring->sched, NULL);
> +		}
>  		return PCI_ERS_RESULT_NEED_RESET;
>  	case pci_channel_io_perm_failure:
>  		/* Permanent error, prepare for device removal */
> @@ -4897,8 +4933,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> +	int i;
>
> -	amdgpu_device_unlock_adev(adev);
>
>  	DRM_INFO("PCI error: resume callback!!\n");
> +
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +
> +		drm_sched_resubmit_jobs(&ring->sched);
> +		drm_sched_start(&ring->sched, true);
> +	}
> +
> +	amdgpu_device_unlock_adev(adev);
>  }