Re: [PATCH v3 3/8] drm/amdgpu: Block all job scheduling activity during DPC recovery

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 8/31/20 5:00 PM, Luben Tuikov wrote:
On 2020-08-31 11:50 a.m., Andrey Grodzovsky wrote:
DPC recovery involves ASIC reset just as normal GPU recovery so blosk
Again, typo: "blosk" --> "blocks".

SW GPU schedulers and wait on all concurrent GPU resets.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Acked-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 +++++++++++++++++++++++++++---
  1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 43ce473..c569523 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4743,6 +4743,20 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
  	return 0;
  }
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+	int i;
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		cancel_delayed_work_sync(&ring->sched.work_tdr);
+	}
+}
+
  /**
   * amdgpu_pci_error_detected - Called when a PCI error is detected.
   * @pdev: PCI device struct
@@ -4756,15 +4770,37 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
  {
  	struct drm_device *dev = pci_get_drvdata(pdev);
  	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
DRM_INFO("PCI error: detected callback, state(%d)!!\n", state); switch (state) {
  	case pci_channel_io_normal:
  		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen:
-		/* Fatal error, prepare for slot reset */
-		amdgpu_device_lock_adev(adev);
+	/* Fatal error, prepare f
+	case pci_channel_io_frozen:		
+		/*		
+		 * Cancel and wait for all TDRs in progress if failing to
+		 * set  adev->in_gpu_reset in amdgpu_device_lock_adev
+		 *
+		 * Locking adev->reset_sem will prevent any external access
+		 * to GPU during PCI error recovery
+		 */
+		while (!amdgpu_device_lock_adev(adev, NULL))
+			amdgpu_cancel_all_tdr(adev);
As in v1, I don't see how this is protected from
polling forever. "amdgpu_cancel_all_tdr()" defined above,
doesn't actively cancel anything--it just waits.

"amdgpu_device_lock_adev()" similarly only modifies
a few variables.

Neither of those two functions seem to escalate or otherwise
reset the hardware. If the reset/hw blocks for whatever
reason, this loop will hang forever.

Regards,
Luben


But the fact that reset blocks for ever for some reason is not acceptable anyway - it's a bug we must handle and so I don't think our code here should be robust for this bug case because if we are in this scenario already we will have a lot of other issues anyway and the driver will stop being usable. Your comment made me see there is a question of possible starvation when trying to amdgpu_device_lock_adev because we could say that if the HW is dead and jobs keep coming you will have new job timeouts all the time and they could sneak in and hijack adev->in_gpu_reset so that PCIe error recovery never has a chance to run but, I looked into the code and that is not the case since drm_sched_select_entity->drm_sched_ready will return NULL for any scheduler that already has his HW ring full of unfinished jobs and so no new timeouts will be triggered so PCIe recovery code will have eventually his opportunity to lock the device and run.

Andrey



+
+		/*
+		 * Block any work scheduling as we do for regular GPU reset
+		 * for the duration of the recovery
+		 */
+		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+			struct amdgpu_ring *ring = adev->rings[i];
+
+			if (!ring || !ring->sched.thread)
+				continue;
+
+			drm_sched_stop(&ring->sched, NULL);
+		}
  		return PCI_ERS_RESULT_NEED_RESET;
  	case pci_channel_io_perm_failure:
  		/* Permanent error, prepare for device removal */
@@ -4897,8 +4933,21 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
  {
  	struct drm_device *dev = pci_get_drvdata(pdev);
  	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
- amdgpu_device_unlock_adev(adev); DRM_INFO("PCI error: resume callback!!\n");
+
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+
+		drm_sched_resubmit_jobs(&ring->sched);
+		drm_sched_start(&ring->sched, true);
+	}
+
+	amdgpu_device_unlock_adev(adev);
  }

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux