Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

Andrey Grodzovsky <Andrey.Grodzovsky@xxxxxxx> · Tue, 19 Jan 2021 09:33:36 -0500

On 1/19/21 7:22 AM, Horace Chen wrote:
Fix a racing issue when jobs on 2 rings timeout simultaneously.

If 2 rings timed out at the same time, the amdgpu_device_gpu_recover
will be reentered. Then the adev->gmc.xgmi.head will be grabbed
by 2 local linked list, which may cause wild pointer issue in
iterating.

lock the device earily to prevent the node be added to 2 different
lists.

Signed-off-by: Horace Chen <horace.chen@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
  1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4d434803fb49..9574da3abc32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  	int i, r = 0;
  	bool need_emergency_restart = false;
  	bool audio_suspended = false;
+	bool get_dev_lock = false;
  
  	/*
  	 * Special case: RAS triggered and full reset isn't supported
@@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  	 * Build list of devices to reset.
  	 * In case we are in XGMI hive mode, resort the device list
  	 * to put adev in the 1st position.
+	 *
+	 * lock the device before we try to operate the linked list
+	 * if didn't get the device lock, don't touch the linked list since
+	 * others may iterating it.
  	 */
  	INIT_LIST_HEAD(&device_list);
  	if (adev->gmc.xgmi.num_physical_nodes > 1) {
  		if (!hive)
  			return -ENODEV;
-		if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
-			list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
-		device_list_handle = &hive->device_list;
+
+		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
+			get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
+			if (!get_dev_lock)
+				break;


What about unlocking back all the devices you already locked if the break
happens in the middle of the iteration ?
Note that at skip_recovery: we don't do it. BTW, i see this issue is already in 
the current code.

Also, maybe now it's better to separate the actual locking in 
amdgpu_device_lock_adev
from the other stuff going on there since I don't think you would wont to toggle 
stuff
like adev->mp1_state back and forth and also the function name is not 
descriptive of
the other stuff going on there anyway.

Andrey


+		}
+		if (get_dev_lock) {
+			if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
+				list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
+			device_list_handle = &hive->device_list;
+		}
  	} else {
-		list_add_tail(&adev->gmc.xgmi.head, &device_list);
-		device_list_handle = &device_list;
+		get_dev_lock = amdgpu_device_lock_adev(adev, hive);
+		tmp_adev = adev;
+		if (get_dev_lock) {
+			list_add_tail(&adev->gmc.xgmi.head, &device_list);
+			device_list_handle = &device_list;
+		}
+	}
+
+	if (!get_dev_lock) {
+		dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+					job ? job->base.id : -1);
+		r = 0;
+		/* even we skipped this reset, still need to set the job to guilty */
+		goto skip_recovery;
  	}
  
  	/* block all schedulers and reset given job's ring */
  	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-		if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
-			dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
-				  job ? job->base.id : -1);
-			r = 0;
-			goto skip_recovery;
-		}
-
  		/*
  		 * Try to put the audio codec into suspend state
  		 * before gpu reset started.
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx