Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info

André Almeida <andrealmeid@xxxxxxxxxx> · Mon, 10 Mar 2025 18:53:48 -0300

On 01/03/2025 03:04, Raag Jadav wrote:
On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
Hi Raag,

On 2/28/25 11:58, Raag Jadav wrote:
On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
To notify userspace about which app (if any) caused the device to enter a
wedged state, make use of the drm_wedge_app_info parameter, filling it with
the app PID and name.

Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
   2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 00b9b87dafd8..e06adf6f34fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
   	atomic_set(&adev->reset_domain->reset_res, r);
-	if (!r)
-		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
+	if (!r) {
+		struct drm_wedge_app_info aux, *info = NULL;
+
+		if (job) {
+			struct amdgpu_task_info *ti;
+
+			ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
+			if (ti) {
+				aux.pid = ti->pid;
+				aux.comm = ti->process_name;
+				info = &aux;
+				amdgpu_vm_put_task_info(ti);
+			}
+		}
Is this guaranteed to be the guilty app and not some scheduled worker?

This is how amdgpu decides which app is the guilty one earlier in the code
as in the print:

     ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);

     "Process information: process %s pid %d thread %s pid %d\n"

So I think it's consistent with what the driver considers to be the guilty
process.

Sure, but with something like app_info we're kind of hinting to userspace
that an application was _indeed_ involved in the reset. Is that also guaranteed?

Is it possible that an application needlessly suffers from a false positive
scenario (reset due to other factors)?


I asked Alex Deucher on IRC about that and yes, there's a chance that
this is a false positive. However, in the majority of cases this is the
app that actually caused the hang. This is also what amdgpu does for GL
robustness and devcoredump, so it's consistent with how amdgpu deals
with this scenario, even if the mechanism is still not perfect.
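
Since the reported app is a best guess rather than a guarantee, a userspace
consumer would probably want to treat it as an optional hint. Below is a
minimal sketch of that idea, not part of the patch. It assumes the uevent
carries "WEDGED=<recovery>" (as drm_dev_wedged_event() emits today) plus two
extra keys for the app info; the key names "PID=" and "APP=" used here, the
struct wedge_report type and the parse_wedge_event() helper are all
hypothetical and only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace view of one wedged uevent. */
struct wedge_report {
	bool wedged;        /* a WEDGED= key was present */
	char recovery[32];  /* recovery method string, e.g. "none" */
	bool has_app;       /* app info present; a hint, may be a false positive */
	long pid;           /* -1 when no valid PID was reported */
	char comm[64];      /* process name, empty when not reported */
};

/* Copy a value into a fixed buffer, always NUL-terminated (may truncate). */
static void copy_value(char *dst, size_t dst_len, const char *src)
{
	snprintf(dst, dst_len, "%s", src);
}

/*
 * Parse a NULL-terminated array of "KEY=value" strings.
 * "PID=" and "APP=" are assumed key names, not confirmed ABI.
 * Returns true if the event is a wedged event at all.
 */
static bool parse_wedge_event(const char *const *env, struct wedge_report *out)
{
	bool has_pid = false;

	assert(env != NULL && out != NULL);
	memset(out, 0, sizeof(*out));
	out->pid = -1;

	for (; *env; env++) {
		if (strncmp(*env, "WEDGED=", 7) == 0) {
			out->wedged = true;
			copy_value(out->recovery, sizeof(out->recovery), *env + 7);
		} else if (strncmp(*env, "PID=", 4) == 0) {
			char *end;
			long pid = strtol(*env + 4, &end, 10);

			/* Accept only a fully numeric, positive PID. */
			if (end != *env + 4 && *end == '\0' && pid > 0) {
				out->pid = pid;
				has_pid = true;
			}
		} else if (strncmp(*env, "APP=", 4) == 0) {
			copy_value(out->comm, sizeof(out->comm), *env + 4);
		}
	}

	/* App info is optional: the driver may pass NULL when no job is known. */
	out->has_app = out->wedged && has_pid;
	return out->wedged;
}
```

A caller would check has_app before doing anything with pid or comm, and
even then use it for logging or reporting rather than as proof that the
process caused the reset.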